XML/Stax

Jackson 2.0: now with XML, too!

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released")

While Jackson is most well-known as a JSON processor, its data-binding functionality is not tied to JSON format.
Because of this, there have been developments to extend support for XML and related things with Jackson; in fact, support for using JAXB (Java Architecture for XML Binding) annotations has been included as an optional add-on since the earliest official Jackson versions.

But Jackson 2.0.0 significantly increases the scope of XML-related functionality.

1. Improvements to JAXB annotation support

Optional support for using JAXB annotations (package 'javax.xml.bind' in JDK) became its own GitHub project with 2.0.

Functionality is provided by com.fasterxml.jackson.databind.AnnotationIntrospector implementation 'com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector', which can be used in addition to (or instead of) the standard 'com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector'.

But beyond becoming a main-level project of its own, 2.0 adds to the already extensive support for JAXB annotations by:

Making @XmlJavaTypeAdapter work for Lists and Maps
Adding support for @XmlID and @XmlIDREF -- this was possible due to addition of Object Identity feature in core Jackson databind -- which basically means that Object Graphs (even cyclic ones) can be supported even if only using JAXB annotations.

The second feature (@XmlID, @XmlIDREF) has been the number one request for JAXB annotation support, and we are happy that it now works.
Canonical example of using this feature would be:

    @XmlAccessorType(XmlAccessType.FIELD)
    public class Employee
    {
        @XmlAttribute
        @XmlID
        protected String id;
     
        @XmlAttribute
        protected String name;
     
        @XmlIDREF
        protected Employee manager;
     
        @XmlElement(name="report")
        @XmlIDREF
        protected List<Employee> reports;
     
        public Employee() {
            reports = new ArrayList<Employee>();
        }
    }

where entries would be serialized such that the first reference to an Employee is serialized fully, and later references use value of 'id' field; conversely, when reading XML back, references get re-created using id values.
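The "fully the first time, by id afterwards" idea can be illustrated without any Jackson or JAXB classes. What follows is only a hand-rolled sketch on top of the JDK's Stax writer: the class `IdRefSketch`, its nested `Emp` type and the element names are made up for illustration, and neither the implementation nor the exact output shape should be taken as what Jackson produces.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class IdRefSketch {
    // Hypothetical stand-in for the Employee class above
    static class Emp {
        final String id;
        final String name;
        Emp manager;
        final List<Emp> reports = new ArrayList<>();

        Emp(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static String toXml(Emp root) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            write(w, "employee", root, new HashSet<String>());
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void write(XMLStreamWriter w, String elem, Emp e, Set<String> seen)
            throws XMLStreamException {
        w.writeStartElement(elem);
        if (!seen.add(e.id)) {
            // Already written in full earlier: emit just the id as a reference
            w.writeCharacters(e.id);
        } else {
            // First occurrence: write the whole object, then follow its references
            w.writeAttribute("id", e.id);
            w.writeAttribute("name", e.name);
            if (e.manager != null) {
                write(w, "manager", e.manager, seen);
            }
            for (Emp report : e.reports) {
                write(w, "report", report, seen);
            }
        }
        w.writeEndElement();
    }

    public static void main(String[] args) {
        Emp boss = new Emp("e1", "Alice");
        Emp bob = new Emp("e2", "Bob");
        bob.manager = boss;      // cycle: boss -> bob -> boss
        boss.reports.add(bob);
        System.out.println(toXml(boss));
    }
}
```

Tracking the set of ids already written is what keeps the cyclic boss/report graph from recursing forever; reading back would need the reverse lookup, from id to the already constructed object.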

2. XML databinding

Support for JAXB annotations may be useful when there is need to provide both JSON and XML representations of data. But to actually produce XML, you need to use something like JAXB or XStream.

Or do you?

One of the experimental new projects that the Jackson project started a while ago was something called "jackson-xml-databind".
After being developed for a while along with Jackson 1.8 and 1.9, it eventually morphed into project "jackson-dataformat-xml", hosted at GitHub.

With 2.0.0 we have further improved functionality and added tests; we have also worked with developers who have actually used this for production systems.
This means that the module is now considered fully supported and no longer an experimental add-on.

So let's have a look at how to use XML databinding.

The very first thing is to create the mapper object. Here we must use a specific sub-class, XmlMapper:

  XmlMapper xmlMapper = new XmlMapper();
  // internally will use an XmlFactory for parsers, generators

(note: this step differs from some other data formats, like Smile, which only require use of a custom JsonFactory sub-class, and can work with the default ObjectMapper -- XML is a bit trickier to support and thus we need to override some aspects of ObjectMapper)

With a mapper at hand, we can do serialization like so:

  public enum Gender { MALE, FEMALE };
  public class User {
    public Gender gender;
    public String name;
    public boolean verified;
    public byte[] image;
  }

  User user = new User(); // and configure
  String xml = xmlMapper.writeValueAsString(user);

and get XML like:

  <User>
    <gender>MALE</gender>
    <name>Bob</name>
    <verified>true</verified>
    <image>BARWJRRWRIWRKF01FK=</image>
  </User>

which we could read back as a POJO:

  User userResult = xmlMapper.readValue(xml, User.class);

But beyond basics, we can obviously use annotations for customizing some aspects, like element/attribute distinction, use of namespaces:

  @JacksonXmlRootElement(localName="custUser")
  public class CustomUser {
    @JacksonXmlProperty(namespace="http://test")
    public Gender gender;
    @JacksonXmlProperty(localName="myName")
    public String name;

    @JacksonXmlProperty(isAttribute=true)
    public boolean verified;
    public byte[] image;
  }

  // gives XML like:
  <custUser verified="true">
    <ns:gender xmlns:ns="http://test">MALE</ns:gender>
    <myName>Bob</myName>
    <image>BARWJRRWRIWRKF01FK=</image>
  </custUser>

Apart from this, all standard Jackson databinding features should work: polymorphic type handling, object identity for full object graphs (new with 2.0); even value conversions and base64 encoding!

3. Jackson-based XML serialization for JAX-RS ("move over JAXB!")

So far so good: we can produce and consume XML using powerful Jackson databinding. But the latest platform-level improvement in Java land is the use of JAX-RS implementations like Jersey. Wouldn't it be nice to make Jersey use Jackson for both JSON and XML? That would remove one previously necessary add-on library (like JAXB).

We think so too, which is why we created "jackson-jaxrs-xml-provider" project, which is the sibling of existing "jackson-jaxrs-json-provider" project.
As with the older JSON provider, by registering this provider you will get automatic data-binding to and from XML, using Jackson XML data handler explained in the previous section.

It is of course worth noting that Jersey (and RESTeasy, CXF) already provide XML databinding using other libraries (usually JAXB), so use of this provider is optional.
So why advocate use of Jackson-based variant? One benefit is good performance -- a bit better than JAXB, and much faster than XStream, as per jvm-serializer benchmark (performance is limited by the underlying XML Stax processor -- but Aalto is wicked fast, not much slower than Jackson).
But more important is simplification of configuration and code: it is all Jackson, so annotations can be shared, and all data-binding power can be used for both representations.

It is most likely that you find this provider useful if the focus has been on producing/consuming JSON, and XML is being added as a secondary addition. If so, this extension is a natural fit.

4. Caveat Emptor

4.1 Asymmetric: "POJO first"

It is worth noting that the main supported use case is that of starting with Java Objects, serializing them as XML, and reading such serialization back as Objects.
And the explicit goal is that ideally all POJOs that can be serialized as JSON should also be serializable (and deserializable back into same Objects) as XML.

But there is no guarantee that any given XML can be mapped to a Java Object: some can be, but not all.

This is mostly due to complexity of XML, and its inherent incompatibility with Object models ("Object/XML impedance mismatch"): for example, there is no counterpart to XML mixed content in Object world. Arbitrary sequences of XML elements are not necessarily supported; and in some cases explicit nesting must be used (as is the case with Lists, arrays).

This means that if you do start with XML, you need to be prepared for possibility that some changes are needed to format, or you need additional steps for deserialization to clean up or transform structures.

4.2 No XML Schema support, mixed content

Jackson XML functionality specifically has zero support for XML Schema. Although we may work in this area, and perhaps help in using XML Schemas for some tasks, your best bet currently is to use tools like XJC from JAXB project: it can generate POJOs from XML Schema.

Mixed content is also out of scope, explicitly. There is no natural representation for it; and it seems pointless to try to fall back to XML-specific representations (like DOM trees). If you need support for "XMLisms", you need to look for XML-centric tools.

4.3 Some root values problematic: Map, List

Although we try to support all Java Object types, there are some unresolved issues with "root values", values that are not referenced via POJO properties but are the starting point of serialization/deserialization. Maps are especially tricky, and we recommend that when using Maps and Lists, you use a wrapper root object, which then references Map(s) and/or List(s).

(it is worth noting that JAXB, too, has issues with Map handling in general: XML and Maps do not mesh particularly well, unlike JSON and Maps).

4.4 JsonNode not as useful as with JSON

Finally, Jackson Tree Model, as expressed by JsonNodes, does not necessarily work well with XML either. Problem here is partially general challenges of dealing with Maps (see above); but there is the additional problem that whereas POJO-based data binder can hide some of work-arounds, this is not the case with JsonNode.

So: you can deserialize all kinds of XML as JsonNodes; and you can serialize all kinds of JsonNodes as XML, but round-tripping might not work. If tree model is your thing, you may be better off using XML-specific tree models such as XOM, DOM4J, JDOM or plain old DOM.

5. Come and help us make it Even Better!

At this point we believe that Jackson provides a nice alternative for existing XML producing/consuming toolkits. But what will really make it the first-class package is Your Help -- with increased usage we can improve quality and further extend usability, ergonomics and design.

So if you are at all interested in dealing with XML, consider trying out Jackson XML functionality!

Posted by Tatu Saloranta at Tuesday, March 27, 2012 9:53 PM
Categories: Java, JSON, XML/Stax

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
It is the only open-source Java parser that can do non-blocking parsing (aka asynchronous, or async, parsing)

Former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast for parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor real documentation for use of non-blocking part of Aalto; and also because this part of functionality has been only completed fairly recently (while some parts of functionality were written up to two years ago, last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. Non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

Basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of parser using a stream or reader to read content, and blocking the thread if none is available, content is rather "pushed" to parser; and parser will give out processed events if there is enough content available. This is similar to how many C parsers work; as well as operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater).
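Since the JDK's zip codecs are mentioned as working the same way, here is a small sketch of that push style using java.util.zip.Inflater: the caller hands over input in chunks, and the codec reports when it needs more via needsInput() instead of blocking on a stream. The class and method names (`PushStyleSketch`, `inflateInChunks`) are made up for illustration; this shows the general pattern only, not anything Aalto-specific.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class PushStyleSketch {
    // Helper to produce some compressed input to feed later
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Pushes compressed bytes to the codec chunk by chunk; the codec never reads on its own
    static byte[] inflateInChunks(byte[] compressed, int chunkSize) {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        int ptr = 0;
        try {
            while (!inflater.finished()) {
                if (inflater.needsInput()) { // "not enough content yet": feed more
                    if (ptr >= compressed.length) {
                        throw new IllegalStateException("Truncated input");
                    }
                    int len = Math.min(chunkSize, compressed.length - ptr);
                    inflater.setInput(compressed, ptr, len);
                    ptr += len;
                }
                int n = inflater.inflate(buf); // may return 0 if more input is needed
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] doc = "<html>Very <b>simple</b> input document!</html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] back = inflateInChunks(compress(doc), 1); // feed a single byte at a time
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```

In a real non-blocking system the loop would return to the caller when more input is needed, rather than spinning; the point here is only that the codec never pulls data by itself.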

The main benefit of non-blocking operation is ability to process multiple XML input sources without having to allocate one thread per source, same benefit as that NIO has for basic web services. And in fact, having a non-blocking parser is something that could benefit non-blocking web services a lot: without such parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; exact number depends on platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, can benefit from this mode of operation.
In addition, another related benefit is that memory usage of non-blocking parser can be more closely bounded: since limited amount of input is buffered at any given point, amount of working memory can be more limited (at least when not forcing coalescing of XML text segments).

On the downside, code that uses non-blocking parsing can be slightly more complex to write; and given lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts doing initially. But it can be a very useful technique for a subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into parser, single byte at a time. In actual production use one would typically feed content block at a time; either fully read blocks, or chunks of contents as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by input source (Netty), or be input source specific.

  byte[] XML = "<html>Very <b>simple</b> input document!</html>".getBytes("UTF-8");
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(XML, inputPtr++, 1);
      if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
        feeder.endOfInput();
      }
    }
    // and once we have full event, we just dump out event type (for now)
    System.out.println("Got event of type: "+type);
    // could also just copy event as is, using Stax, or do any other normal non-blocking handling:
    // xmlStreamWriter.copyEventFromReader(asyncReader, false);
  } while (type != XMLStreamConstants.END_DOCUMENT);
  asyncReader.close();

And that's it. There are actually just a couple of additional things needed to do non-blocking parsing:

Use of regular Stax API, with just a single extension, introduction of new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned if there isn't enough content buffered to fully construct a token to return
Feeding of content using AsyncInputFeeder (instance of which is accessed via AsyncXMLStreamReader, extension of basic XMLStreamReader)
Indicating end-of-content via feeder when all content has been read

Which makes operation a bit more complicated than use of straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

Coalescing mode has not been implemented for non-blocking Stax. Since use of coalescing (of all adjacent text segments, as per Stax spec) is probably less important for non-blocking use cases than blocking ones (as it will increase need for buffering, and possibly increase latency), it was left as the last major piece to be completed.
There isn't yet non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to SAX API itself (one just has to call "XMLReader.parse()" multiple times); but even though it is based on same parser core as Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, non-blocking mode is less well tested than blocking mode: blocking mode can use full basic StaxTest suite, used successfully for years with Woodstox (and for Aalto for more than a year as well).

Posted by Tatu Saloranta at Saturday, March 12, 2011 3:49 PM
Categories: Java, Open Source, XML/Stax

Jackson: not just for JSON, Smile or BSON any more -- Now With XML, too!

(NOTE: see the newer article on "Jackson 2.0 with XML")

One of first significant new Jackson extension projects (result of Jackson 1.7 release which made it much easier to provide modular extensions) is jackson-xml-databind, hosted at GitHub. Although this extension is still in its pre-1.0 development phase, the latest released version is fully usable as is and is even in some limited production use by some brave developers (running on Google AppEngine, of all things!).

So it is probably a good idea to now give a brief overview of what this project is all about.

1. What is jackson-xml-databind?

Jackson-xml-databind comes in a small package (jar is only about 55 kB), and is used with Jackson data binding functionality (jackson-mapper jar). It provides basic replacement for JsonFactory, JsonParser and JsonGenerator components of Jackson Streaming API, and allows reading and writing of XML instead of JSON, in context of generic Jackson data binding functionality. In addition, core ObjectMapper is also sub-classed to provide customized versions of a couple of other provider types, so typically all usage is done by creating com.fasterxml.jackson.xml.XmlMapper instead of ObjectMapper, and using it for data binding.

2. What is it used for?

This package is used to read XML and convert it to POJOs, as well as to write POJOs as XML. In this respect it is very similar to JAXB (javax.xml.bind) package; and an alternative for many other Java XML data binding packages such as XStream and JibX. Given Jackson support for JAXB annotations, it can be especially conveniently used as a JAXB replacement in many cases.

Functionality supported is in some ways a subset of JAXB, and in other ways a superset: XML-specific functionality is more limited (no explicit support for XML Schema), but general data binding functionality is arguably more powerful (since it is full set of Jackson functionality).

Two obvious benefits of this package compared to JAXB or other existing XML data binding solutions (like XStream) are superior performance -- with fast Stax XML parser, this is likely the fastest data binding solution on Java platform (see jvm-serializers for results) -- and extensive and customizable data POJO conversion functionality, using all existing Jackson annotations and configuration options. The main downside currently is potential immaturity of the package; however, this only applies to interaction between mature XML packages (stax implementation) and Jackson data binder (which is also fairly mature at this point).

3. So how do I use it?

If you know how to use Jackson with JSON, you know almost everything you need to use this package. The only other thing you need to know is that there has to be a Stax XML parser/generator implementation available. While JDK 1.6 provides one implementation, your best bet is using something a bit more efficient, such as Woodstox or Aalto. Both should work fine; Aalto is the faster of the two, but Woodstox is a more mature choice. So you will probably want to include one of these Stax implementations when using jackson-xml-databind.

Other than this, all you need to do is to construct XmlMapper:

  // can also specify XmlFactory to use, to override Stax factories used
  XmlMapper mapper = new XmlMapper();

and use it like you would any other ObjectMapper, like so:

  User user = new User(); // from Jackson-in-five-minutes sample
  String xml = mapper.writeValueAsString(user);

and what you would get is something like:

<User>
  <name>
    <first>Joe</first>
    <last>Sixpack</last>
  </name>
  <verified>true</verified>
  <gender>MALE</gender>
  <userImage>AQIDBAU=</userImage>
</User>

which is equivalent of JSON serialization that would look like:

{
  "name":{
    "first":"Joe",
    "last":"Sixpack"
  },
  "verified":true,
  "gender":"MALE",
  "userImage":"AQIDBAU="
}

Pretty neat eh?

Oh, and reverse direction obviously works similarly:

  User user = mapper.readValue(xml, User.class);

There is really nothing extraordinary in its usage; just another way to use Jackson for slicing and dicing your POJOs.

4. Limitations

While existing version works pretty well in general, there are some limitations. These mostly stem from the basic difference between XML and JSON logical models; and specifically affect handling of Lists/arrays. XmlMapper for example only allows so-called "wrapped" lists (for now); meaning that there is one wrapper XML element for each List or array property, and separate element for each List item.

Compared to JAXB (and related to JAXB annotation support), no DOM support is included; meaning, it is not possible to use converters that take or produce DOM Elements.

With respect to Jackson functionality, while polymorphic type information does work, some combinations of settings may not work as expected.

And given project's pre-1.0 status, testing is not yet as complete as it needs to be, so other rough edges may also be found. But with help of user community I am sure we can polish these up pretty quickly.

5. Feedback time!

So what is needed most at this point? Users, usage, and resulting bug (or, possibly, success) reports! Seriously, more usage there is, faster we can get the project up to 1.0 release.

Happy hacking!

Posted by Tatu Saloranta at Monday, February 28, 2011 9:28 PM
Categories: Java, Open Source, XML/Stax

StaxMate 2.0.1 released; improved DOM-from-Stax, compatibility with default JDK 1.6 Stax implementation

Quick update from "XML world" -- in which I have spent much less time, due to explosive growth in JSON land: StaxMate 2.0.1 was just released.

1. StaxMate?

First question you might ask is "What the heck is StaxMate?". Fair enough -- given how little attention it has gotten, here is the main idea.

StaxMate is meant to offer "convenience of DOM with performance of Stax (or SAX)". Although Stax API was an improvement in usability for many use cases, it is still a rather low-level access API. StaxMate builds concept of "cursors" when reading content; and output context objects when writing content. Sample code and bit more in-depth explanation can be found from StaxMate Tutorial page; but basic idea is to offer better abstractions than simple flat event iterator. Sort of like how automatic transmission can simplify driving, compared to manual stick shift.

Working with cursors is typically similar to how DOM documents are traversed in simple top-down (recursive-descent) fashion: you start with root element, get child elements, locate more children, textual content and so forth. Same is done with StaxMate, with just one crucial limitation: all access must be done in document order (parent first, then children, in the order they are in XML document). If you need to retain some information, you will do it explicitly (attribute values from parents need to be accessed before child elements, for example). StaxMate will take care to synchronize access when you use child cursors, so you will never need to worry about skipping remaining siblings; you just can not access things in random order. Same is also true for output side; although there are ways to temporarily "freeze" output which does allow building content somewhat out-of-order, as necessary. This may be necessary for doing things like calculating parent attribute values based on content written for child elements.
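To make the "document order" constraint concrete, here is a sketch of a recursive-descent traversal written against the plain JDK Stax API (not the StaxMate cursor API, which is described in its tutorial). The class name `DocumentOrderSketch` and the "lang" attribute are made up for illustration. Note how the parent's attribute has to be read and passed along explicitly before descending, since it can not be accessed once the reader has moved on to child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DocumentOrderSketch {
    // Returns one entry per element, in document order, with its effective "lang" value
    static List<String> describe(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> out = new ArrayList<>();
            r.nextTag(); // advance to the root START_ELEMENT
            readElement(r, "unknown", out);
            r.close();
            return out;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void readElement(XMLStreamReader r, String inheritedLang, List<String> out)
            throws XMLStreamException {
        // Attributes must be read now, while still positioned at START_ELEMENT
        String lang = r.getAttributeValue(null, "lang");
        if (lang == null) {
            lang = inheritedLang; // information from parent, retained explicitly
        }
        out.add(r.getLocalName() + " lang=" + lang);
        // Children are handled strictly in the order they appear
        while (r.next() != XMLStreamConstants.END_ELEMENT) {
            if (r.getEventType() == XMLStreamConstants.START_ELEMENT) {
                readElement(r, lang, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
                "<doc lang=\"en\"><title>Hi</title><note lang=\"fi\">Moi</note></doc>"));
    }
}
```

Nothing is buffered beyond the values passed down the call stack, which is where the low memory overhead of this style of access comes from.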

The benefit of requiring access to be done in document order is that it means that there is no additional performance or memory overhead for keeping track of past content. Memory usage, therefore, is not very different from that of "raw" Stax parser or generator; same is true for performance. Overhead of DOM documents is often 3x - 5x that of streaming access; overhead of using StaxMate is typically in 10-20% range, sometimes even lower.

2. Fixes in 2.0.1

This patch release contains just 2 fixes, but both are quite important, so upgrade is strongly recommended.

First fix is to DOM-compatibility part (see "Reading DOM documents using Stax XML parser, StaxMate" for details on usage). It turns out that although building full DOM document worked fine with 2.0.0, there were issues if binding sub-trees; these issues should now be resolved.

Second fix is to interoperability with Stax parsers that do not implement Stax2 extension API (to date, Woodstox and Aalto do implement this, but not others; most notably, Sun Sjsxp which is the default Stax parser bundled with JDK 6). Although most operations work just fine, Typed Access accessors (getting XML element text as number, boolean value, enum) could cause state update to work incorrectly, leading to issues when accessing sequence of typed values. This has been resolved, by fixing the underlying problem in Stax2 API reference implementation library that StaxMate depends on (version 3.0.4 of the library contains fixes).

Posted by Tatu Saloranta at Saturday, November 20, 2010 11:45 AM
Categories: Java, Open Source, XML/Stax

Upgrading from Woodstox 3.x to 4.0

It has now been almost one year since Woodstox 4.0 was released.
Given this, it would be interesting to know how many Woodstox users continue using older versions, and how many have upgraded.

My guess (somewhat educated, too, based on bug reports and some statistics on Maven dependencies) is that adoption has been quite slow. I think this is primarily due to 3 things:

Older versions work well, and fulfill all current needs of the user
New functionality that 4.0 offers is not widely known, and/or is not (currently!) needed
There are concerns that because this is a major version upgrade, upgrade might not go smoothly.

I can not argue against (1): Woodstox has been a rather solid product since first official releases; and 3.2 in particular is a well-rounded rock solid XML processor (if you are using an earlier version, however, at least upgrade to latest 3.2 patch version, 3.2.9!).
And with respect to (2), I have covered most important pieces of new functionality, Typed Access API and Schema Validation.

But so far I have not written anything about incompatible changes between 3.2 and 4.0 versions. So let's rectify that omission.

1. Why Upgrade?

But first: maybe it is worth reiterating a couple of reasons why you might want to upgrade at all:

You might want to validate XML documents you read or write against W3C Schema (aka XML Schema). Earlier versions only allowed validating against DTDs and Relax NG schemas
If you want to access typed content -- that is, numbers, XML qualified names, even binary content, contained as XML text -- new Typed Access API simplifies code a lot, and also makes it more efficient.
Latest versions of useful helper libraries like StaxMate require Woodstox 4.0 (StaxMate 2.0 needs 4.x, for example)
No new development will be done for 3.2 branch; and eventually not even bug fixes.
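For a sense of what typed access replaces, here is a sketch of reading a numeric element value "by hand" with only the standard Stax API bundled with the JDK. The class and method names are made up for illustration. With the Typed Access API of Stax2, the get-text, trim and parse steps are meant to collapse into a single accessor call on the stream reader (such as getElementAsInt()), which can also avoid building the intermediate String.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ManualTypedAccess {
    // Reads the text content of the root element and converts it to an int, manually
    static int readIntElement(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            r.nextTag(); // advance to the root START_ELEMENT
            // Without typed access: get text, trim white space, parse
            int value = Integer.parseInt(r.getElementText().trim());
            r.close();
            return value;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readIntElement("<count> 42 </count>"));
    }
}
```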

Assuming you might want to upgrade, what possible issues could you face?

2. Backwards incompatible changes since 3.2

Based on my own experiences, there are few issues with upgrade. Although the official list of incompatibilities has a few entries, I have only really noticed one class of things that tend to fail: Unit tests!

Sounds bad? Actually, yes and no: no, because these are not real failures (ones I have seen). And yes, since it means that you end up fixing broken test code (extra overhead without tangible benefits). But this is one of challenges with unit tests: fragility is often desirable, but not always so.

Specific problem that I have seen multiple times is related to one cosmetic aspect of XML: inclusion of white space with elements.

Woodstox 3.2 used to output empty elements with "extra" white space, like so (element name here is just for illustration):

  <empty />

but 4.0 will not add this white space:

  <empty/>

(this is a new feature as per WSTX-125 Jira entry)

and so some existing unit tests for systems I have worked on compare literal XML for output tests. This is not optimal, but it is a bit less work than writing tests in a more robust way, to check for logical (not physical) equality. So whereas they formerly assumed existence of such white space, tests need to be modified not to expect it (or allow either way).
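Checking for logical rather than physical equality does not need much code. A minimal sketch using only the DOM parser bundled with the JDK follows; the names `LogicalXmlCompare` and `sameXml` are made up for illustration, and a dedicated XML comparison library may well be a better fit for a real test suite.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LogicalXmlCompare {
    // Compares two XML documents by structure, not by their literal characters
    static boolean sameXml(String xml1, String xml2) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document d1 = builder.parse(new InputSource(new StringReader(xml1)));
            Document d2 = builder.parse(new InputSource(new StringReader(xml2)));
            // isEqualNode checks names, attributes and children recursively
            return d1.getDocumentElement().isEqualNode(d2.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // White space inside the empty-element tag is not part of the logical content
        System.out.println(sameXml("<root><empty /></root>", "<root><empty/></root>"));
    }
}
```

A test written this way passes with both the 3.2 and the 4.0 style of output. Note that white space between elements is still significant for this comparison.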

3. Other challenges?

Actually, I have not seen any actual problems, or other cosmetic problems. But here are other changes that are most likely to cause compatibility problems (refer to the full list mentioned earlier for a couple of changes that are much less likely to do so):

"Default namespace" and "no prefix" are now consistently reported as empty Strings, not nulls (unless explicitly specified otherwise in relevant Stax/Stax2 Javadocs). Usually this does not cause problems, because Stax-dependent code has had to deal with inconsistencies with other Stax implementations; but could cause problems if code is expecting null.
"IS_COALESCING" was (accidentally) enabled for Woodstox versions prior to 4.0. This was fixed for 4.0 (as per Stax specification), but it is possible that some code was relying on never getting partial text segments (if the developer was not aware of Stax allowing such splitting of segments, similar to how SAX API does it).

4. Upgrade or not?

I would recommend investigating upgrade; if for nothing else, because of maintenance aspect. Pre-4.0 versions will not be actively maintained in future. But it is good to be aware of what has changed, and of course having good set of unit tests should guard against unexpected problems.
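As a sketch of code that guards against the two Stax-level changes listed in the previous section, the following uses only the Stax implementation bundled with the JDK (not Woodstox); the class and method names are made up for illustration. One helper treats null and empty String alike for prefixes and namespace URIs; the other collects text without assuming that it arrives as a single segment, so it gives the same result with coalescing enabled or disabled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxCompatSketch {
    // Works whether "no prefix" / "default namespace" is reported as null or as ""
    static boolean isMissing(String prefixOrNamespace) {
        return prefixOrNamespace == null || prefixOrNamespace.isEmpty();
    }

    // Returns text content of the document, one entry per segment the parser reports
    static List<String> textSegments(String xml, boolean coalescing) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
            XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
            List<String> segments = new ArrayList<>();
            while (r.hasNext()) {
                int type = r.next();
                if (type == XMLStreamConstants.CHARACTERS || type == XMLStreamConstants.CDATA) {
                    segments.add(r.getText());
                }
            }
            r.close();
            return segments;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Robust variant: joins segments instead of assuming there is exactly one
    static String allText(String xml, boolean coalescing) {
        return String.join("", textSegments(xml, coalescing));
    }

    public static void main(String[] args) {
        String xml = "<a>one<![CDATA[two]]>three</a>";
        System.out.println(textSegments(xml, false).size() + " segment(s) without coalescing");
        System.out.println(textSegments(xml, true).size() + " segment(s) with coalescing");
        System.out.println(allText(xml, false));
    }
}
```

How many segments a non-coalescing parser reports is up to the implementation, which is exactly why code should not depend on the count.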

And hey, it's soon 2010 -- Woodstox 3.2 is soooo 2008. :-)

Posted by Tatu Saloranta at Thursday, December 31, 2009 10:33 PM
Categories: Java, XML/Stax

Data Format anti-patterns: converting between secondary artifacts (like xml to json)

One commonly asked but fundamentally flawed question is "how do I convert xml to json" (or vice versa).
Given frequency at which I have encountered it, it probably ranks high on list of data format anti-patterns.

And just to be clear: I don't mean that there is any problem in having (or wanting to have) systems that produce data using multiple alternative data formats (views, representations). Quite on contrary: ability to do so is at core of REST(-like) web services, which are one useful form of web services. Rather, I think it is wrong to convert between such representations.

1. Why is it Anti-pattern?

Simply put: you should never convert from secondary (non-authoritative) representation into another such representation. Rather, you should render your source data (which is usually in relational model, or objects) into such secondary formats. So: if you need xml, map your objects to xml (using JAXB or XStream or what you have); if you need JSON, map it using Jackson. And ditto for the reverse direction.

This of course implies that there are cases where such transformation might make sense: namely, when your data storage format is XML (Native Xml DBs) or Json (CouchDB). In those cases you just have to worry about the practical problem of model/format impedance, similar to what happens when doing Object-Relational Mapping (ORM).

2. Ok: simple case is simple, but how about multiple mappings?

Sometimes you do need multi-step processing; for example, if your data lives in the database. Following my earlier suggestion, it would seem like you should convert directly from relational model (storage format) into resulting transfer format (json or xml). Ideally, yes: if there are such conversions. But in practice it is more likely that a two-phase mapping (ORM from database to objects; and then from objects to xml or json) works better: mostly because there are good tools for separate phases, but fewer that would do the end-to-end rendition.

Is this wrong? No. To understand why, it is necessary to understand 3 classes of formats that we are talking about:

Persistence (storage) format, used for storing your data: usually relational model but can be something else as well (objects for object DBs; XML for native XML databases)
Processing format: Objects or structs of your processing language (POJOs for Java) that you use for actual processing. Occasionally this can also be something more exotic; like XML when using XSLT (or relational data for complicated reporting queries)
Transfer format: Serialization format used to transfer data between end points (or sometimes time-shifting, saving state over restart); may be closely bound to processing format (as is the case for Java serialization)

So what I am really saying is that you should not transform within a class of formats; in this case between 2 alternate transfer formats. It is acceptable (and often sensible) to do conversions between classes of formats; and sometimes doing 2 transforms is simpler than trying to do one bigger one. Just not within a class.

3. Three Formats may be simpler than Just One

One more thing about above-mentioned three formats: there is also a related fallacy of thinking that there is a problem if you are using multiple formats/models (like relational model for storage, objects for processing and xml or json for transfer). Assumption is that additional transformations needed to convert between representations are wasteful enough to be a problem in and of themselves. But it should be rather obvious why there are often distinct models and formats in use: because each is optimal for a specific use case. Storage format is good for, gee, storing data; processing model is good for efficiently massaging data; and transfer format is good for piping it through the wire. As long as you don't add gratuitous conversions in-between, transforming on boundary is completely sensible; especially considering the alternative of trying to find a single model that works for all cases. One only needs to consider case of "XML for everything" cluster (esp. XML for processing, aka XSLT) to see why this is an approach that should be avoided (or, Java serialization as transfer format -- that is another anti-pattern in and of itself).

Posted by Tatu Saloranta at Wednesday, October 28, 2009 10:22 PM
Categories: JSON, XML/Stax

Critical updates: Woodstox 4.0.6 released

This just in: Woodstox 4.0.6 was released, and it contains just one fix; but that one to a critical problem (text content truncation for long CDATA sections, when using XMLStreamReader.getElementText()). Upgrade is highly recommended for anyone using earlier 4.0 releases.

One more potentially useful addition is that I uploaded "relocation" Maven pom, for non-existing artifact "wstx-asl" v4.0.6 (the real id is "woodstox-core-asl", as of 4.0; "wstx-asl" was used with 3.2 and previous). This was suggested by a user, to make upgrade a bit less painful -- problem is that Woodstox tends to be one of those ubiquitous transitive dependencies for anyone running a Soap service (or nowadays almost any server-side XML processing system).

Next big thing should then be Jackson-1.3, stay tuned!

Posted by Tatu Saloranta at Thursday, October 01, 2009 9:23 PM
Categories: Java, XML/Stax

Typed Access API tutorial, part III/b: binary data, server-side

(note: this is part B of "Typed Access API tutorial: binary data"; first part can be found here)

1. Server-side

After implementing the client, let's next implement matching sample service that simply reads all files from a directory and creates download message that contains all files along with checksums for verifying their correctness (in real use case, those would probably be pre-computed). Simplest way to deploy service is as a Servlet-based web application; a single class and matching web.xml will do the trick.

Resulting code is meant to just show how (relatively) simple handling of binary data is -- obviously a real client and service would have much more checking for error cases, as well as for authentication, authorization, namespacing to avoid collision and so on.

Full source code can be found from Woodstox source code repository (see 'src/samples/BinaryService.java') but here is the beef:

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException
    {
        resp.setContentType("text/xml");
        try {
            writeFileContentsAsXML(resp.getOutputStream());
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    final static String DIGEST_TYPE = "SHA"; 

    private void writeFileContentsAsXML(OutputStream out)
        throws IOException, XMLStreamException
    {
        XMLStreamWriter2 sw = (XMLStreamWriter2) _xmlOutputFactory.createXMLStreamWriter(out);
        sw.writeStartDocument();
        sw.writeStartElement("files");
        byte[] buffer = new byte[4000];
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(DIGEST_TYPE);
        } catch (Exception e) { // no such hash type?
            throw new IOException(e);
        }

        for (File f : _downloadableFiles.listFiles()) {
            sw.writeStartElement("file");
            sw.writeAttribute("name", f.getName());
            sw.writeAttribute("checksumType", DIGEST_TYPE);
            FileInputStream fis = new FileInputStream(f);
            int count;
            while ((count = fis.read(buffer)) != -1) {
                md.update(buffer, 0, count);
                // note: can write separate chunks without problems
                sw.writeBinary(buffer, 0, count);
            }
            fis.close();
            sw.writeEndElement(); // file
            sw.writeStartElement("checksum");
            sw.writeBinaryAttribute("", "", "value", md.digest());
            sw.writeEndElement(); // checksum
        }
        sw.writeEndElement(); // files
        sw.writeEndDocument();
        sw.close();
    }

As with the client, there really isn't anything too special here. Just the usual service, with bit of Stax2 Typed Access API usage.

I briefly tested this by bundling it up as a web app (if you want to do the same, run Ant target "war.samples" in Woodstox trunk), running web app under Jetty 6.1, and accessing from both web browser and via BinaryClient class. Worked as expected right away (which, granted, was somewhat unexpected... usually there are minor tweaks needed, but not today).

2. Output

Just to give an idea of what results should look like, here's what I can see when I download a single file (run.sh):

<?xml version='1.0' encoding='UTF-8'?>
<files><file name="run.sh" checksumType="SHA">IyEvYmluL3NoCgojIExldCdzIGxpbWl0IG1lbW9yeSwgZm9yIHBlcmZvcm1hbmNlIHRlc3RzIHRv
IGFjY3VyYXRlbHkgY2FwdHVyZSBHQyBvdmVyaGVhZAoKIyAtRGphdmEuY29tcGlsZXI9IC1jbGll
bnQgXApqYXZhIC1YWDpDb21waWxlVGhyZXNob2xkPTEwMDAgLVhteDQ4bSAtWG1zMTZtIC1zZXJ2
ZXJcCiAtY3AgbGliL3N0YXgtYXBpLTEuMC4xLmphcjpsaWIvc3RheF9yaS5qYXJcCjpsaWIvbXN2
L1wqXAo6bGliL2p1bml0L2p1bml0LTMuOC4xLmphclwKOmJ1aWxkL2NsYXNzZXMvd29vZHN0b3g6
YnVpbGQvY2xhc3Nlcy9zdGF4MlwKOnRlc3QvY2xhc3NlczpidWlsZC9jbGFzc2VzL3Rvb2w6YnVp
bGQvY2xhc3Nlcy9zYW1wbGVzXAogJCoK</file><checksum value="qAZIQ6GDUJYRgiubW/H+5GZaWg0="/></files>

3. More to know about Base64 variants

One more thing to note is the existence of multiple slightly incompatible Base64 variants (see "URL Applications" section). So which one does Typed Access API use?

The one you define it to use, of course! Stax2 API actually allows caller to specify the variant to use -- sample code just happens to use the default variant (i.e. uses methods that just call alternatives that do take a Base64Variant argument). Stax2-defined Base64 variants (from class 'org.codehaus.stax2.typed.Base64Variants') are:

MIME: this is what is usually considered "the base64" variant: uses default alphabet, requires padding, and uses 76-character lines with linefeed for content. This is the default variant used for element content.
MIME_NO_LINEFEEDS is similar to MIME, but does not split output in lines -- this is the default variant used for attribute values (due to verbosity caused by encoding linefeeds in XML attribute values)
PEM is similar to MIME, but mandates shorter (64 character) line length
MODIFIED_FOR_URL: uses alternate alphabet (hyphen and underscore instead of plus and slash), does not use padding or line splitting.

And these are all implemented by Woodstox. In addition, one can use custom encodings by implementing custom Base64Variant object and passing that explicitly to base64-binary read- and write-methods.
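
To get a feel for how the variants differ on the wire, here is a small sketch that uses the JDK's own java.util.Base64 (Java 8 and later) as a stand-in. This is NOT the Stax2 Base64Variant API: the mapping of JDK encoders to the variant names above is an approximation for illustration only (for one, the JDK MIME encoder separates lines with CRLF), and the class and method names are made up.

```java
import java.util.Base64;

// Illustration with java.util.Base64 (Java 8+); not the Stax2 Base64Variants class
final class Base64Flavors {
    // Roughly MIME: standard alphabet, padding, 76-character lines (CRLF-separated in the JDK)
    static String mime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Roughly MIME_NO_LINEFEEDS: same alphabet and padding, but no line splitting
    static String mimeNoLinefeeds(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Roughly MODIFIED_FOR_URL: '-' and '_' instead of '+' and '/', no padding, no line splitting
    static String modifiedForUrl(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xFB, (byte) 0xFF };
        System.out.println(mimeNoLinefeeds(sample)); // prints "+/8="
        System.out.println(modifiedForUrl(sample));  // prints "-_8"
    }
}
```

Same two bytes, two different encodings: this is why reader and writer need to agree on the variant, and why a decoder for one variant may reject output of another.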

4. Performance?

Beyond simple usage shown so far, what more is there to know about handling binary data?

One open question is performance: how much faster is Typed Access API, compared to using alternatives like XMLStreamReader.getElementText() followed by decode using, say, Jakarta Commons' base64 codec? There are no numbers yet, but producing some will be one of high priority items on my "things to research for Blog" list.
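
For reference, the alternative being compared against looks roughly like the sketch below. It uses only the standard Stax API, with the JDK's java.util.Base64 (Java 8+) standing in for Jakarta Commons codec; that substitution, and the class and method names, are assumptions made to keep the sketch self-contained. No performance claims are implied.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Baseline sketch: plain Stax text access, followed by a separate base64 decode step
final class TextThenDecode {
    static byte[] readBinaryElement(String xml) {
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                sr.nextTag(); // move to the root element
                // The whole encoded content is first materialized as a single String...
                String encoded = sr.getElementText();
                // ...and only then decoded; the MIME decoder skips embedded line breaks
                return Base64.getMimeDecoder().decode(encoded);
            } finally {
                sr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note the structural difference to readElementAsBinary(): here both the encoded String and the decoded byte array are held in memory in full, whereas the typed method hands out decoded content chunk by chunk.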

Posted by Tatu Saloranta at Friday, September 18, 2009 10:56 PM
Categories: XML/Stax

Typed Access API tutorial, part III/a: binary data, client-side

(author's note: oh boy, this last piece of the "Typed Access API series" has been long coming -- apologies, and "better late than never")

Now that we have tackled most of the Stax2 Typed Access API (reading and writing simple values, arrays), let's consider the last remaining part: that of reading and writing base64-encoded binary data. For this installment, let's implement a simple web service that can be used for downloading files, as well as client to use that service.

Use of XML for such purpose may seem bit contrived, but there are other valid use cases for binary-in-xml (even if the example wasn't): for example, it may well make sense to embed small images (like icons), digital signatures, encryption keys and other non-textual data within documents. Sometimes convenience of inlining binary content within message is worth the modest overhead (base64 imposes +33% storage overhead, and similar processing overhead).
For example, in our example, we can embed multiple files with associated metadata quite easily without having to split the logical document. But both client and server can still handle files one-by-one with streaming interfaces, meaning that memory usage need not grow without bounds.
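
The +33% figure follows directly from how base64 works: every 3 bytes of input become 4 output characters. A quick sketch of that arithmetic, checked against the JDK codec (java.util.Base64, Java 8+, used here only for illustration; class and method names are made up):

```java
import java.util.Base64;

// Size arithmetic behind the "+33%" figure
final class Base64Overhead {
    // Every 3 input bytes become 4 output characters; a partial last group is padded to 4
    static int encodedLength(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[3000];
        int actual = Base64.getEncoder().encode(raw).length;
        System.out.println(encodedLength(raw.length) + " / " + actual); // prints "4000 / 4000"
    }
}
```

Variants that split output into lines add a little on top of this for the linefeeds.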

Finally, unlike many other xml processing packages, Woodstox does not cut corners when it comes to processing efficiency: base64 processing implementation is a significant improvement over using existing third-party base64 codecs on other processing APIs (regular SAX, Stax or DOM).

So much for the philosophic part of why to use (or not to use) xml. Let's have look at a simple implementation to show binary content handling pieces that we need, along with a bit of glue to make example code work.
(note: source code is also accessible)

1. Message format

Here is the simple xml message format we will be using:

  <files>
    <file name="test.jpg" checksumType="SHA">... base64 encoded content ...</file>
    <checksum value="...base64 encoded hash of content..." />
    <!-- ... and more files, if need be... -->
  </files>

That is, a single message contains one or more files, each with associated checksum. Checksum is used to verify that contents were passed unmodified (as opposed to being corrupted by transfer). Simple but functional.

2. Client-side

So let's start with sample client code; code downloads bunch of files from the service (for now assuming URL determines set of files we'll get with some criteria).

For this example we will just use the regular http client that JDK comes equipped with (which actually works pretty well for many use cases -- for others, Jakarta httpclient is the cat's meow).
Full source code can be found at Woodstox SVN repository (under 'src/samples') but here's the interesting Client method:

public List<File> fetchFiles(URL serviceURL) throws Exception
{
  List<File> files = new ArrayList<File>();
  URLConnection conn = serviceURL.openConnection();
  conn.setDoOutput(false); // only true when POSTing
  conn.connect();
  // note: should check 'if (conn.getResponseCode() != 200) ...' (needs cast to HttpURLConnection)

  // Ok, let's read it then... (note: StaxMate could simplify a lot!)
  InputStream in = conn.getInputStream();
  XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory.newInstance().createXMLStreamReader(in);
  sr.nextTag(); // to "files"
  File dir = new File("/tmp"); // for linux...
  byte[] buffer = new byte[4000];

  while (sr.nextTag() != XMLStreamConstants.END_ELEMENT) { // one more 'file'
    String filename = sr.getAttributeValue("", "name");
    String csumType = sr.getAttributeValue("", "checksumType");
    File outputFile = new File(dir, filename);
    FileOutputStream out = new FileOutputStream(outputFile);
    files.add(outputFile);
    MessageDigest md = MessageDigest.getInstance(csumType);

    int count;
    // Read binary contents of the file, calc checksum and write
    while ((count = sr.readElementAsBinary(buffer, 0, buffer.length)) != -1) {
      md.update(buffer, 0, count);
      out.write(buffer, 0, count);
    }
    out.close();
    // Then verify checksum
    sr.nextTag();  
    byte[] expectedCsum = sr.getAttributeAsBinary(sr.getAttributeIndex("", "value"));
    byte[] actualCsum = md.digest();
    if (!Arrays.equals(expectedCsum, actualCsum)) {
      throw new IllegalArgumentException("File '"+filename+"' corrupt: content checksum does not match expected");
    }
    sr.nextTag(); // to match closing "checksum"
  }
  return files;
}

Much of the code deals with connecting to the service; actual access is rather simple; only complexity comes from streamability of API (i.e. you read chunks of binary data, instead of reading the whole thing).

What is left, then, is the server side... which will follow shortly (I swear, won't take months this time)

Posted by Tatu Saloranta at Tuesday, September 08, 2009 10:44 PM
Categories: XML/Stax

Are GAE developers a bunch of

ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: API is a mess, concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shaken my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right second time around. No such luck: Google App Engine was a stillbirth; plagued by very similar problem as Android. Most specifically, significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to GAE team). But this is just very ridiculous (dare I say, retarded?) argument, because:

Being but an API package, there is no functionality that could possibly have security implications (yes, I know exactly what is within those few classes -- the only actual code is for implementation discovery, which was copied from SAX), and
If there are problems with implementations of the API (which should be irrelevant, but humor me here), same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about package in question. I could as well ask about regular expression or image processing packages it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is little bit more pressure in fixing these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both doing "their own things", although those being very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc), but not managed runtime execution service).

Posted by Tatu Saloranta at Thursday, July 09, 2009 10:51 PM
Categories: Open Source, Rant, XML/Stax

CowTalk

Moo-able Type for Cowtowncoder.com
