Saturday, March 12, 2011

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

  1. It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
  2. It is the only open-source Java parser that can do non-blocking parsing (aka asynchronous, or async, parsing)

The former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast at parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also a bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor any real documentation for the non-blocking part of Aalto; and also because this part of the functionality was only completed fairly recently (while some parts were written up to two years ago, the last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, a brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. The non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

The basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of the parser reading content from a stream or reader, and blocking the thread if none is available, content is "pushed" to the parser, and the parser gives out processed events when there is enough content available. This is similar to how many C parsers work, as well as to the operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater and Inflater, which take input via setInput() calls).
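To make the push model concrete, here is a small self-contained sketch using the JDK deflate codec mentioned above (this is plain JDK code, not Aalto; the class and method names are my own). Input is "fed" one byte at a time, and output is drained whenever some becomes available, which mirrors the feed-and-poll loop of a non-blocking parser:

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PushDemo {
    // Decompress by "pushing" input to the Inflater one byte at a time,
    // draining whatever output becomes available after each push.
    public static String pushInflate(byte[] compressed, int len) throws Exception {
        Inflater inf = new Inflater();
        byte[] out = new byte[256];
        int outLen = 0;
        for (int i = 0; i < len; i++) {
            inf.setInput(compressed, i, 1); // feed a single byte
            int n;
            while ((n = inf.inflate(out, outLen, out.length - outLen)) > 0) {
                outLen += n; // got some decoded output
            }
        }
        inf.end();
        return new String(out, 0, outLen, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // compress a sample, then decode it push-style
        byte[] data = "Hello, push-style decoding!".getBytes("UTF-8");
        Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        byte[] compressed = new byte[256];
        int clen = def.deflate(compressed);
        System.out.println(pushInflate(compressed, clen)); // round-trips the input
    }
}
```

The caller, not the codec, decides when (and how much) input arrives; the codec simply reports whether it has output ready. Non-blocking XML parsing works the same way, with XML events instead of decompressed bytes.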

The main benefit of non-blocking operation is the ability to process multiple XML input sources without having to allocate one thread per source, the same benefit that NIO brings to basic web services. And in fact, a non-blocking parser is something that could benefit non-blocking web services a lot: without such a parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; the exact number depends on the platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, can benefit from this mode of operation.
In addition, a related benefit is that memory usage of a non-blocking parser can be more closely bounded: since only a limited amount of input is buffered at any given point, the amount of working memory can be more limited (at least when not forcing coalescing of XML text segments).

On the downside, code that uses non-blocking parsing can be slightly more complex to write; and given the lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts with. But it can be a very useful technique for a subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing a piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from a String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into the parser a single byte at a time. In actual production use one would typically feed content a block at a time; either fully read blocks, or chunks of content as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by the input source (Netty), or be input source specific.

  byte[] XML = "<html>Very <b>simple</b> input document!</html>".getBytes("UTF-8");
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed content one byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(XML, inputPtr++, 1);
      if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
        feeder.endOfInput();
      }
    }
    // and once we have a full event, we just dump out the event type (for now)
    System.out.println("Got event of type: "+type);
    // could also just copy the event as is, using Stax, or do any other normal non-blocking handling:
    // xmlStreamWriter.copyEventFromReader(asyncReader, false);
  } while (type != XMLStreamConstants.END_DOCUMENT);
  asyncReader.close();

And that's it. There are actually just a couple of additional things needed to do non-blocking parsing:

  1. Use of the regular Stax API with just a single extension: the introduction of a new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned when there isn't enough content buffered to fully construct a token to return
  2. Feeding of content using AsyncInputFeeder (an instance of which is accessed via AsyncXMLStreamReader, an extension of the basic XMLStreamReader)
  3. Indicating end-of-content via feeder when all content has been read

This makes operation a bit more complicated than use of a straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

  • Coalescing mode has not been implemented for non-blocking Stax. Since coalescing (of all adjacent text segments, as per the Stax spec) is probably less important for non-blocking use cases than blocking ones (as it increases the need for buffering, and may increase latency), it was left as the last major piece to be completed.
  • There isn't yet a non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to the SAX API itself (one just has to call "XMLReader.parse()" multiple times); but as it is based on the same parser core as the Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, the non-blocking mode is less well tested than the blocking mode, which can use the full basic StaxTest suite, used successfully for years with Woodstox (and with Aalto for more than a year as well).

Friday, March 11, 2011

Jackson 1.8: custom property naming strategies

One of the big goals for Jackson 1.8 is to implement the oldest open feature requests. One such feature is the ability to customize the property naming strategy: that is, to allow use of JSON names that do not match the requirements of Bean naming conventions.

For example, consider the case of the Twitter JSON API, which uses a "C-style" naming convention, so that what would be "profileImageUrl" with Bean naming conventions is instead "profile_image_url".
With earlier Jackson versions, one had to annotate all such properties with the @JsonProperty annotation; or use rather unwieldy method names like "getprofile_image_url()" (which would work, but look ugly).

But version 1.8 will finally allow use of custom naming strategies. Let's examine how.

The first thing to do is to extend PropertyNamingStrategy:

  static class CStyleStrategy extends PropertyNamingStrategy
  {
    @Override
    public String nameForField(MapperConfig<?> config, AnnotatedField field, String defaultName) {
      return convert(defaultName);
    }

    @Override
    public String nameForGetterMethod(MapperConfig<?> config, AnnotatedMethod method, String defaultName) {
      return convert(defaultName);
    }

    @Override
    public String nameForSetterMethod(MapperConfig<?> config, AnnotatedMethod method, String defaultName) {
      return convert(defaultName);
    }

    private String convert(String input) {
      // easy: replace capital letters with underscore, lower-cased equivalent
      StringBuilder result = new StringBuilder();
      for (int i = 0, len = input.length(); i < len; ++i) {
        char c = input.charAt(i);
        if (Character.isUpperCase(c)) {
          result.append('_');
          c = Character.toLowerCase(c);
        }
        result.append(c);
      }
      return result.toString();
    }
  }

which in this case will just convert property names by replacing each capital letter with an underscore followed by the lower-cased version of the letter.
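To sanity-check the conversion logic in isolation, the same loop can be run as plain Java (the class and method names here are my own, for demonstration; only the loop body matches CStyleStrategy.convert() above):

```java
public class NameConvertDemo {
    // Same logic as the naming strategy's convert(): each upper-case letter
    // is replaced by an underscore followed by its lower-case equivalent.
    public static String toCStyle(String input) {
        StringBuilder result = new StringBuilder();
        for (int i = 0, len = input.length(); i < len; ++i) {
            char c = input.charAt(i);
            if (Character.isUpperCase(c)) {
                result.append('_');
                c = Character.toLowerCase(c);
            }
            result.append(c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCStyle("profileImageUrl")); // prints "profile_image_url"
    }
}
```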
And then the only other thing is to register this strategy with ObjectMapper:

   ObjectMapper mapper = new ObjectMapper();
   mapper.setPropertyNamingStrategy(new CStyleStrategy());

and after this we could do:

  static class PersonBean {
        public String firstName;
        public String lastName;

        public PersonBean(String f, String l) {
            firstName = f;
            lastName = l;
        }
  }

  // so this would hold true:
  assertEquals("{\"first_name\":\"Joe\",\"last_name\":\"Sixpack\"}",
      mapper.writeValueAsString(new PersonBean("Joe", "Sixpack")));

This should help with APIs that use C-style underscore notation, or a variation of camel casing (such as one where the first word is also capitalized).

Thursday, March 10, 2011

Upgrade from org.json to Jackson, piece by piece, using jackson-module-org-json

1. Background

One common task for software engineers is upgrading legacy systems; things written a while ago, often by developers who have since moved on, using tools and techniques thought to be a good fit back then. While the reasons for upgrades are typically either the addition of minor new features, or fixes to the most annoying issues, upgrading can and should also include some pro-active work to improve the quality of the system itself, to "pay off technical debt" before it is due.

One of the first things you usually observe when looking under the hood of such systems is that various design and implementation choices do not look so good any more. Sometimes a technology choice turned out to be a dead end; sometimes the design was based on incomplete or incorrect understanding. But almost always some of the supporting libraries have become obsolete: a tool that was thought of as best-of-breed back then may well now be amongst the most despised pieces of software ever devised.

2. Problem

One common case of "library rot" that I have often encountered is that of using the "org.json" library for reading and writing JSON. It was the first Java JSON library around, and served as a sort of proof of concept back in the early days of JSON usage. But time has not been kind to it: with the arrival of more modern, fully-featured alternatives, it is the monoplane of the Second World War era; once serviceable, but now just a cumbersome and slow thing that few would choose given a choice of modern JSON libraries like Jackson. You can get improvements in about any area, from convenience and intuitiveness to performance, just by upgrading.

When considering a possible upgrade of the JSON library, the initial issues encountered are usually not very big: changing code that directly deals with reading and writing JSON is not awfully hard, and the resulting code is typically much cleaner. But there is often a much bigger problem outside this core JSON code: in the case of the org.json package, specifically, all processing is done using concrete types (JSONObject, JSONArray). And values of these types are very easy to leak outside, resulting in unintended tight coupling of non-JSON-processing code to the JSON parser implementation. While this may be a good lesson on the value of proper design and encapsulation of concerns, it is of little consolation when you are looking at code that had neither.

If your only choice is to rewrite major portions of the legacy system (to mostly address relatively small JSON-specific portion), it is often easier to just let things be as they are.

3. But what if we only upgraded parser, not JSONObject/JSONArray?

Thinking about this a bit, I realized that the biggest immediate obstacle is really the deep coupling via the 2 value types. So what if we just continued to use these non-abstract "abstractions", but replaced the underlying JSON parser and generator? While this would not allow use of more convenient and proper real abstractions (data binding to POJOs), at least it could give some nice efficiency improvements (more on this below); and make it easier to do a complete conversion a bit later on. Killing a beast is often easiest by a thousand cuts.

Fortunately for us, Jackson 1.7 made it very easy to add support for additional data types; so I decided to spend one (and only one!) evening writing an extension module -- jackson-module-json-org, available from GitHub -- to allow using Jackson to effectively read and write JSONObject and JSONArray types.

Usage is rather straight-forward; after registering the new module with ObjectMapper:

ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new JsonOrgModule());

you can use all the usual conversions like so:

// read/write JSON
JSONObject ob = mapper.readValue(json, JSONObject.class); // read from a source
String output = mapper.writeValueAsString(ob); // write out as a String

// convert POJOs to/from JSONObject/JSONArray
MyValue value = mapper.convertValue(jsonObject, MyValue.class);
JSONObject jsonObject = mapper.convertValue(value, JSONObject.class);

// and even convert to/from the Jackson Tree model:
JsonNode root = mapper.valueToTree(jsonObject);
jsonObject = mapper.treeToValue(root, JSONObject.class);

and there you have it, convenient conversions between JSON, POJO, and org.json types.

In fact, this will also help with later stages, because you can easily (and relatively efficiently) go back and forth between representations: so we can extend "org.json-free" zone further and further away from the source. This is what I really mean by piece-by-piece approach: it is possible to do refactoring one component or area at a time.

4. But wait! There is more.... performance bonus!

By now it should be well-known that Jackson is the most performant library for doing JSON manipulation in Java. So could it be that using this module might speed up processing as well? Aren't we still using some org.json pieces? It turns out that since JSONObject is little more than a wrapper around a basic Java HashMap (and JSONArray similarly around an ArrayList), most of the performance benefit comes from the different parser and generator. So I went ahead and modified my JSON parser performance test to compare "raw Jackson" (using the Jackson Streaming API), "Jackson with Tree Model" (JsonNode), "Jackson with org.json" and stock org.json. Here is one sampling of results I saw on my machine (using sample document #5 from org.json):

Test 'Jackson, stream' -> 282 msecs
Test 'Jackson, JsonNode' -> 411 msecs
Test 'Jackson + module-org-json' -> 393 msecs
Test 'org.json' -> 783 msecs

It turns out that while Streaming API is the fastest way, the new "jackson-module-json-org" extension can actually bind JSON as fast as Jackson Tree model -- both of which are twice as fast as basic org.json package for this small document (for bigger documents difference is typically even bigger).

So in addition to cleaning up the code, you may also end up speeding up your JSON processing code by 100%. Not a bad deal, eh?

About me

  • I am known as Cowtowncoder