Saturday, March 12, 2011

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

  1. It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
  2. It is the only open-source Java parser that can do non-blocking parsing (aka asynchonous, or async, parsing)

Former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast for parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor real documentation for use of non-blocking part of Aalto; and also because this part of functionality has been only completed fairly recently (while some parts of functionality were written up to two years ago, last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. Non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

Basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of parser using a stream or reader to read content, and blocking the thread if none is available, content is rather "pushed" to parser; and parser will give out processed events if there is enough content available. This is similar to how many C parsers work; as well as operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater).

The main benefit of non-blocking operation is ability to process multiple XML input sources without having to allocate one thread per source, same benefit as that NIO has for basic web services. And in fact, having a non-blocking parser is something that could benefit non-blocking web services a lot: without such parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; exact number depends on platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, cand benefit from this mode of operation.
In addition, another related benefit is that memory usage of non-blocking parser can be more close bounded: since limited amount of input is buffered at any given point, amount of working memory can be more limited (at least when not forcing coalecing of XML text segments).

On downside, writing code to use non-blocking parsing can be slightly more complex to write: and given lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts doing initially. But it can be a very useful technique for subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into parser, single byte at a time. In actual production use one would typically feed content block at a time; either fully read blocks, or chunks of contents as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by input source (Netty), or be input source specific.


  byte[] XML = "<html>Very <b>simple</b> input document!</html>";
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(buf, inputPtr++, 1);
if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
feeder.endOfInput(); } } // and once we have full event, we just dump out event type (for now) System.out.println("Got event of type: "+type); // could also just copy event as is, using Stax, or do any other normal non-blocking handling: // xmlStreamWriter.copyEventFromReader(asyncReader, false); } while (type != END_DOCUMENT); asyncReader.close();

And that's it. There are actually just couple of additional things needed to do non-blocking parsing:

  1. Use of regular Stax API, with just a single extension, introduction of new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned if there isn't enough content buffered to fully construct a token to return
  2. Feeding of content using AsyncInputFeeder (instance of which is accessed via AsyncXMLStreamReader, extension of basic XMLStreamReader)
  3. Indicating end-of-content via feeder when all content has been read

Which makes operation bit more complicated than use of straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

  • Coalescing mode has not been implemented for non-blocking Stax. Since use of coalescing (of all adjacent text segments, as per Stax spec) is probably less important for non-blocking use cases than blocking ones (as it will increase need for buffering, possible increase latency), it was less as the last major piece to be completed.
  • There isn't yet non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to SAX API itself (one just has to call "XMLReader.parse()" multiple times; but as it is based on same parser core as Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, non-blocking mode is less well tested than blocking mode: blocking mode can use full basic StaxTest suite, used succesfully for years with Woodstox (and for Aalto for more than a year as well).

blog comments powered by Disqus

Sponsored By


Related Blogs

(by Author (topics))

Powered By

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.