3 Simple Rules for Fast XML-processing using Stax (aka "How to make Woodstox fly")
Although streaming XML-processing (both SAX event-based "push parsing" and Stax cursor-based "pull parsing") has potentially high processing throughput (upwards of 10 megabytes per second), in practice many developers end up with sub-standard performance. Why is this the case?
One important issue is that there are many functionally correct, but inefficient, ways of doing the processing. API documentation often does not (or can not) indicate optimal usage; after all, many performance characteristics are implementation dependent. But in practice, certain usage patterns are likely to have good performance characteristics. The following list contains 3 simple general rules, with examples of their applicability, that can have a significant positive performance impact when using the Woodstox Stax processor. They are also likely to help with other implementations, and at the very least should not have negative effects. These rules are:
- Be green: reuse components that are designed to be reusable.
- Don't do the processor's job, just give it enough information ("tell me what to do, not how to do it").
- Close what you have opened.
The above rules are listed in rough priority order: the first rule will very likely have an effect with all implementations. But beyond potential performance benefits, the other two rules are also simply good general programming practice.
However, these rules are just loose principles, guidelines that may not be of much immediate use by themselves. So what practical examples are there of following them? Here are the ones I can think of.
Reusing components:
- With Stax this is quite easy, since most objects can not be reused (there are no methods to reset them), and those that can be reused usually should be.
- The main class of reusable Stax objects are the factories: XMLInputFactory and XMLOutputFactory. Although the Stax specification itself is not specific regarding thread safety (or lack thereof) of these factories, all existing open source implementations follow the same pattern: factories are not guaranteed to be thread-safe while they are being configured, but they are afterwards, when only the reader/writer-creating factory methods are used. So as long as you do the configuration from a single thread (possibly from a static initializer of a class, or the constructor of a singleton object) and complete it before calling the factory methods, you can safely keep reusing the factory to produce stream/event readers and writers from multiple non-synchronized threads.
There are 2 main reasons for the performance benefits of reusing input and output factories:
- Implementation discovery overhead when instantiating factory objects via XMLXxxFactory.newInstance(). The dynamic mechanism the Stax API uses for finding out which Stax implementation to use is very versatile, and also very slow: it may take multiple milliseconds to introspect which jar contains the implementation, and for small documents this overhead is bigger than the time it takes to process the content itself! (For example, at a parsing speed of 10 MBps, parsing a 1 kB SOAP message takes about 0.1 milliseconds: contrast that with, say, a 5 millisecond overhead for factory instantiation.) So even if you can not reuse the factory itself, you can at the very least keep a reference to the actual factory class, and instantiate it via reflection, which avoids the excessive overhead since it is just a regular reflection call. It is worth noting that a similar overhead is incurred when using the JAXP or SAX APIs to dynamically construct implementation objects.
- Per-factory caching: some caching schemes implementations use (specifically, the symbol table reuse and DTD caching Woodstox uses) work on a per-factory basis. Access to these caches is properly synchronized and thread-safe, but they can not function efficiently unless readers and writers are produced by the same factory: new factory instances start with empty caches.
- Factory instance reuse (like most other optimizations) has the most significant relative effect when processing small documents. In such cases, properly reusing input and output factories can double overall throughput, or better.
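The reuse pattern described above can be sketched as follows. This is a minimal illustration, not Woodstox-specific: the class and method names are my own, and only standard javax.xml.stream calls are used. The factory is created and configured once in a static initializer (so the newInstance() discovery cost is paid a single time), and is then shared by all subsequent parses:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SharedFactoryExample {
    // Configured once, from a single thread, before any reader is created;
    // safe to share between threads afterwards (per common implementation behavior).
    private static final XMLInputFactory INPUT_FACTORY;
    static {
        XMLInputFactory f = XMLInputFactory.newInstance(); // discovery cost paid once
        // All configuration happens here, before the factory is published:
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        INPUT_FACTORY = f;
    }

    // Returns the local name of the first element in the document.
    public static String firstElementName(String xml) throws Exception {
        XMLStreamReader sr = INPUT_FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    return sr.getLocalName();
                }
            }
            return null;
        } finally {
            sr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // The same factory instance serves any number of parses:
        System.out.println(firstElementName("<root><child/></root>")); // prints "root"
        System.out.println(firstElementName("<other/>"));              // prints "other"
    }
}
```

With Woodstox on the classpath, both parses also share the factory's symbol table cache; with any other implementation, the code still works and at least avoids repeated implementation discovery.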
Giving the processor information, but letting it do its job:
- The most obvious example of this rule: if you have an InputStream through which the XML content can be read, do not try to help the stream reader by constructing a Reader for it yourself (unless you have a specific functional need to do so). It is likely that the input factory can find a more optimal Reader implementation on its own. It may also be able to reuse internal Reader buffers (especially when coupled with "close what you have opened", see below).
- Similarly, in cases where you can pass just a reference (like a URL or a File), it is better to pass that reference instead of constructing a Reader, or even an InputStream, yourself. Although the basic Stax 1.0 API does not have such methods, some implementations have additional methods (Woodstox, for example, has its own "Stax2" set of extensions) that allow passing such references when constructing instances.
- Giving the processor enough information is important, however: for example, if you do know what the encoding of the content should be, it is a good idea to pass that to the factory method (if possible). The implementation can then use it as necessary.
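Put together, the bullets above amount to handing the factory the rawest input you have, plus the encoding if you know it. Here is a sketch using the standard Stax 1.0 createXMLStreamReader(InputStream, String) overload; the class and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PassStreamExample {
    // Pass the raw InputStream and the known encoding; do NOT wrap the
    // stream in an InputStreamReader yourself. The factory can pick the
    // most efficient decoding strategy for its own use.
    public static String rootName(InputStream in, String encoding) throws Exception {
        XMLInputFactory f = XMLInputFactory.newInstance();
        XMLStreamReader sr = f.createXMLStreamReader(in, encoding);
        try {
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    return sr.getLocalName();
                }
            }
            return null;
        } finally {
            sr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<doc>text</doc>".getBytes("UTF-8");
        System.out.println(rootName(new ByteArrayInputStream(doc), "UTF-8")); // prints "doc"
    }
}
```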
Closing what you have opened:
- This is a simple common sense rule, but it can also have a positive performance impact. For example, Woodstox can reuse some of its internal data structures if it knows when the caller is done with a stream/event reader: once a reader or writer is closed, it can not be used any further. Although it is sometimes possible for the implementation to know when active use ends (for example, once END_DOCUMENT is returned by a stream reader, no further information can be retrieved), it is better to explicitly declare the end of active use. This is done by calling XMLStreamReader.close() (and similarly for event readers, and for stream/event writers).
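An explicit close in a finally block might look like the following sketch (class and method names are illustrative). Note that per the Stax specification, XMLStreamReader.close() does not close the underlying InputStream or Reader, so that still needs to be closed separately by the caller:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CloseExample {
    // Counts START_ELEMENT events in the document.
    public static int countElements(InputStream in) throws Exception {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            int count = 0;
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            return count;
        } finally {
            // Tell the implementation we are done, so it can recycle its
            // internal buffers. Per the Stax spec, this does NOT close the
            // underlying stream; the caller closes that separately.
            sr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<a><b/><c/></a>".getBytes("UTF-8");
        try (InputStream in = new ByteArrayInputStream(doc)) {
            System.out.println(countElements(in)); // prints 3 (a, b, c)
        }
    }
}
```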
- As with factory object reuse, the effect of internal data structure reuse by implementations is most significant for small documents: doubled throughput is possible for documents around 1 kB in size.
Given all of the above, how much does this really matter? For small documents (in the 1 - 2 kilobyte range), the difference in throughput between optimal usage and the slowest naive (but functionally correct) approach can be up to 5x (and even more in the degenerate case of calling XMLInputFactory.newInstance() for each instance created). So it is probably worth following these simple rules.