Monday, June 19, 2006

Nux version 1.6 released

Another release, and a very important one, is that of the Nux XML processing toolkit: version 1.6 was just released. For those not familiar with Nux, it is a collection of useful helper functionality and extensions that make XML processing (and some related content management tasks like search indexing) more convenient, more powerful, or easier -- oftentimes all of the above. Some notable features are:

  • Extended XQuery/XPath support (with Saxon); see the short usage sketch after this list
  • XOM extensions, including efficient builders and serializers
  • VERY fast Binary XML implementation, BNux (about 2x as fast as Sun's Fast Infoset -- and thereby almost 3x as fast as Woodstox); possibly the fastest way to transform XML Infoset content in Java.
  • Full-text search extensions based on Lucene; especially for cases where individual documents can be stored fully in memory, but need to be searched using powerful search queries.
  • Extensive performance test suite that allows for comparing various XML processor (parsing, serialization) implementations; from binary variants to traditional text-based ones (DOM, SAX, StAX).
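
As a taste of the XQuery/XPath support mentioned in the first bullet, here is a minimal sketch of querying a XOM document via Nux's XQueryUtil convenience class. This is written from memory of the Nux documentation, so treat the exact class and method names as assumptions, and the file name and query as placeholders:

    import java.io.File;

    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Nodes;
    import nux.xom.xquery.XQueryUtil;

    public class NuxQueryExample
    {
        public static void main(String[] args) throws Exception
        {
            // Build a XOM document from a file (file name is just a placeholder)
            Document doc = new Builder().build(new File("books.xml"));
            // Run an XPath/XQuery expression over the document
            Nodes results = XQueryUtil.xquery(doc, "//book[author='Melville']/title");
            for (int i = 0; i < results.size(); ++i) {
                System.out.println(results.get(i).toXML());
            }
        }
    }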

Personally I am very impressed by the binary infoset implementation, as well as the overall quality of the package. It is one of those packages I wish more developers were aware of.

Stax Reference Implementation version 1.2 released

This may not be big news, but since many newcomers to Stax-based XML processing start with the reference implementation, the final 1.2 release is still important. There have been lots of fixes to the most immediate and severe problems, and the reference implementation now finally passes most of the StaxTest unit tests.

There are still some things missing from the RI (such as coalescing mode, and NamespaceContext for Event objects), but at least it should now be possible to try out simple example documents without hitting obvious problems with character entities and such. Handling of namespace URIs is also more consistent: null is now always returned for prefixes that are not bound (and for the default namespace if no explicit binding is present).
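
To make that last point concrete, here is a minimal sketch, using only the standard Stax API and a made-up document literal, of the two lookups that should now return null when no binding is in effect (per the behavior described above):

    import java.io.StringReader;

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;

    public class PrefixBindingCheck
    {
        public static void main(String[] args) throws Exception
        {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader("<root>text</root>"));
            sr.next(); // advance from START_DOCUMENT to START_ELEMENT <root>
            // Neither a prefix nor a default namespace is bound here, so both
            // of these should return null, per the RI behavior described above:
            System.out.println(sr.getNamespaceURI());
            System.out.println(sr.getNamespaceContext().getNamespaceURI("undeclared"));
            sr.close();
        }
    }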

Thursday, June 15, 2006

New features of upcoming release 3.0 of Woodstox

Although the final 3.0 release of Woodstox is not out yet, the API and feature set are now frozen with the release of the first release candidate. Because of this, now is a perfect time to have a look at what the 3.0 release will bring compared to the trusty old 2.0.x version.

At a high level, the main changes are:

  • Rewritten validation sub-system, along with a new validator implementation for Relax NG.
  • Significant performance improvements to both stream reader and writer; especially when processing small documents.
  • Further improvements to XML conformance: XML 1.0 and 1.1 conformance is now over 99% as measured by the industry-standard XMLTest conformance suite (tested using SAXTest and SAX wrappers from stax-utils). In particular, conformance of DTD handling is significantly improved.
  • Significantly improved test coverage, both regarding features tested and actual code coverage.
  • Improved interoperability: behavior is unified with the Stax reference implementation (including changes to one or both, as dictated by accepted Stax specification interpretations); DOMSource support was added, which allows creating an XMLStreamReader from a DOM tree (see the sketch after this list); and UTF-32 encoding support was added.
  • Additional operating modes: parsing mode (tree [default], forest, fragment); ability to handle undeclared entities gracefully (in non-entity-expanding mode).
  • Convergence of writer- and reader-side functionality, by adding previously missing features to writers (optional line number reporting, an XML warning handler, disabling of namespace handling), done in the context of the Stax2 extended API.
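
To illustrate the DOMSource addition mentioned in the list above, here is a minimal sketch of creating a stream reader over an existing DOM tree. The createXMLStreamReader(Source) factory method is part of the standard Stax API (support for it is optional); whether a plain DOMSource is accepted depends on the implementation and version, and the document content here is made up:

    import java.io.ByteArrayInputStream;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.dom.DOMSource;

    import org.w3c.dom.Document;

    public class DomSourceExample
    {
        public static void main(String[] args) throws Exception
        {
            // First build a DOM tree (from an in-memory document, for brevity)
            byte[] xml = "<root><child>text</child></root>".getBytes("UTF-8");
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new ByteArrayInputStream(xml));

            // Then create a stream reader that traverses the existing DOM tree
            XMLInputFactory f = XMLInputFactory.newInstance();
            XMLStreamReader sr = f.createXMLStreamReader(new DOMSource(doc));
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("Element: " + sr.getLocalName());
                }
            }
            sr.close();
        }
    }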

For an even more complete picture of the individual changes, you can check the Jira bug-tracking system used by the Woodstox project; there are almost 40 resolved entries for the 3.0 release.

Of the changes mentioned above, the first one may be the most significant new feature. It also resulted in a complete rewrite of the existing (DTD) validation system. The re-designed system is now:

  • Fully pluggable: org.codehaus.stax2.validation.XMLValidationSchemaFactory implementations can be included and discovered dynamically, similar to the way basic Stax implementations of XMLInputFactory and XMLOutputFactory are. This could, in theory, allow implementing cross-implementation validators in the future. The 3.0 release includes the rewritten native DTD validator, as well as a Relax NG validator based on Sun's Multi-Schema Validator; there are also plans to include an MSV-based W3C Schema validator in the near future (a usage sketch follows this list).
  • Bi-directional: same validators can be used both when parsing (with XMLStreamReader) and when serializing (with XMLStreamWriter).
  • Chainable: a single XML event stream can be validated against multiple validation schemas.
  • Customizable error handling: fail-fast (exception on validation error), error-collecting, or a combination (collect up to 50 first errors).
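
As a rough illustration of how the pluggable, bi-directional validation described above is meant to be used, here is a minimal sketch based on the Stax2 validation classes named in the first bullet. It assumes a Stax2-capable implementation (such as Woodstox); the file names are placeholders, and exact method signatures may differ slightly between releases:

    import java.io.File;
    import java.io.FileInputStream;

    import javax.xml.stream.XMLInputFactory;

    import org.codehaus.stax2.XMLStreamReader2;
    import org.codehaus.stax2.validation.XMLValidationSchema;
    import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

    public class ValidatingReadExample
    {
        public static void main(String[] args) throws Exception
        {
            // Look up a schema factory for the desired schema type (DTD here;
            // Relax NG would use SCHEMA_ID_RELAXNG) and compile the schema
            XMLValidationSchemaFactory sf =
                XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
            XMLValidationSchema schema = sf.createSchema(new File("mydoc.dtd"));

            // Any Stax2 stream reader can then validate while parsing; the same
            // schema object could also be used with an XMLStreamWriter2
            XMLInputFactory f = XMLInputFactory.newInstance();
            XMLStreamReader2 sr = (XMLStreamReader2) f.createXMLStreamReader(
                new FileInputStream("mydoc.xml"));
            sr.validateAgainst(schema); // call again to chain another schema
            while (sr.hasNext()) {
                sr.next(); // by default, validation problems throw an exception
            }
            sr.close();
        }
    }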

Related to the validation system re-design, the native DTD validator implementation was also rewritten. The result is a fully conformant DTD validator, with all well-formedness checks reliably implemented. Handling of default attribute values and attribute types is now done in DTD-aware but non-DTD-validating mode as well (unlike in 2.0).

As to testing, one important development is that code coverage measurement was started during the 3.0 cycle. Coverage reached 60% to 80% (for lines and methods covered, respectively), which should be a good starting point going forward. Work with the StaxTest Stax conformance test suite (which is included with the reference implementation, and also used for testing it) improved compatibility between Woodstox and the reference implementation. Finally, the unit test suite specific to Woodstox itself was significantly improved. Together these changes suggest that 3.0 will be the best-tested release so far, and hopefully will have even fewer bugs than the earlier 1.0.x and 2.0.x releases.

And the feature that many developers will hopefully find tempting as well is performance. It may seem unusual that improvements in standards compliance could go together with performance improvements, but that is the case for 3.0. The changes are most noticeable when dealing with small documents, since the per-document setup overhead (which is significantly reduced in 3.0) matters most for them. The serialization side has also been optimized for the first time: whereas 1.0 and 2.0 focused on making handling correct, the 3.0 development cycle included time for performance work. The results are well worth the time spent.

All in all, the 3.0 release will hopefully be another big step for Woodstox: quality and reliability should be even better than in 2.0, configurability and the feature set are improved, and all this while increasing performance!

Wednesday, June 14, 2006

A word from our sponsor...

Ok, I give up! After struggling for weeks, it has become financially impossible for me to fund all this from my limited low-paying job as a small little code monkey. So I have let myself become a capitalists' lackey, and installed some additional decorations that you can see to the right of the textual content. Please give a big round of applause to Mr. G. Welcome, sir, and may you have mercy on my blog site. :-)

Seriously though, I hope the ads you may vaguely notice on one side of your field of vision do not distract you too much. They were added in the hope that they may cover part of the modest ongoing hosting expenses [editor's note: big round of applause to our nice hosting provider, Web Intellects -- and no, this one is not a paid placement, we are just happy small customers]. Depending on how this all works out, I will do my best to limit the screen real estate taken by this additional decorative material.

And after this brief interruption, back to our regular programming.

3 Simple Rules for Fast XML-processing using Stax (aka "How to make Woodstox fly")

Although streaming XML processing (including both SAX event-based "push parsing" and Stax cursor-based "pull parsing") has potentially high processing throughput (in the range of over 10 megabytes per second), in practice many developers end up with sub-standard performance. Why is this the case?

One important issue is that there are many functionally correct, but inefficient, ways of doing the processing. Often the API documentation does not (or can not) indicate the optimal way; after all, many performance characteristics are implementation dependent. But in practice, certain usage patterns are most likely to have good performance characteristics. The following list contains 3 simple general rules, with examples of their applicability, that can have a significant positive performance impact when using the Woodstox Stax processor. They are also likely to help with other implementations, and at the very least should not have negative effects. The rules are:

  • Be green: reuse components that are designed to be reusable.
  • Don't do the processor's job, just give it enough information ("tell me what to do, not how to do it").
  • Close what you have opened.

The above rules are listed in rough priority order: the first rule will very likely have an effect on all implementations. Beyond potential performance benefits, the other two rules are also just good general programming practice.

However, these rules are just loose principles, guidelines that may not be of much immediate use on their own. So what practical examples are there of following them? Here are the ones I can think of (a short code sketch combining them follows the list):

  • Component reuse:
    • With Stax this is quite easy, since most objects can not be reused (there are no methods to reset things), and those that can be reused usually should be.
    • The main class of reusable Stax objects are the factories: XMLInputFactory and XMLOutputFactory. Although the Stax specification itself is not specific regarding thread safety (or lack thereof) of these factories, all existing open source implementations follow the same pattern: factories are not guaranteed to be thread-safe while they are being configured, but they are afterwards, when only the reader/writer-creating factory methods are used. So as long as you do the configuration from a single thread (possibly from a static initializer of a class, or the constructor of a singleton object) and complete it before calling the factory methods, you can safely keep reusing the factory to produce stream/event readers and writers from multiple non-synchronized threads.
    • There are 2 main reasons why reusing input and output factories improves performance:
      • Implementation discovery overhead when instantiating factory objects via the XMLXxxFactory.newInstance() methods. The dynamic way the Stax API finds out which implementation to use is very versatile -- and also very slow. It may take multiple milliseconds to introspect which jar contains the implementation, and for small documents this overhead is bigger than the time it takes to process the content itself! (For example, at a parsing speed of 10 MBps, parsing a 1k SOAP message takes about 0.1 milliseconds: contrast that to, say, a 5 millisecond overhead for factory instantiation.) So even if you could not reuse the factory itself, you could at the very least keep a reference to the actual factory class and call Class.newInstance() on it instead (which avoids the excessive overhead, as it is just a regular reflection call). It is worth noting that a similar overhead is incurred when using the JAXP or SAX APIs to dynamically construct implementation objects.
      • Per-factory caching: some caching schemes implementations use (specifically, the symbol table reuse and DTD caching Woodstox does) work on a per-factory basis. Access to these caches is properly synchronized and thread-safe, but they can not function efficiently unless readers and writers are produced by the same factory: new factory instances start with empty caches.
    • Factory instance reuse (as with most other optimizations) has the most significant relative effect when processing small documents. In such cases, properly reusing input and output factories can double overall throughput, or better.
  • Giving the processor information, but letting it do its job:
    • The most obvious example of this rule: if you have an InputStream through which the XML content can be read, do not try to help the stream reader by constructing a Reader for it (unless you have a specific functional need to do so). The input factory can likely find a more optimal Reader implementation on its own. It may also be able to reuse internal Reader buffers (especially when coupled with "close your streams", see below).
    • Similarly, in cases where you can pass just a reference (like a URL or a File), it is better to pass that reference instead of constructing a Reader, or even an InputStream, yourself. Although the basic Stax 1.0 API does not have such methods, some implementations have additional ones (Woodstox, for example, has its own "Stax2" set of extensions) that allow passing such references when constructing instances.
    • Giving the processor enough information, however, is important: for example, if you do know what the encoding of the content should be, it is a good idea to pass that to the factory method (if possible). The implementation can then use it if and as needed.
  • Closing what you have opened:
    • This is a simple common-sense rule, but it can also have a positive performance impact. For example, Woodstox can reuse some of its internal data structures if it knows when the caller is done with a stream/event reader: once a reader or writer is closed, it can not be used any further. Although it is sometimes possible for the implementation to know when active use ends (for example, when END_DOCUMENT is returned by a stream reader, no further information can be retrieved), it is better to explicitly declare the end of active use. This can be done by calling XMLStreamReader.close() (and similarly for event readers, and stream/event writers).
    • As with factory object reuse, the effect of internal data structure reuse by implementations is most significant for small documents: doubled throughput is possible for documents around 1 kB in size.
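
To tie the above together, here is a minimal sketch, using only the standard Stax 1.0 API, that follows all three rules: the factory is configured once and reused, the raw InputStream is handed to the factory as-is, and the reader and stream are closed when done (class and property choices are just for illustration):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class FastStaxParsing
    {
        // Rule 1: create and configure the factory once, then reuse it freely
        private static final XMLInputFactory INPUT_FACTORY;
        static {
            INPUT_FACTORY = XMLInputFactory.newInstance();
            INPUT_FACTORY.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        }

        public static int countElements(File file) throws Exception
        {
            // Rule 2: pass the raw InputStream (or, with Stax2, the File itself)
            // and let the implementation pick the optimal decoder
            InputStream in = new FileInputStream(file);
            XMLStreamReader sr = INPUT_FACTORY.createXMLStreamReader(in);
            int count = 0;
            try {
                while (sr.hasNext()) {
                    if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                        ++count;
                    }
                }
            } finally {
                // Rule 3: close what you opened, so buffers can be recycled
                sr.close();
                in.close();
            }
            return count;
        }
    }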

Given all of the above, how much does this really matter? For small documents (in the 1 - 2 kilobyte range), the difference in throughput between optimal usage and the slowest naive (but functionally correct) approach can be up to 5x (and even more for the degenerate case of calling XMLInputFactory.newInstance() for each instance created). So it is probably worth following these simple rules.

Sunday, June 11, 2006

Let's talk about Stax!

Although the third standard Java API for XML processing, Stax, was specified a while ago (the JSR-173 final release happened on 25 March 2004), not much has been written about it or its usage. There are some articles and tutorials, but generally they only touch the surface. They were also mostly written around the time the specification was finalized (or sometimes before), and as such do not cover the current state of the implementations. Much has changed since early 2004, in a positive way.

This is unfortunate, since this API (or in general, type of XML processing it defines, often called "pull parsing") offers significant benefits for many types of XML processing tasks.

The lack of articles, and of documentation in general, has many reasons. Among them:

  • After the initial interest in Stax, and the initial articles, problems were found with the reference implementation, and its development seemed stagnant. This may have left early adopters disillusioned, and perhaps also suggested that the possibilities of Stax implementations themselves are limited.
  • Developers who have interest in and use for Stax are generally more experienced, and reasonably quickly managed to solve the immediate problems they had (or abandoned the approach); either way, there is little need for a tutorial after one has had to dig deep into the code, or has lost interest in the API as a whole.
  • Low-level XML processing is often not needed at all: as long as there are higher-level processing systems (such as XSL for transformations, XMLBeans and JAXB for data binding, various SOAP libraries for SOAP processing), developers can build fully functioning systems without ever directly manipulating XML content. In this regard Stax is similar to SAX: both offer access to XML at the lowest possible abstraction level. The reason so little has been written, then, could simply be that there is no perceived need.

But I believe it would be very useful to have more, and more accessible, content regarding the Stax API itself, as well as the current state of and plans for the actively developed implementations. Regarding the issues listed above:

  • Since the reference implementation was released and open sourced, two new actively maintained implementations have been released, both of which surpass the functionality and quality of the reference implementation.
  • Even experienced developers would benefit from learning about some of the more subtle issues regarding Stax implementations. For example, even though the Stax API defines functionally how things should work, there are often multiple functionally equivalent ways of doing things, with varying efficiency. It is not necessarily clear, without further documentation, which are the "best practices". And since Stax processing is potentially the fastest way to process XML from Java, performance differences often matter more with Stax than with higher-level tools.
  • Although it is often not necessary to process XML at the lowest possible level, it is still useful to know how to do it when it is necessary. For example, there are currently few higher-level libraries or tools that can operate on documents whose size exceeds available memory. Since streaming processing (which both SAX and Stax can do) handles exactly this case, it is very useful to know about these approaches. And since both SAX and Stax have their own pros and cons (both at the API and the implementation level), general knowledge of Stax should prove useful for anyone dealing with XML on the Java platform.

So... It is all nice to talk about writing about Stax. But wouldn't it be better to actually write about the dang thing? As the author of Woodstox, I am in a good position to do my share by writing about a thing or two I know about Stax, Woodstox, performance tricks and tips, and all related things. So how about we "Talk About Stax"? ("... Let's talk about all the good things, And the bad things that may be...")

Stay tuned: I will start writing things I have meant to document for a long time, including but not limited to:

  • How to REALLY use the XMLStreamWriter ("Repairing WHAT mode?")
  • What matters with respect to speed: aka "How to make my XML processing code fly"
  • How do I validate documents?
  • What's new with Woodstox "experimental" Stax2 extension to basic Stax API.
  • How can I customize quoting of characters with XMLStreamWriter?
  • Is Stax really fast? How about binary encodings, like the Fast Infoset, is that really fast?


