Monday, March 16, 2009

Using Woodstox via SAX API

One of the less well-known features of Woodstox is that it implements the SAX API as well as the Stax API (and has since version 3.2). This is probably not widely known because Woodstox started as just a Stax implementation. But by now Woodstox has had enough time to mature as a SAX implementation, and it should be ready for serious use.

Why use SAX (instead of Stax)?

In my opinion, there are two main reasons to choose the SAX API instead of Stax:

  • Interoperability: maybe the tool you need (like, say, XSLT processor or Schema validator) only supports XML content via SAX API (often since it predates Stax API).
  • Chainability/pipelining: the event/callback-based approach of SAX is a natural fit for processing pipelines; more so than Stax pull parsing. So although Stax is arguably more convenient to use if your application is in full and complete control of XML processing, it is less convenient if processing has to be done in a pipelined fashion.
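To make the pipelining point a bit more concrete, here is a minimal sketch of a SAX pipeline stage built on the standard org.xml.sax.helpers.XMLFilterImpl. The class names (DropDebugFilter, PipelineDemo) and the "debug" element are made up for illustration, and the sketch uses whatever SAX parser JAXP hands out by default rather than anything Woodstox-specific.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical pipeline stage: drops <debug> elements and everything inside them.
class DropDebugFilter extends XMLFilterImpl {
    private int depth = 0; // > 0 while inside a dropped element

    DropDebugFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || "debug".equals(local)) {
            depth++; // swallow the event instead of passing it downstream
            return;
        }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}

class PipelineDemo {
    // Parses the XML through the filter and returns the element names the
    // final stage actually saw.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<String>();
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true); // needed so local names are reported
            XMLReader parser = spf.newSAXParser().getXMLReader();
            XMLReader pipeline = new DropDebugFilter(parser);
            pipeline.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    names.add(local);
                }
            });
            pipeline.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<root><debug><x/></debug><item/></root>"));
    }
}
```

Further stages can be stacked the same way, by wrapping one filter in another; each stage only sees the events the previous one chose to forward.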

I have used SAX for XSLT processing: while some processors like Saxon do offer Stax support, it is often still labelled as experimental, and not considered quite equal with SAX as the input source.
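As a rough illustration of feeding an XSLT processor through SAX, the sketch below wraps an XMLReader in a javax.xml.transform.sax.SAXSource and hands it to whatever Transformer the JDK provides. The class name, the stylesheet and the sample document are all invented for the example; with Woodstox on the classpath, the only change should be which SAXParserFactory gets used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxXsltDemo {
    // Made-up stylesheet: outputs the text of /greeting/to as plain text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='/greeting/to'/></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true); // XSLT needs namespace-aware input
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // The SAX parser is the input side of the transformation.
            SAXSource input = new SAXSource(reader, new InputSource(new StringReader(xml)));
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));

            StringWriter out = new StringWriter();
            t.transform(input, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<greeting><to>World</to></greeting>"));
    }
}
```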

Why use Woodstox as the SAX parser?

But even if you choose to use SAX API, isn't there already another mature full-featured Java XML parser available? So why use Woodstox over the incumbent SAX candidate?

As the author of Woodstox I may be biased here, but I think there are a couple of things that favor "Woodstox the SAX Parser":

  • Features: Woodstox does offer some configurability other processors do not, such as the ability to process XML content in "fragment" or "multi-document" mode.
  • Good error reporting: as unpleasant as it is to get errors reported to you, at least Woodstox tries to do that discreetly, promptly and accurately. In fact, a lot of effort has gone into keeping all information (like location, exact cause of problem) accurate and available. One area where this is obvious is handling of DTD problems.
  • Performance: Woodstox has been consistently tested to be the fastest Open Source Java XML parser, independent of API used for parsing (Stax or SAX).

How to use Woodstox as a SAX parser?

As with other SAX implementations, there are multiple ways to construct a Woodstox-flavor SAX parser instance, but probably the most commonly used way nowadays is to use the JDK-bundled (since 1.4) Java API for XML Processing, JAXP (javax.xml.parsers.*). The Woodstox class com.ctc.wstx.sax.WstxSAXParserFactory implements javax.xml.parsers.SAXParserFactory. And to construct an instance, there are two main possibilities:

  1. construct it directly, or
  2. set appropriate system property (named not-so-surprisingly, "javax.xml.parsers.SAXParserFactory") to point to implementation class, and then call SAXParserFactory.newInstance()

(for more details, check out Javadocs for SAXParserFactory).

So, for example:

  SAXParserFactory spf = new WstxSAXParserFactory();

after which you construct a parser (content handler) as usual, doing something like:

  spf.setNamespaceAware(true); // yes, better enable namespaces
  SAXParser sp = spf.newSAXParser();
  MyHandler h = new MyHandler();
  sp.parse(new File("data.xml"), h);

And that's about it. If you know how to use Xerces/SAX, you know how to use Woodstox/SAX. And if you do try it, please give feedback on the Woodstox user list.
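The snippets above leave the handler itself undefined, so here is a small self-contained sketch of the whole round trip. CountingHandler and SaxDemo are hypothetical names; the handler just counts elements and collects text. The sketch goes through SAXParserFactory.newInstance() (the second option listed earlier), with the system property line left commented out since it assumes the Woodstox jar is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts start-element events and collects character data.
class CountingHandler extends DefaultHandler {
    int elementCount = 0;
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elementCount++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class SaxDemo {
    static CountingHandler parse(String xml) {
        try {
            // Assumption: with the Woodstox jar on the classpath, uncommenting
            // this selects it as the JAXP SAX implementation.
            // System.setProperty("javax.xml.parsers.SAXParserFactory",
            //         "com.ctc.wstx.sax.WstxSAXParserFactory");
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            SAXParser sp = spf.newSAXParser();
            CountingHandler h = new CountingHandler();
            sp.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), h);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        CountingHandler h = parse("<root><a>x</a><b>y</b></root>");
        System.out.println(h.elementCount + " elements, text=" + h.text);
    }
}
```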

More to Come!

I should have more to write, regarding the performance aspects of using Woodstox: as part of "StaxBind" performance benchmark test suite, I have written and run a few performance tests to measure XSLT performance (both with Xalan and Saxon, 2 main Java XSLT processors, both of which work just fine with Woodstox, Xerces and Aalto).
Results look promising (if not earth-shattering) for Woodstox. But I still need to spend a bit more time in ensuring that tests are fair, and results readable, so they will need to be part of a later entry. Stay tuned!

Thursday, March 12, 2009

StaxMate 2.0: another Big Leap towards convenient, efficient xml processing

StaxMate 2.0.0 is now out, to augment your favorite Stax XML parser (like Woodstox or Aalto).
(for introduction to StaxMate, check out this tutorial).

Improvements for this release are focused in 3 main areas:

  • Convenient AND efficient access to typed content. With a little bit of help from a new version of Stax2 extension API (version 3.0), it is now possible to efficiently read and write values of numeric (int, long, double), boolean and enumerated (Java enums) types.
    In future, more methods will be added to allow similar access to numeric arrays and base64-encoded binary content.
  • More convenience methods:
    • SMInputCursor.advance() to allow chaining, for example: int value = cursor.childElementCursor().advance().getElemIntValue()
    • SMInputCursor.asEvent() to construct XMLEvents for the current event.
    • Ability to pre-declare namespaces on output (using SMOutputElement.predeclareNamespace()) to minimize number of namespace declarations (not usually needed, sometimes is)
    • SMOutputElement.addElementWithCharacters() for a convenient short-cut.
  • Interoperability improvements:
    • DOMBuilder for building DOM documents out of XMLStreamReaders, and serializing DOM documents and elements using XMLStreamWriter.
    • StaxMate jar is now a fully functioning OSGi bundle
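For contrast with the chained one-liner in the list above, here is roughly what reading a single int-valued child element looks like with nothing but the standard Stax API. This is only a sketch of the "before" picture: the class and method names are made up, it assumes the root's first child holds the number, and it does not use StaxMate or Stax2 at all.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

class PlainStaxTypedDemo {
    // Returns the int value of the first child element of the root, using
    // only plain Stax calls plus manual String-to-int conversion.
    static int firstChildInt(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            r.nextTag();                      // move to the root start element
            r.nextTag();                      // move to the first child start element
            String text = r.getElementText(); // collect its text content
            r.close();
            return Integer.parseInt(text.trim());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChildInt("<root><value> 42 </value></root>"));
    }
}
```

The intermediate String and the hand-rolled parsing are the kind of boilerplate that typed access is meant to take off your hands.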

One reason for the major version bump is that this version requires an implementation of Stax2 version 3.0, natively implemented by Woodstox and Aalto, and emulated for others (like Sjsxp) using the Stax2 reference implementation. This upgrade opens up a wider range of functionality for the future, and a similar upgrade should not be needed in the near future.

Saturday, March 07, 2009

Costs and Benefits of Unit Testing

Unit testing is something that brings most value for well-modularized, reusable (and most importantly, actually reused!) components, such as libraries. Say, xml or json parsers or data binders. :-)
And less (IMO) for higher-level non-reusable things such as business applications and their user interfaces.

Because of this, I have grown to do much more extensive unit testing with Woodstox and Jackson over time, and I like the results. In fact, unit testing is one of 2 primary reasons for the implementation quality of these packages (the other is having an active user base that finds and reports problems). So whereas initially I mainly just added unit tests to verify bug fixes (and guard against regression) -- and this I still rigorously do -- nowadays I verify new features as well. For example, the Woodstox 4.0 Typed Access API implementation has an extensive unit test suite.

1. Benefits

This level of unit testing has produced many benefits: consider for example the fact that so far no bugs have been reported by users against the Typed Access API implementation (to be fair, its usage has probably been limited so far as well). All bugs reported against Woodstox 4.0 so far have occurred in non-tested (like Maven deployment) or lightly-tested (W3C Schema validation) areas.
And with Jackson, most issues reported concern missing features, or non-intuitive behavior ("broken as designed"): very few are related to actual implementation flaws.

2. Where's your beef?

So if life is so good, what's with the pickle face? Just this: unit tests don't come free. In fact, for the most part I now seem to spend more time writing unit tests than writing code. And especially so for bug fixes. While this is not based on strict measurements (it would be useful to quantify it of course; but yet another time drain), I have checked out time usage occasionally during development -- when you only have an hour to use on a project on a day, it is good to know where time goes. In general amount of effort required for unit testing is not really linear to amount of code written, but rather amount of code for the feature tested. That explains why bug fixes (which often touch a rather small portion of code) are more test-writing intensive tasks, in cases where no unit tests covered the functionality. I guess you can think of it as paying for test code after-the-fact, when needed (i.e. when a user reports an issue).

3. Is it worth it?

So far I am still ok with the costs, even if speed of progress is probably cut in half (compared to the "devil-may-care" approach). Although I could just take a chance with bug fixes (and in most cases that would actually work ok, based on past experience) and save my own time, that would come at cost for users who would essentially beta-test the product and report (if not completely fed up) bugs. This initial lower quality would be detrimental to perception of project, probably reducing or slowing adoption and indirectly then reducing actual testing.

But another important aspect is that in the longer term a unit test suite becomes an indispensable tool supporting refactoring efforts: you can actually do drastic internal spring cleaning of the codebase, with reasonable confidence that if your changes actually do break something, you will know it via unit test failures. And this is probably the biggest overall benefit for long-lived projects: it helps code stay alive, as no part of code becomes "untouchable".

4. How about "bread-n-butter" code?

One more interesting thing to consider is this: while I maintain that usually unit testing is of less value for your typical day-in day-out corporate code -- at least in the short term -- perhaps it may still be valuable in the long term. That is, if (and only if!) it is expected that the system in question should become a long-living thing, a strong test suite should be built over time: especially to avoid brittleness that comes with lack of automated testing. And conversely, shorter-lived things just do not warrant big investment for extensive test suite. Less unit testing will be enough, with perhaps added external (QA) testing.

The real trick is of course predicting which kind of a life-cycle a system in development will have. And that is actually a hard problem. :-)

Tuesday, March 03, 2009

Saving Software Developers from Franken-JSON

1. The Setup

When it comes to using JSON for communication, developers often get bait-and-switched by existing frameworks, most of which are designed to use xml as the underlying data interchange format. Spiel usually goes along these lines:

  • Want to use JSON?
  • We can help with that!
  • Just need to use a little thing called convention
  • which allows us to use our existing system as is
  • and things will work just fine
  • don't you worry little coder

Well, except that the "little thing" turns out to be this ugly contraption that mangles clean Json into something that is technically still Json content, but just barely so: in reality it's just a bit of spit and polish to mask underlying xml structure. Sort of like those human masks on lizards on that ancient TV show ("V"). And such systems likewise require similar bastardization as input.

For one example of such convention, consider Badgerfish (which is not the only choice; there are other variants such as "mapped").
Resulting Json content is generally along the lines of "only-mother-could-love" ugliness: for example let's consider how a simple Bean would get transformed to "JSON" using this convention.
Given a bean like:

class Person {
  private int age;
  private String name;
  public int getAge() { return age; }
  public String getName() { return name; }
}

we would get xml like:

<person>
  <age>37</age>
  <name>Joe Blow</name>
</person>

which would be ok as xml, but would then be mapped to "Json" that looks something like:

{
  "person" : {
    "age" : {
      "$" : "37"
    },
    "name" : {
      "$" : "Joe Blow"
    }
  }
}


Or, with "mapped" variant, xml could translate to this:

{
  "person" : {
    "age" : 37,
    "name" : "Joe Blow"
  }
}

(if you are lucky -- number may or may not look like one; and String that looks too much like a number may become one... after all, XML content usually has no such type information)

instead of what you might expect:

{
  "age" : 37,
  "name" : "Joe Blow"
}

Like it? Me neither. And it gets even worse if you plan to use things like Lists, and some combinations are just illegal (Lists of Lists).
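To show where the "$" wrappers come from, here is a deliberately simplified sketch of a Badgerfish-style conversion using only the JDK's DOM parser. It is not the actual Badgerfish implementation: the class name is invented, and attributes, namespaces, mixed content and repeated sibling elements (which the real convention turns into arrays) are all ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BadgerfishSketch {
    // Converts a small XML document into Badgerfish-like JSON text.
    static String convert(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return "{" + member(root) + "}";
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Emits "name":{...} for one element. Leaf text ends up under "$", and
    // always as a JSON String, because XML carries no type information.
    private static String member(Element e) {
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(e.getTagName()).append("\":{");
        boolean hasChildElements = false;
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                if (hasChildElements) {
                    sb.append(',');
                }
                sb.append(member((Element) n));
                hasChildElements = true;
            }
        }
        if (!hasChildElements) {
            sb.append("\"$\":\"").append(escape(e.getTextContent())).append('"');
        }
        return sb.append('}').toString();
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(convert("<person><age>37</age><name>Joe Blow</name></person>"));
    }
}
```

Note how the age comes out as the String "37" rather than a number: the converter only ever sees XML text, so the bean's int is long gone by the time JSON is written.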

2. Why oh why?

Now: there are legitimate reasons to use such mapping conventions: sometimes one has to convert things into other uglier things for interoperability reasons. Legacy systems may be practically unmodifiable, and require certain kinds of butt ugly inputs. There are also plenty of XML-based systems, frameworks and protocols in existence.
So it is natural to consider easy ways to make use of this existing infrastructure. And finally, most JSON tools up until fairly recently seem to have been made by Santa's elves, and professional-grade tools are somewhat of a recent development.

But the fundamental problem is this: if you need and/or want to use XML, use it. And if you do use XML, what purpose would JSON serve in the mix? It's like connecting a Porsche engine to a Yugo transmission system, all enclosed in a Ford Pinto chassis. Or, brains of mr. Ab Normal within poor old Frankenstein's skull.
And even within Javascript execution environment, DOM-based xml trees have been available almost since dawn of the Interweb -- you can already use XML easily. So to have some actual benefits, JSON should be accessed natively, au naturel. Not as a round peg slammed through the square hole.

3. Solution

But why all these ugly workarounds? Perhaps framework builders do not see much need for JSON, and just want to be buzzword compliant? Or perhaps they have been turned off by earlier crop of toy JSON tools.
Whatever the underlying reason is, most seem to just offer the backhanded convention-ridden route.

But here is some good news. Jackson project can help JSON-craving developers; now with a Content conversion provider that is designed to work with any compliant JAX-RS implementation: JacksonJsonProvider is the thing you need to hook up to your JAX-RS implementation (tested with Jersey, as well as RESTEasy). You can download the jar here (look for "jax-rs" jar; uses core and mapper jars).

So what is this mysterious provider? It is the thing that JAX-RS needs to know how to convert from Json to POJOs and back. And as long as you hook this up as a resource you only need to indicate input and output as being Json (with @Consumes and @Produces resource annotations, or by client provided MIME headers that indicate the same) and it will all actually work out, and quite as you would expect it to.

The only remaining detail is that of hooking things up. Here's one way to do that, using a simple JAX-RS Application subclass:

import java.util.*;
import javax.ws.rs.core.Application;
import org.codehaus.jackson.jaxrs.JacksonJsonProvider;

public final class RestApplication extends Application {
  // Best to register as a singleton to allow reuse of ObjectMapper:
  @Override
  public Set<Object> getSingletons() {
    HashSet<Object> singletons = new HashSet<Object>();
    singletons.add(new JacksonJsonProvider());
    return singletons;
  }
}

For a production servlet, you would also want to register the actual resources; but just to get the provider hooked up this should be enough.

4. Disclaimer

The provider class is part of the just-released Jackson 0.9.9. It has been tested only lightly, so there may be rough edges.

Code responsibly!

Sunday, March 01, 2009

Dubya: Hands Off my Martini!

No, not that "W". The other one, the hotel chain.

I recently had the opportunity to attend a free (as in, paid for by entity other than myself or my family) evening event at the posh W hotel in Seattle. My previous (and only) experience with said establishment was a brief stay 4 years ago, sponsored by a large on-line retailer. That was a good experience resulting in 2 good things, one being my employment at the sponsoring company. The room was positively futuristic, ambiance good: all in all a nice stay.

So I had reasonably high expectations regarding libations to be offered at this company-sponsored event.

Unfortunately, execution was solidly sub-standard (unless there is a city ordinance mandating that company parties consist of bland food and improperly prepared drinks? I may have missed a memo). Food was slightly below standard conference grub, which I find lacking for an evening event, especially for a hotel chain of this caliber. But I can swallow that, with proper liquid accompaniment.

Which takes us to the real pain point: they killed my favorite drink, Martini. Not in a way of, say, stabbing it, or by brutal neglect, but by drugging and manhandling it; poisoned and shaken to death. I'd call Booze Protective Services if such an agency existed (maybe in Vegas?).

The first offense was using vodka instead of gin, without asking me. There may be situations where this substitution needs to be done: who knows, maybe the desert island you were marooned on just has no gin to be found. But before committing this faux-pas, please be so kind as to ask me first. I would probably rather enjoy one of many vodka-based drinks, some of which can compete favorably with gin-based ones.

The second one, the more serious of mistakes (and I argue worth "2 strikes", enough for 3 strikes + out verdict) was the brutal handling of the resulting combination. I mean, loading vermouth, vodka and ice in a shaker is one thing; but vigorously shaking it is akin to shaking your baby, causing brain damage. What the poor shot of Grey Goose vodka had done we may never know, but punishment can't fit the crime. I felt like grabbing the shaker from perp when this was happening. It may be that you can't shake vodka to spoil it -- perhaps that's the excuse -- but it would be pathetic if a professional bartender thought James Bond quip about "shaking not stirring" has relevance to martini making. No, it's a joke. Subtle one, but one nonetheless.

So, from now on: W, back off, slowly, and hand me that shaker. And hand me your cocktail-concocting badge, and leave martini-making to others. Call me a snob, but there are certain things professionals are not to do. What next? Put ice in red wine glasses?

ps. I recently learnt that experts recommend Plymouth as the gin for martini, followed by Noilly Prat for vermouth. After tasting these, I concur. It does taste like the ultimate combination. And this although I thought I liked Bombay Sapphire the best. You live and learn -- while others will do too (martini is a somewhat robust drink after all), you might as well try the best; price for these is no higher than for most quality alternatives. And for locals, yes, Washington State Liquor Stores carry both.

