Friday, August 11, 2006

Introducing StaxMate -- the perfect companion for your favorite Stax XML processor

Now that Woodstox 3.0.0 is released (see one of the recent entries here at CowTownCoder), it is a good time to introduce another, even less widely known utility: StaxMate. Although it has been in development for quite a while, and has even been used by its author for almost a year, it has remained largely unnoticed, hopefully only due to its lack of documentation.

So what is StaxMate and why should I care?

If you are perfectly happy using the raw Stax API, you probably do not need StaxMate. But if you have ever felt that using the plain vanilla Stax API (especially the cursor API) is... well, an acquired taste, or at least a bit inconvenient, you may want to have a look at StaxMate.

The raison d'etre of StaxMate is to add a bit of "syntactic" (or should I say synthetic?) sugar and a tad of cream, but with only a moderate amount of extra calories. That is, the overhead introduced should be nominal (less than that of using the Event API), so that you still get fast Stax-based streaming processing, just in a more convenient fashion: by accessing XML the way it is structured (hierarchically), and by focusing on the things you actually care about. After all, one usually does not care whether there are comments within elements or, for element-only content, whether there is white space in there. For example, given the following XML document:

<doc><!-- title follows -->
 <title>the title</title>
 <body>
  <abstract>Hi mom!</abstract>
 </body>
</doc>

The basic Stax cursor API would feed you an event sequence like:

   START_DOCUMENT
   START_ELEMENT (doc)
   COMMENT
   CHARACTERS (white space)
   START_ELEMENT (title)
   CHARACTERS (the title)
   END_ELEMENT
   ... (and so on)
   END_DOCUMENT
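
For the curious, a sequence like the one above can be reproduced with nothing but the cursor API that ships with the JDK (javax.xml.stream). This is only a minimal sketch: the class name RawEventDump and its helper methods are made up for this illustration, and coalescing is switched on (my choice, not a requirement) so that adjacent text arrives as a single CHARACTERS event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: lists the raw cursor events for a document.
class RawEventDump {
    static final String DOC =
        "<doc><!-- title follows -->\n"
        + " <title>the title</title>\n"
        + " <body>\n"
        + "  <abstract>Hi mom!</abstract>\n"
        + " </body>\n"
        + "</doc>";

    // Returns one description per event, in document order.
    static List<String> events(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption for the demo: merge adjacent text segments into one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        out.add(describe(reader)); // a fresh reader is positioned on START_DOCUMENT
        while (reader.hasNext()) {
            reader.next();
            out.add(describe(reader));
        }
        reader.close();
        return out;
    }

    // Renders the reader's current event in the notation used above.
    static String describe(XMLStreamReader reader) {
        switch (reader.getEventType()) {
            case XMLStreamConstants.START_DOCUMENT:
                return "START_DOCUMENT";
            case XMLStreamConstants.END_DOCUMENT:
                return "END_DOCUMENT";
            case XMLStreamConstants.START_ELEMENT:
                return "START_ELEMENT (" + reader.getLocalName() + ")";
            case XMLStreamConstants.END_ELEMENT:
                return "END_ELEMENT";
            case XMLStreamConstants.COMMENT:
                return "COMMENT";
            case XMLStreamConstants.CHARACTERS:
                return reader.isWhiteSpace()
                    ? "CHARACTERS (white space)"
                    : "CHARACTERS (" + reader.getText() + ")";
            default:
                return "OTHER (" + reader.getEventType() + ")";
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String event : events(DOC)) {
            System.out.println(event);
        }
    }
}
```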

Now, while this is 100% accurate, it is also a nuisance to sift through all these events if one just wants to know the textual content of the element 'abstract'. Why should I need to keep track of start and end elements, check for CHARACTERS events that contain only white space, or skip comments? The processor already has all the information, so why do I have to write all the monkey code for traversing and skipping sub-trees, and such? Shouldn't things be easier and "just work"?
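
To make that bookkeeping concrete, here is a rough sketch of what pulling out the text of 'abstract' can look like with only the JDK's javax.xml.stream cursor API. The class and method names (ManualTextExtraction, textOf, collectText) are invented for this example; it shows one way to do it, not the only one.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: hand-written traversal with the raw cursor API.
class ManualTextExtraction {
    static final String DOC =
        "<doc><!-- title follows -->\n"
        + " <title>the title</title>\n"
        + " <body>\n"
        + "  <abstract>Hi mom!</abstract>\n"
        + " </body>\n"
        + "</doc>";

    // Returns all text inside the first element with the given local name,
    // or null if there is no such element.
    static String textOf(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    return collectText(reader);
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    // Expects the reader on a START_ELEMENT; consumes events up to and
    // including the matching END_ELEMENT, gathering text along the way.
    static String collectText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1; // how many elements are currently open
        while (depth > 0) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    text.append(reader.getText());
                    break;
                default:
                    // comments, processing instructions etc. are skipped
                    break;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(textOf(DOC, "abstract"));
    }
}
```

Note how much of this is depth counting and event filtering rather than anything specific to the document at hand.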

I think they should. With StaxMate, you can think in terms of cursors that ignore all events except for the ones you care about: typically you want to see only elements (for non-mixed content), or elements and text. Further, when encountering an element, you may just want to get all the contained text, independent of any other XML events that may lurk in there (comments, processing instructions, unknown child elements). And finally, if you don't care about an element and the sub-tree it contains (optional elements in your content model, for example), you can just ignore it by advancing the cursor. StaxMate can keep track of all the details for you. There are many more advanced features StaxMate can offer on the reader side (building a partial tree of the current element's parents and/or previous siblings, for example), but the main point is really the things that make simple content processing tasks, well, simple.
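
To give a feel for the cursor idea without showing StaxMate's own classes, here is a hand-rolled sketch on top of plain javax.xml.stream. Everything named here (CursorSketch, ChildElementCursor, advance, children, collectText) is hypothetical and written just for this illustration; it is not StaxMate's API, and it cuts corners (a nested cursor has to be run to completion before the outer one is advanced again).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch of an element-only cursor; NOT StaxMate's API.
class CursorSketch {
    static final String DOC =
        "<doc><!-- title follows -->\n <title>the title</title>\n <body>\n"
        + "  <abstract>Hi mom!</abstract>\n </body>\n</doc>";

    /** Iterates over the child elements of one parent, hiding other events. */
    static final class ChildElementCursor {
        private final XMLStreamReader reader;
        private boolean childOpen = false; // on a child whose sub-tree is unread
        private boolean done = false;

        /** The reader must be positioned on the parent's START_ELEMENT. */
        ChildElementCursor(XMLStreamReader reader) { this.reader = reader; }

        /** Moves to the next child element; false once the parent is closed. */
        boolean advance() throws XMLStreamException {
            if (done) return false;
            if (childOpen) consume(null); // skip the unread sub-tree
            while (true) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    childOpen = true;
                    return true;
                }
                if (event == XMLStreamConstants.END_ELEMENT) {
                    done = true;
                    return false;
                }
                // comments, white space and other events are ignored
            }
        }

        String name() { return reader.getLocalName(); }

        /** Cursor over the current child's children; run it to completion. */
        ChildElementCursor children() {
            childOpen = false; // the nested cursor will consume the sub-tree
            return new ChildElementCursor(reader);
        }

        /** Returns all text in the current child's sub-tree, consuming it. */
        String collectText() throws XMLStreamException {
            StringBuilder text = new StringBuilder();
            consume(text);
            return text.toString();
        }

        // Reads up to the current child's END_ELEMENT, optionally keeping text.
        private void consume(StringBuilder text) throws XMLStreamException {
            int depth = 1;
            while (depth > 0) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) depth++;
                else if (event == XMLStreamConstants.END_ELEMENT) depth--;
                else if (text != null && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE)) {
                    text.append(reader.getText());
                }
            }
            childOpen = false;
        }
    }

    /** Finds doc/body/abstract and returns its text, or null if absent. */
    static String abstractText(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.next() != XMLStreamConstants.START_ELEMENT) { } // find root
        String result = null;
        ChildElementCursor top = new ChildElementCursor(reader);
        while (top.advance()) {
            if (!"body".equals(top.name())) continue; // e.g. 'title' is skipped
            ChildElementCursor inner = top.children();
            while (inner.advance()) {
                if ("abstract".equals(inner.name())) result = inner.collectText();
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(abstractText(DOC));
    }
}
```

The calling code in abstractText never mentions comments, white space, or end tags; that is the kind of convenience the paragraph above is describing.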

Similarly, on the output side, you can free yourself from namespace binding problems, as well as from having to keep track of how many end tags are needed. Output objects can keep track of what is needed and where, based on the things you do want to add. The output side can also do simple heuristic-based indentation. And finally, for cases where document-order output just is not good enough, you can do some limited out-of-order output (for example, adding attributes to the parent element after adding child elements, or adding a place-holder "dummy element" under which you can add other elements). In this case StaxMate can temporarily buffer your output for you, to be released once you are done adding output.

If all of the above sounds interesting (even if vague), I will try to write a simple sample web service using StaxMate for my next blog entry. Stay tuned!

About me

  • I am known as Cowtowncoder