StaxMate basics, reader side

Before embarking on the journey of building a simple web application (we'll get there), it is necessary to explain the constructs we will use, so that the example itself only needs comments regarding actual functionality. So, here are the typical usage patterns on the reader side.

1. Getting started

First things first: since StaxMate is built on top of the Stax API, you need to create a properly configured XMLStreamReader. StaxMate is quite adaptive, and the Stax defaults are usually sufficient, so you can typically just use something like:

  XMLInputFactory f = XMLInputFactory.newInstance(); // remember to reuse
  XMLStreamReader sr = f.createXMLStreamReader(new FileInputStream("mydoc.xml"));
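
For a variant you can run as-is, here is a minimal sketch that parses an in-memory document using nothing but the JDK's Stax implementation. The class and method names (ReaderSetup, rootName) are made up for this illustration, and wrapping the checked exception is just one possible choice.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ReaderSetup {
    // Creating a factory is comparatively costly, so keep one instance and reuse it.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    /** Returns the local name of the root element (hypothetical helper). */
    static String rootName(String xml) {
        try {
            XMLStreamReader sr = FACTORY.createXMLStreamReader(new StringReader(xml));
            try {
                while (sr.hasNext()) {
                    // Skip prolog comments and the like until the first start tag.
                    if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                        return sr.getLocalName();
                    }
                }
                return null; // not reached for a well-formed document
            } finally {
                sr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<?xml version='1.0'?><!-- prolog --><root id='r1'/>"));
    }
}
```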

2. Start iterating

All access to the input document is handled using cursors. Cursors come in two basic types: hierarchic cursors (also known as "child cursors", as they traverse only immediate children) and flattening cursors (or descendant cursors, as they traverse all descendants: children, grandchildren and so on). In addition to the traversal type, the second important property of a cursor is the type of events it filters out: the SMFilter interface can be implemented to specify which underlying events to pass and which to filter out. Most of the time the default ones (element-only filter, text filter, mixed filter) are good enough, and there are convenience methods for constructing other types you are likely to need.
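
To make the difference between the two traversal types concrete, here is a rough sketch in plain Stax (not StaxMate code, and not how StaxMate implements it) that lists element names below the root, either children only or all descendants. The class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TraversalKinds {
    /**
     * Collects names of elements below the root. With childrenOnly=true this
     * mimics a hierarchic (child) traversal, otherwise a flattening one.
     */
    static List<String> elementNames(String xml, boolean childrenOnly) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0; // number of currently open elements
            while (sr.hasNext()) {
                int ev = sr.next();
                if (ev == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    // depth 1 is the root, depth 2 its children, deeper levels other descendants
                    if (depth == 2 || (depth > 2 && !childrenOnly)) {
                        names.add(sr.getLocalName());
                    }
                } else if (ev == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
                // everything else (comments, text) is filtered out: "element-only"
            }
            sr.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<root><a><b/></a><!-- note --><c>text</c></root>";
        System.out.println(elementNames(xml, true));
        System.out.println(elementNames(xml, false));
    }
}
```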

At this point, we need to obtain the first cursor. Usually it is a hierarchic element-only cursor, since we usually do not care about possible comments or processing instructions outside of the root element, but we do want to handle things hierarchically (level by level). So, most often you will get the root cursor by doing:

  SMInputCursor rootCrsr =  SMInputFactory.rootElementCursor(sr);

So what does the cursor point to right after it has been created? Nothing, as of yet: similar to JDBC result set cursors, StaxMate cursors need to be advanced to point to the next applicable event (if any). Advancing returns the type of the event if there was one (a root cursor will always find a root element in any well-formed document), or null otherwise. The return type is a type-safe enumeration (StaxMate requires Java 5, aka 1.5). So, typically you will see something like this:

  rootCrsr.getNext(); // for element cursor, return value will be SMEvent.START_ELEMENT

after which you can check that the element is what you think it should be (and/or do other validation):

  assert rootCrsr.getQName().equals(new QName("root"));

and perhaps access an attribute or two:

  String id = rootCrsr.getAttrValue("id"); // convenience method for attrs without namespace

and when you are ready to inspect the sub-tree starting from root, you do:

  SMInputCursor childCrsr = rootCrsr.childElementCursor();

3. Collect text

One common thing to do is to access the textual content of a leaf element. Although the Stax XMLStreamReader does have a 'getElementText' method, it is a bit tricky to use, and will not work for mixed content (if there are child elements). Further, you still need to skip the end element after getting the text.

With StaxMate, you just do (assuming crsr points to a SMEvent.START_ELEMENT):

  String value = crsr.collectDescendantText();

and you get all the text the element contains, recursively if necessary, with all non-text content stripped out.
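
For comparison, this is roughly what the same job takes with plain Stax. It is only a sketch of the idea under my own assumptions (the TextCollector class and its methods are invented here), not StaxMate's actual implementation.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextCollector {
    /**
     * Expects sr to point at a START_ELEMENT; returns all text below it and
     * leaves sr at the matching END_ELEMENT.
     */
    static String collectText(XMLStreamReader sr) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        int depth = 1; // the element we start from is open
        while (depth > 0) {
            switch (sr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    sb.append(sr.getText()); // text may arrive in several chunks
                    break;
                default:
                    break; // comments, processing instructions: stripped
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper for the example: text content of the root element. */
    static String textOfRoot(String xml) {
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            sr.nextTag(); // move to the root start tag
            String text = collectText(sr);
            sr.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOfRoot("<a>Hello <b>big</b><!-- c --> world</a>"));
    }
}
```

Note the depth counter: that is exactly the kind of end-tag book-keeping the cursor does for you.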

4. Share (and enjoy!) the cursor

One of the most mundane chores with Stax is the book-keeping needed for end tag balancing, especially if you want to modularize your code. In that case the called code has to be careful to match and skip all end tags for the start tags it has handled. This is tedious and error-prone, and with enough of it the code becomes harder to read than necessary. It also makes it very easy for called code to wreak havoc by iterating over events it is not supposed to read. This is because there is just one XMLStreamReader.

Here StaxMate can help, not only because you never need to deal with end tags directly (when the underlying stream hits an end tag, the cursor knows it can't advance, and this is signalled by returning null -- plus, it is still safe to call getNext() again; you will just get another null), but also because all cursors are scoped so that they can only traverse events within their scope. That is, a child or descendant cursor constructed for a cursor that points to, say, start tag <tag> can only traverse events up to the </tag> that matches the start tag. And finally, even if the child cursor does NOT traverse through all the events (the called code gets bored, or found what it was looking for), the parent cursor knows how to automatically skip the "uninteresting" events in between. That is, cursors are kept in sync.

So, quite often you will see method calls like:

  handleHeadSection(crsr.childElementCursor());
  crsr.getNext();
  assertElement(crsr, "body");
  handleBodySection(crsr.childElementCursor());
  crsr.getNext();
  assertElement(crsr, "trailer");
  handleTrailer(crsr.childElementCursor());
  // ...

in which different handlers take care of different parts of the document, and none of them has to keep track of anything beyond its immediate needs.
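
If you are curious how scoping and syncing can work over a single shared reader, here is a toy version in plain Stax. It is strictly a sketch under my own assumptions: the Source and ChildCursor classes are invented for this example and are much simpler than the real StaxMate cursors. The trick is that all cursors share one depth counter, and each cursor remembers the depth of the element it was created for.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ScopedCursorSketch {
    /** Shared state: the one underlying reader plus the number of open elements. */
    static final class Source {
        final XMLStreamReader sr;
        int depth = 0;
        Source(XMLStreamReader sr) { this.sr = sr; }

        int advance() throws XMLStreamException {
            int ev = sr.next();
            if (ev == XMLStreamConstants.START_ELEMENT) depth++;
            else if (ev == XMLStreamConstants.END_ELEMENT) depth--;
            return ev;
        }
    }

    /** Sees only immediate child elements of the element that was open at creation. */
    static final class ChildCursor {
        private final Source src;
        private final int parentDepth;
        private boolean done = false;

        ChildCursor(Source src) { this.src = src; this.parentDepth = src.depth; }

        /** Next child's local name, or null (repeatedly) once the scope is exhausted. */
        String next() throws XMLStreamException {
            while (!done) {
                int ev = src.advance();
                if (ev == XMLStreamConstants.START_ELEMENT && src.depth == parentDepth + 1) {
                    return src.sr.getLocalName();
                }
                // Parent's end tag (or end of document for a root cursor) closes the scope.
                if ((ev == XMLStreamConstants.END_ELEMENT && src.depth < parentDepth)
                        || ev == XMLStreamConstants.END_DOCUMENT) {
                    done = true;
                }
                // Anything deeper was left unread by a child cursor: skip it silently.
            }
            return null;
        }
    }

    /** For each section under the root, a "bored" handler reads only the first child. */
    static List<String> sections(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            Source src = new Source(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml)));
            new ChildCursor(src).next(); // advance to the root element
            ChildCursor sectionCrsr = new ChildCursor(src);
            for (String name = sectionCrsr.next(); name != null; name = sectionCrsr.next()) {
                String first = new ChildCursor(src).next(); // handler stops after one child
                seen.add(name + "/" + first);
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(sections(
            "<doc><head><title>t</title><meta/></head><body><p/></body><trailer/></doc>"));
    }
}
```

The handler never touches an end tag, and the section cursor still lands on the right sibling each time. What the sketch does not protect against is using a child cursor after its parent has moved on, which is the ordering limitation discussed next.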

5. The limitation

So what's the catch? Can I now freely create and traverse cursors, even if only in forward direction? Yes and no: there is one fundamental limitation. All access still has to be done in document order, which means:

Parent element information has to always be completely accessed before child element information (that is, you can not access parent information [except if tracking is enabled -- but this uses different methods] after a cursor has advanced to a child element). Parent information includes attribute information, so that it is not possible to access attribute values of a parent, after advancing a cursor to a child element.
Siblings have to be accessed in the document order (although you can use tracking here, too). This is seldom a problem, since cursors only advance in one direction.
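
The same rule is easy to see with plain Stax, where attribute accessors always describe the current event. The following sketch (class and method names are made up, and it assumes the root has at least one child element) reads the parent's attribute before advancing, because afterwards it is simply gone:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DocumentOrder {
    /** Returns "parentId/childId" for the root element and its first child element. */
    static String parentAndChildIds(String xml) {
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            sr.nextTag(); // root start tag
            // Parent information must be read now, while we still point at the parent.
            String parentId = sr.getAttributeValue(null, "id");
            sr.nextTag(); // first child start tag (assumed to exist)
            // From here on the same call describes the child, not the parent.
            String childId = sr.getAttributeValue(null, "id");
            sr.close();
            return parentId + "/" + childId;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parentAndChildIds("<order id='o1'><item id='i1'/></order>"));
    }
}
```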

Similar limitations apply to the output side as well, although there too there are ways around ordering (specifically, it is possible to use a feature called "buffering" to delay the output of an element, allowing limited out-of-order addition of output: this is most often used to add attributes after children have been added).

How big is this limitation? It is no worse than the basic Stax API limitations, but it may be easier to overlook. However, if you understand the basic operation, and keep in mind the implied (but strongly enforced!) ordering restriction, you should be able to use cursors quite conveniently and efficiently.

6. Advanced Features

In addition to the basic hierarchic iteration, and convenient access to data, there are other more advanced features StaxMate input side offers. Since this is a tutorial article, these will not be explained in detail, but here is a short list of additional advanced features you can learn from the source code (or possibly later tutorials):

Access to simple positional indexes. StaxMate keeps track of the node and element order numbers for each cursor, and offers access to them. So, for example, if you need different handling for the first <li> child element and the following ones, you can do this by checking 'crsr.getElementCount()'.
Customized event filtering. If you are only interested in, say, comments in the document, you can easily implement SMFilter, or even just construct a SimpleFilter with the proper (Stax API-based) event flags.
Tracking: simply put, this allows retaining parts of the input structure even after cursors have been moved past that content. Tracking can be dynamically enabled on sub-trees, and when enabled, a temporary tree-like (but very light-weight) structure is maintained. This may be useful for simple state-tracking.
Customize (override) all object creation methods, to store additional information in cursor objects, or tracking info objects. All of the factory methods are designed to be overridable, so if you need light-weight additional storage for state information, you can just sub-class the objects you need, and the framework should be able to use your sub-classes instead of the default ones.
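
To give a taste of what flag-based event filtering means, here is a small plain-Stax sketch that keeps only the events whose type is set in a bit mask. The bit encoding (one bit per Stax event type constant) and the class name are assumptions of this example; I am not claiming this is how SimpleFilter encodes its flags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EventMaskSketch {
    /** One bit per Stax event type; an encoding chosen for this sketch only. */
    static int flag(int eventType) {
        return 1 << eventType;
    }

    /** Collects the text of every text-bearing event whose type is selected by mask. */
    static List<String> textOf(String xml, int mask) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (sr.hasNext()) {
                int ev = sr.next();
                // Pass the event only if its bit is set; hasText() guards getText().
                if ((mask & flag(ev)) != 0 && sr.hasText()) {
                    out.add(sr.getText());
                }
            }
            sr.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        int commentsOnly = flag(XMLStreamConstants.COMMENT);
        System.out.println(textOf("<!--a--><r><!--b--><x/>text</r>", commentsOnly));
    }
}
```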

Posted by Tatu Saloranta at Friday, August 18, 2006 11:26 PM
Categories: XML/Stax

CowTalk

Moo-able Type for Cowtowncoder.com
