Tuesday, August 29, 2006

Woodstox 3.0.1 released

It has been only a couple of weeks since the 3.0 release, but due to a high adoption rate (news at ServerSide quadrupled traffic at woodstox.codehaus.org!), bug reports started arriving soon after the release. The biggest contributing factor seems to be that newcomers tend to use a wider range of functionality, and thus find problems off the beaten path... At any rate, a few important bug fixes have been made since the initial release, so it seemed like a good idea to release a minor update, to make sure all fixes are available in 'binary' form.

For specific fixes, please refer to the release notes. Most of the problems fixed were not commonly encountered; so if you have not had any problems so far, it is not a required update. However, the chance of regression bugs should also be small, due to the limited scope of the changes.

The next short-term goal will be the 3.1 minor release: the exact feature set is still open, but here are some planned improvements:

  • Improve reporting of SPACE events in non-validating DTD-aware mode (currently only reported in validating mode)
  • Remove some of the asynchronous ("lazy") exceptions, so that regular XMLStreamExceptions can be thrown from next(), instead of unchecked exceptions from methods like XMLStreamReader.getText()
  • Simple heuristic indentation (already implemented for StaxMate, maybe possible to add easily to XMLStreamWriter itself)
  • Xml:id implementation?

Other feature requests will also be considered if suggested: one place to check out things that are on the Woodstox team's radar is Jira. It is also the best place to suggest new features and improvements (along with the mailing lists).

Friday, August 18, 2006

StaxMate basics, reader side

Before embarking on the journey to build a simple web application (we'll get there), it is necessary to explain the constructs we will use, so that the example itself only needs comments regarding actual functionality. So, here are typical usage patterns on the reader side.

1. Getting started

First things first: since StaxMate is built on top of the Stax API, you need to create a properly configured XMLStreamReader. StaxMate is quite adaptive, and Stax defaults are usually sufficient, so you can generally just use something like:

  XMLInputFactory f = XMLInputFactory.newInstance(); // remember to reuse
  XMLStreamReader sr = f.createXMLStreamReader(new FileInputStream("mydoc.xml"));
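The "remember to reuse" comment deserves a quick sketch. Looking up a factory is comparatively expensive, so a common pattern is to create it once and keep it around. The class and method names below (ReaderSetup, open, rootName) are made up for illustration, and sharing one factory across threads is an assumption you should verify for your Stax implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ReaderSetup {
    // Created once; newInstance() performs a service lookup on every call.
    // Assumption: the factory is safe to share once it has been configured.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Hypothetical helper: one place where all stream readers get created.
    static XMLStreamReader open(InputStream in) throws XMLStreamException {
        return FACTORY.createXMLStreamReader(in);
    }

    // Small demo: returns the local name of the root element.
    static String rootName(String xml) {
        try {
            XMLStreamReader sr = open(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            sr.nextTag(); // skips prolog comments and white space, stops at the root
            String name = sr.getLocalName();
            sr.close();
            return name;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

The reader returned by open() is what you would then hand over to StaxMate.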

2. Start iterating

All access to the input document is handled using cursors. Cursors come in two basic types: hierarchic cursors (also known as "child cursors", as they traverse immediate children) and flattening cursors (or descendant cursors, as they traverse all descendants: children, grand-children and so on). In addition to the traversal type, the second important property of a cursor is the type of events it filters out: the SMFilter interface can be implemented to specify which underlying events to filter out and which not. Most of the time the default ones (element-only filter, text filter, mixed filter) are good enough, and there are convenience methods for constructing other types you are likely to need.

At this point, we need to obtain the first cursor; usually it is a hierarchic element-only cursor, since we usually do not care about possible comments or processing instructions outside of the root element, but we do want to handle things hierarchically (level by level). So, most often you will get the root cursor by doing:

  SMInputCursor rootCrsr =  SMInputFactory.rootElementCursor(sr);               

So what does a cursor point to right after it has been created? Nothing, as of yet: similar to JDBC result set cursors, StaxMate cursors need to be advanced to point to the next applicable event (if any). They will return the type of the event, if there was one (for a root cursor there will always be a root element, for any well-formed document), or null otherwise. The return type is a type-safe enumeration (StaxMate requires Java 5 aka 1.5). So, typically you will see something like this:

  rootCrsr.getNext(); // for element cursor, return value will be SMEvent.START_ELEMENT

after which you can check that the element is what you think it should be (and/or do other validation):

  assert rootCrsr.getQName().equals(new QName("root"));

and perhaps access an attribute or two:

String id = rootCrsr.getAttrValue("id"); // convenience method for attrs without namespace        

and when you are ready to inspect the sub-tree starting from root, you do:

  SMInputCursor childCrsr = rootCrsr.childElementCursor();

3. Collect text

One common thing to do is accessing the textual content of a leaf element. Although the Stax XMLStreamReader does have a 'getElementText' method, it is a bit tricky to use, and will not work for mixed content (if there are child elements). Further, you still need to skip the end element after getting the text.

With StaxMate, you just do (assuming crsr points to a SMEvent.START_ELEMENT):

  String value = crsr.collectDescendantText();

and you get all the text the element contains, recursively if necessary, with all non-text content stripped out.
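For comparison, here is roughly what collecting all descendant text takes with plain Stax. This is only a sketch of the idea, not StaxMate's actual implementation, and the names PlainStaxText, collectText and textOfRoot are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PlainStaxText {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Precondition: sr points to a START_ELEMENT.
    // Postcondition: sr points to the matching END_ELEMENT.
    static String collectText(XMLStreamReader sr) throws XMLStreamException {
        if (sr.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("Reader must point to START_ELEMENT");
        }
        StringBuilder sb = new StringBuilder();
        int depth = 1; // number of start tags still waiting for their end tag
        while (depth > 0) {
            switch (sr.next()) {
            case XMLStreamConstants.START_ELEMENT:
                ++depth;
                break;
            case XMLStreamConstants.END_ELEMENT:
                --depth;
                break;
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
            case XMLStreamConstants.SPACE:
                sb.append(sr.getText());
                break;
            default:
                // comments, processing instructions etc. are simply dropped
            }
        }
        return sb.toString();
    }

    // Demo wrapper: collects all text under the root element.
    static String textOfRoot(String xml) {
        try {
            XMLStreamReader sr = FACTORY.createXMLStreamReader(new StringReader(xml));
            sr.nextTag(); // advance to the root START_ELEMENT
            String text = collectText(sr);
            sr.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

Note the depth counter: this is exactly the kind of end tag book-keeping you would rather not repeat in every handler.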

4. Share (and enjoy!) the cursor

One of the most mundane chores with Stax is the book-keeping needed for end tag balancing, especially if you want to modularize your code. In that case the called code has to be careful to match and skip all end tags for the start tags it has handled. This is tedious and error-prone, and with enough of it the code becomes harder to read than necessary. It also makes it very easy for called code to wreak havoc, by over-iterating over events it is not supposed to read. This is because there is just one XMLStreamReader.

Here StaxMate can help, not only because you never need to deal with end tags directly (when the underlying stream hits an end tag, the cursor knows it can't advance, and this is signalled by returning null -- plus, it is still safe to call getNext() again; you will just get another null), but also because all cursors are scoped such that they can only traverse over events within their scope. That is, a child or descendant cursor constructed for a cursor pointing to, say, start tag <tag> can only traverse over events up to the </tag> that matches the start tag. And finally, even if the child cursor does NOT traverse through all the events (called code gets bored, or found what it was looking for), the parent cursor knows how to automatically skip the "uninteresting" events in-between. That is, cursors are kept in sync.

So, quite often you will see method calls like:

  handleHeadSection(crsr.childElementCursor());
  crsr.getNext();
  assertElement(crsr, "body");
  handleBodySection(crsr.childElementCursor());
  crsr.getNext();
  assertElement(crsr, "trailer");
  handleTrailer(crsr.childElementCursor());
  // ...    

in which different handlers take care of different parts of the document, without any of them having to keep track of anything beyond its immediate needs.
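To appreciate what cursor scoping saves you, here is a minimal sketch of the same synchronization done by hand with plain Stax. It is not StaxMate code; PlainStaxSkip, skipElement and secondChildName are hypothetical names, and the demo assumes element-only content (no text between the root's children):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PlainStaxSkip {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Precondition: sr points to a START_ELEMENT.
    // Skips everything up to and including the matching END_ELEMENT,
    // no matter how much of the sub-tree a handler would have consumed.
    static void skipElement(XMLStreamReader sr) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int type = sr.next();
            if (type == XMLStreamConstants.START_ELEMENT) {
                ++depth;
            } else if (type == XMLStreamConstants.END_ELEMENT) {
                --depth;
            }
        }
    }

    // Demo: ignores the first child of the root, whatever it contains, and
    // returns the name of the second child (or null if there is none).
    static String secondChildName(String xml) {
        try {
            XMLStreamReader sr = FACTORY.createXMLStreamReader(new StringReader(xml));
            sr.nextTag();    // root START_ELEMENT
            sr.nextTag();    // first child START_ELEMENT
            skipElement(sr); // now at the first child's END_ELEMENT
            sr.nextTag();    // second child START_ELEMENT, or root END_ELEMENT
            String name = sr.isStartElement() ? sr.getLocalName() : null;
            sr.close();
            return name;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Unexpected document structure", e);
        }
    }
}
```

With a single shared XMLStreamReader, every handler has to leave the stream exactly at its own end tag for this to work; with cursors, the parent takes care of the skipping.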

5. The limitation

So what's the catch? Can I now freely create and traverse cursors, even if only in forward direction? Yes and no: there is one major fundamental limitation. All access will still have to be done in document order. So that:

  1. Parent element information has to always be completely accessed before child element information (that is, you can not access parent information [except if tracking is enabled -- but this uses different methods] after a cursor has advanced to a child element). Parent information includes attribute information, so that it is not possible to access attribute values of a parent, after advancing a cursor to a child element.
  2. Siblings have to be accessed in the document order (although you can use tracking here, too). This is seldom a problem, since cursors only advance in one direction.

Similar limitations apply to the output side as well, although there too there are ways around ordering (specifically, it is possible to use a feature called "buffering" to delay outputting of an element, allowing limited out-of-order addition of output: this is most often used to add attributes after children are added).

How big is this limitation? It is no worse than the basic Stax API limitations, but it may be easier to overlook. However, if you understand the basic operation, and keep in mind the implied (but strongly enforced!) ordering restriction, you should be able to use cursors quite conveniently and efficiently.

6. Advanced Features

In addition to basic hierarchic iteration and convenient access to data, the StaxMate input side offers other, more advanced features. Since this is an introductory tutorial, these will not be explained in detail, but here is a short list of additional features you can learn about from the source code (or possibly later tutorials):

  1. Access to simple positional indexes. StaxMate keeps track of node and element order number for cursor, and offers access to it. So, for example, if you need different handling for the first <li> child element, and following ones, you can do this by checking 'crsr.getElementCount()'.
  2. Customized event filtering. If you are only interested in, say, comments in the document, you can easily implement SMFilter, or even just construct SimpleFilter with proper (Stax API - based) event flags.
  3. Tracking: simply put, this allows retaining parts of the input structure even after cursors have been moved past that content. Tracking can be dynamically enabled on sub-trees, and when enabled, a temporary tree-like (but very light-weight) structure is maintained. This may be useful for simple state-tracking.
  4. Customize (override) all object creation methods, to store additional information in cursor objects, or tracking info objects. All of the factory methods are designed to be overridable, so if you need light-weight additional storage for state information, you can just sub-class the objects you need, and the framework should be able to use your sub-classes instead of the default ones.

Friday, August 11, 2006

Introducing StaxMate -- the perfect companion for your favorite Stax XML processor

Now that Woodstox 3.0.0 is released (see one of the recent entries here at CowTownCoder), it is a good time to introduce another, even less widely known utility: StaxMate. Although it has been in development for quite a while, and even used by its author for almost a year, it has remained largely unnoticed, hopefully only due to its lack of documentation.

So what is StaxMate and why should I care?

If you are perfectly happy using the raw Stax API, you probably do not need StaxMate. But if you have ever felt that using the plain vanilla Stax API (especially the cursor API) is... well, an acquired taste, or at least a bit inconvenient, you may want to have a look at StaxMate.

The raison d'etre of StaxMate is to add a bit of "syntactic" (or should I say synthetic?) sugar and a tad of cream, but with a moderate amount of extra calories. That is, the overhead introduced should be nominal (less than that of using the Event API), to allow running fast Stax-based streaming processing, but in a bit more convenient fashion, by accessing XML the way it is structured (in a hierarchic manner), and by focusing on things you really care about. I mean, really, usually one does not care if there are comments within elements, or, for element-only content, if there is white space in there. For example, given the following XML document:

<doc><!-- title follows -->
 <title>the title</title>
 <body>
  <abstract>Hi mom!</abstract>
 </body>
</doc>                    

The basic Stax Cursor API would feed you an event sequence like:

   START_DOCUMENT
   START_ELEMENT (doc)
   COMMENT
   CHARACTERS (white space)
   START_ELEMENT (title)
   CHARACTERS (the title)
   END_ELEMENT
   ... (and so on)
   END_DOCUMENT                
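If you want to see such a sequence for yourself, here is a minimal sketch that prints it using nothing but the standard Stax API (whichever implementation happens to be on your classpath). The class and method names are made up for illustration, and the exact splitting of text into CHARACTERS events may vary between implementations:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventDump {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Short, human-readable description of the event the reader points to.
    static String describe(XMLStreamReader sr) {
        switch (sr.getEventType()) {
        case XMLStreamConstants.START_DOCUMENT:
            return "START_DOCUMENT";
        case XMLStreamConstants.END_DOCUMENT:
            return "END_DOCUMENT";
        case XMLStreamConstants.START_ELEMENT:
            return "START_ELEMENT (" + sr.getLocalName() + ")";
        case XMLStreamConstants.END_ELEMENT:
            return "END_ELEMENT";
        case XMLStreamConstants.COMMENT:
            return "COMMENT";
        case XMLStreamConstants.CHARACTERS:
            return sr.isWhiteSpace() ? "CHARACTERS (white space)"
                                     : "CHARACTERS (" + sr.getText() + ")";
        default:
            return "OTHER (" + sr.getEventType() + ")";
        }
    }

    // Collects descriptions of all events of the document, in document order.
    static List<String> events(String xml) {
        List<String> result = new ArrayList<String>();
        try {
            XMLStreamReader sr = FACTORY.createXMLStreamReader(new StringReader(xml));
            result.add(describe(sr)); // a new reader points to START_DOCUMENT
            while (sr.hasNext()) {
                sr.next();
                result.add(describe(sr));
            }
            sr.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<doc><!-- title follows -->\n <title>the title</title>\n"
            + " <body>\n  <abstract>Hi mom!</abstract>\n </body>\n</doc>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```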

Now while this is 100% accurate, it is also a nuisance to sift through all these events, if one just wants to know the textual content of element 'abstract'. Why should I need to keep track of start and end elements, check for CHARACTERS events that contain only white space, or skip comments? The processor already has all the information, so why do I have to write all the monkey code for traversing sub-trees, skipping, and such? Shouldn't things be easier and "just work"?

I think they should. With StaxMate, you can think in terms of cursors that can ignore all events except for the ones you care about: typically you want to only see elements (for non-mixed content) or elements and text. Further, when encountering an element, you may just want to get all the contained text, independent of any other XML events that may lurk in there (comments, processing instructions, unknown child elements). And finally, if you don't care about an element and the sub-tree it contains (optional elements in your content model, for example), you can just ignore it by advancing the cursor. StaxMate can keep track of all the details for you. There are many more advanced features StaxMate can offer on the reader side (building a partial tree of the current element's parents, and/or previous siblings, for example), but the main point is really the things that make simple content processing tasks, well, simple.

Similarly, on the output side, you can free yourself from namespace binding problems, as well as from having to keep track of how many end tags are needed. Output objects can keep track of what is needed and where, based on the things you do want to add. The output side can also do simple heuristic-based indentation. And finally, for cases where document-order output just is not good enough, you can do some limited out-of-order output (for example, adding attributes to the parent element after adding child elements; or adding a place-holder, "dummy element", under which you can add other elements): in this case StaxMate can temporarily buffer your output for you, to be released once you are done with adding output.

If all of the above sounds interesting (even if vague), I will try to write a simple sample web service using StaxMate, for my next blog entry. Stay tuned!

Thursday, August 10, 2006

Using Stax2 (Woodstox 3.0) Validation API, part 3

Continuing on the theme of validating XML content processed with Woodstox, using the Stax2 extension of the Stax API, let's do something more interesting: validate content as it is getting written (note: the full source code for the example shown below can be found at http://woodstox.codehaus.org/DocStax2Validation).

So, here is a piece of code that demonstrates how to validate XML output as it is being written (using XMLStreamWriter), using the Stax2 API extension.

final String DTD_STR = "<!ELEMENT root (branch | leaf)*>\n"
  +"<!ELEMENT branch (leaf)+>"
  +"<!ELEMENT leaf (#PCDATA)>"
  +"<!ATTLIST leaf desc CDATA #IMPLIED>\n";
StringWriter strw = new StringWriter();
// First, let's parse DTD schema object
XMLValidationSchemaFactory sf = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
XMLValidationSchema dtd = sf.createSchema(new StringReader(DTD_STR));
XMLOutputFactory ofact = XMLOutputFactory.newInstance();
XMLStreamWriter2 sw = (XMLStreamWriter2) ofact.createXMLStreamWriter(strw);
sw.validateAgainst(dtd); // this starts validation
// Document validation is done as output is written
try {
 sw.writeStartDocument();
 sw.writeStartElement("root");
 sw.writeStartElement("branch");
 sw.writeStartElement("leaf");
 sw.writeEndElement();
 // We'll get validation exception here -- branch not allowed within branch
 sw.writeStartElement("branch");
 sw.writeEndElement();
 sw.writeEndElement();
 sw.writeEndElement();
 sw.writeEndDocument();
 sw.close();
} catch (XMLStreamException xse) {
 System.err.println("Failed to output the document: "+xse);
}

You may notice some similarity with the earlier reader side example (and if not, you may want to have another look!). The pattern is quite simple: obtain a schema object from schema factory, passing in schema content from any of typical content sources (InputStream, Reader, javax.xml.transform.Source), and start validating content being read (using XMLStreamReader) or written (using XMLStreamWriter). How is that for simplicity? Even more advanced things like chaining multiple instances of validators, or doing partial validation, just use these basic mechanisms (ok, except for partial validation also needing to use method stopValidatingAgainst()...)

Now, what is the point of validating output? Since you write output code, shouldn't you be able to do it just fine with normal testing? In above example there isn't much need for validation, obviously, but there are other cases where output validation makes sense. For example:

  • During testing, you may want to enable strict input and output side validation, as assertions verifying correctness of code, even if you disable validation in production. And even in production, you may be able to easily re-enable validation as needed.
  • When doing transformations, it is hard to cover all the possible outputs that might result: even worse, when using technologies like XSLT, there is no formal way of (statically) ensuring that the output will conform to a given schema. But you can assert validity on output side quite simply by validating against specific schema.
  • When pipelining XML content, it may be easier (and more efficient) to plug in processing component between output stream writer, and actual physical output, than having to write output to a temporary location, and then parsing for validation.

Another question is what is the specific point of using Stax2 validation, over, say, using stand-alone validators or plugging in SAX-based validators. One benefit is that validation done as part of reading/writing XML is likely to be more efficient, as input/output is only parsed/generated once. Also, diagnostics regarding the problem are likely to be more accurate when validation is synchronized with actual processing.

As to validation schema objects, it is worth noting that these schema objects are fully reusable (the actual validators that are created from schemas are not; calls to validateAgainst() create validator objects behind the scenes), as well as thread-safe. This means that in general you can just create validation schema objects once when the system starts up (for a static set of schemas at least), and fully reuse them afterwards.

Given that it is easy to validate XML output this way, I hope that more developers will make use of this feature. I am also interested in hearing about experiences from doing this (feedback can be sent to stax_builders mailing list, for example).

Tuesday, August 08, 2006

Third Time's the Charm -- Woodstox 3.0 released!

Lo and Behold, the day is here. Woodstox 3.0 is finally released! For minute details of what has changed since the last release candidate, you can check out the release notes. And at higher level, I already listed the high-level changes since 2.0. So what else is there to be said about this version?

The main thing is that 3.0 is now considered the stable release. It has been extensively tested, retested, regression tested, using not only StaxTest conformance test suite and Woodstox' own JUnit test suite, but also using the vast automated end-to-end processing test suite of Nux (with above 45000 sample documents). So the quality should be superior to that of 2.0 series, including all-around improved standards compliance (both in regards to XML in general, and Stax specifically).

So, if you have been using Woodstox 2.0.x up until now, it is a very good time to upgrade. It will be worth it.

Friday, August 04, 2006

Using Stax2 (Woodstox 3.0) Validation API, part 2

Ok, first things first: the code sample I will go through can be found at http://woodstox.codehaus.org/DocStax2Validation, and will be part of the Woodstox source code distribution (in src/samples/).

So here is the basic usage pattern for the Stax2 validation API on the reader side. The example will validate a document read via XMLStreamReader, but the same could easily be done with XMLEventReaders: you just need to first construct the stream reader, and then construct the event reader using that specific stream reader. Order of attack is as follows:

  1. Get an instance of XMLValidationSchemaFactory that knows how to parse schemas of the type you need (RelaxNG == rng for this example).
  2. Ask the factory to construct an XMLValidationSchema, given a resource (file, URL, InputStream, Reader): it will parse the resource as necessary.
  3. Construct your Stax stream reader as usual
  4. Enable validation using schema you got from step 2
  5. Traverse the input document using stream reader -- this is necessary, since validation is done in fully streaming manner.
  6. There is no step 6!

Sound simple enough? Ok, here is the source code (minus comments, error handling and class declaration of the actual sample class -- for full class, see the link above), with comments indicating where each step starts:

// step 1: get schema factory
XMLValidationSchemaFactory sf = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_RELAXNG);
// step 2: construct validation schema instance
XMLValidationSchema rng = sf.createSchema(new File(args[1]));
// step 3: construct stream reader
XMLInputFactory2 ifact = (XMLInputFactory2)XMLInputFactory.newInstance();
XMLStreamReader2 sr = ifact.createXMLStreamReader(new File(args[0]));
// step 4: enable validation
sr.validateAgainst(rng);
// step 5: stream through the document:
while (sr.hasNext()) {
  sr.next();
}
// done!
   

And there you have it: simple validation of an XML document against a Relax NG schema. The exact same procedure (except for getting a different validation schema factory in step 1) would work for other types, specifically for DTDs. And in future, with other pluggable schema factories, for other schema languages like W3C Schema as well.

Now that you know how to do this simple task, possible next tasks would be:

  1. Validating XML document you are writing against a Schema: not surprisingly, code looks very similar to above. In fact, you only need to change steps 3 and 5!
  2. Validating a single document against multiple schemas. Just repeat steps 2 and 4 multiple times!
  3. Writing your own custom schema validators, to separate business level data validation from access.
  4. Validating sub-trees, possibly against different schemas.

I will try to find time to write about some of above ideas in near future. Stay tuned!

Thursday, August 03, 2006

Using Stax2 (Woodstox 3.0) Validation API, part 1

One of the new features of Woodstox 3.0 is its completely redesigned and reimplemented validation system. Changes are complete, as both the interface (2.0 implemented basic Stax 1.0 API, and simple property-based native extensions) and the implementation (2.0 had in-built DTD validator) have been completely re-built.

The new interface to the validation sub-system is via experimental Stax2 package (defined under org.codehaus.stax2 package and its sub-packages, included in Woodstox distribution). This is in addition to the basic "enable DTD validation" property that was all that the original Stax 1.0 API defined in regards to validation. Internal implementation of the DTD validator was changed to be accessible via this new interface, and an additional optional Sun's Multi-Schema Validator based Relax NG validator was also added (initially it was hoped that a W3C Schema validator would also be included, but this was deferred until after 3.0 release).

The main features of the new Validation API can be summarized as follows:

  • Fully bi-directional: both documents processed with Stream/Event Readers AND Writers can be validated against same schemas, using same interface. Schema and validator instances work on both, since the interface they define (and context they get) is identical.
  • Implementations are pluggable: Schema instances are created using factories similar to the basic Stax 1.0 XMLInputFactory and XMLOutputFactory (org.codehaus.stax2.validation.XMLValidationSchemaFactory), and registered using the standard service definition mechanism.
  • Validators are chainable: one can use more than one validator per input/output processor.
  • Dynamic enabling/disabling of validators: it is possible to start/stop validation mid-stream (within constraints that the validator implementations may impose): specifically, it should be possible to validate sub-trees, instead of complete documents.
  • Possible to register error handlers, to implement different validation error handling strategies: from fail-fast to collect-all-problems or somewhere in between.
  • High-performance streaming validation: interface is designed to avoid unnecessary overhead when passing content to validate, so that implementations can try to optimize for performance.

So how does one use the new API? I just recently added first 2 sample classes into Woodstox distribution, to show-case simple reader-side validation. These classes are under 'src/samples' in Woodstox SVN repository, for those who need to learn it now.

Tomorrow I will show specific examples (based on above-mentioned sample classes), to show how simple validators can be written using Woodstox 3.0 and its Stax2 Validation API. Stay tuned!



About me

  • I am known as Cowtowncoder