Friday, July 11, 2008

Minor update for StaxMate in making: 1.2 with Sjsxp compatibility

It will be a while until StaxMate 2.0 is done, mostly since I have decided that it should offer full support for the new (and still work-in-progress...) Typed Access API. Since this beast will be included in Woodstox 4.0 (as part of Stax2 Extension API, curiously versioned at 3.0), it will be a while until all pieces are in place.

But in the meantime there are smaller (but tasty) fish to fry. For example, while it has been implied that StaxMate ought to work with any old Stax implementation, it really has only been tested on Woodstox. Of other implementations, there is just one that actually matters, Sun Java Streaming XML Parser (Sjsxp), which is bundled with JDK 6 (and 7, I assume). After some testing, it turns out that StaxMate 1.1 almost works with Sjsxp. But not quite. Two particular fixes were needed to get all unit tests to pass. Given that this was relatively easy to do, and that the results are useful (one can test StaxMate using, say, Sjsxp, and upgrade to Woodstox as necessary), I think this might just be the main new thing for the StaxMate 1.2 maintenance release; and if other low-hanging fruit are found, there's always the possibility of cutting 1.3.
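As a small illustration of why implementation independence is handy, here is a minimal sketch that uses only the standard javax.xml.stream API. The class and method names (StaxImplCheck, implementationName, countElements) are made up for this example and are not part of StaxMate. With nothing extra on the classpath the factory lookup should find the JDK-bundled parser; with Woodstox on the classpath it would typically pick that up instead, with no code changes.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only
final class StaxImplCheck {
    /** Name of the input factory class that the standard Stax lookup selected. */
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    /** Counts START_ELEMENT events using nothing but the standard Stax API. */
    static int countElements(String xml) {
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            sr.close();
            return count;
        } catch (XMLStreamException e) {
            // Wrapped as unchecked to keep the sketch short
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Stax implementation: " + implementationName());
        System.out.println("Elements: " + countElements("<root><a/><b/></root>"));
    }
}
```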

Thursday, June 05, 2008

Woodstox 3.2.6 released: DOM writing, XMLReporter fixes

Yes, summer may be slow season in software business, but Woodstox patches keep on coming. This time there are 2 main issues that needed prompt patching:

  1. DOM-backed writer (one you get when constructing XMLStreamWriter using DOMResult) had a serious problem when trying to add namespace declarations (WSTX-144).
  2. XMLReporter was not getting properly called for DTD-validation problems, but rather exception was directly generated (WSTX-153).

These were the only changes, beyond adding regression unit tests to verify the fixes. Upgrade hence is only necessary if you use DOM-backed writers, or rely on XMLReporter. But then again, if you do either of those, you definitely want the update.
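For those who have not used XMLReporter, here is a rough sketch of how one gets registered through the standard Stax API. ReporterDemo and collectProblems are names invented for this example. Note the assumptions: DTD validation is not enabled here (with Woodstox one would also set XMLInputFactory.IS_VALIDATING to true; the JDK-bundled parser is not a validating one), so on a plain well-formed document the reporter will likely never be called. Fatal well-formedness problems are still thrown as exceptions rather than reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only
final class ReporterDemo {
    /** Parses the whole document and returns the problems that were seen. */
    static List<String> collectProblems(String xml) {
        List<String> problems = new ArrayList<>();
        XMLInputFactory f = XMLInputFactory.newInstance();
        // Standard Stax callback: (message, errorType, relatedInformation, location)
        f.setXMLReporter((message, errorType, related, location) ->
                problems.add(errorType + ": " + message));
        try {
            XMLStreamReader sr = f.createXMLStreamReader(new StringReader(xml));
            while (sr.hasNext()) {
                sr.next();
            }
            sr.close();
        } catch (XMLStreamException e) {
            // Fatal problems bypass the reporter and arrive as exceptions
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }
}
```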

And now back to the regular programming, trying to get the "Typed Access API" (org.codehaus.stax2.typed.*) implemented in Woodstox. Stay tuned!

Tuesday, May 06, 2008

Performance of XML data binding on Java Platform

One of possible future projects on my sizable mental list has been that of comparing performance of open source xml/POJO data binding toolkits. For some reason there are not many actual up-to-date good benchmarks out there. In fact, I can not name even one (for example, BindMark project which might do the trick seems to be dead, and test code I looked at looks.... well, rotten...). My own limited testing has led me to suspect that the fastest current choice might well be JAXB 2 (it seems to have very low overhead over basic Stax parsing), but it would be nice to prove that. Besides, it would be good to check out how JiBX would fare: it is supposed to be highly performant as well.

But today I found this article. Very cool, there are actually people doing serious benchmarking too, and in the very area I would be interested in testing. It also does further support my thinking about limited overhead of not only JAXB 2, but the cool StaxMate helper library.

Thursday, April 24, 2008

Quick Introduction: project Aalto, from Cowtown Skunkworks

As some of you already know, Yet Another high-performance Java xml processor project was recently launched. Aalto xml processor is work-in-progress, and approaching its 1.0 release. I will try to write bit more on reasons behind starting this project on another entry, but for now it is enough to know that there are 2 main technical goals:

  1. Be Wicked Fast (check this out for some suggestions as to what is achievable)
  2. Implement Non-Blocking XML parsing mode (reads from underlying content do not block, but rather return EVENT_INCOMPLETE or such)

Both of these goals are already achieved to some degree: Aalto is almost twice as fast as Woodstox on many common documents (and hence matches or exceeds speeds of native code parsers like libxml2 -- I kid you not; likewise, binary xml parsers will get good run for their money when being compared to Aalto); and it does have experimental non-blocking (aka asynchronous) parser implementation. Challenges still remain, such as how to define standard extensions to support non-blocking mode.

For those interested in learning more, the important links are:

And how about immediate roadmap? Plan is to get Stax 1.0 API completed for 1.0 release (to be released within next few months), and the missing pieces are:

  • Implementation of coalescing mode (which, however, is missing from the Stax Reference Implementation, so hardly a must-have feature even if supposedly non-optional as far as Stax specs are concerned)
  • Implementation of repairing XMLStreamWriter
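To show what coalescing mode means in practice, here is a minimal sketch against the standard Stax API (not Aalto-specific; CoalescingDemo and textSegments are hypothetical names). With XMLInputFactory.IS_COALESCING enabled, adjacent text and CDATA segments should be handed out as a single text event; without it, the parser is free to split them however it likes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only
final class CoalescingDemo {
    /** Returns the text of each CHARACTERS/CDATA event, in document order. */
    static List<String> textSegments(String xml, boolean coalescing) {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        List<String> segments = new ArrayList<>();
        try {
            XMLStreamReader sr = f.createXMLStreamReader(new StringReader(xml));
            while (sr.hasNext()) {
                int event = sr.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    segments.add(sr.getText());
                }
            }
            sr.close();
        } catch (XMLStreamException e) {
            // Wrapped as unchecked to keep the sketch short
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return segments;
    }
}
```

Calling textSegments("<a>x<![CDATA[y]]>z</a>", true) should give one segment containing all of the text, whereas passing false may give several segments, depending on the implementation.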

Other than these main features, the only significant missing thing is DTD-handling: Aalto does not parse DTDs (it does know how to skip internal subsets well), and although there is nothing fundamentally preventing from adding support, amount of work is big enough that it will not be done before 2.0 (if even then).

Anyway, hope to write little bit more about this exciting new (or, "new old"... project history is not all that short) project shortly. Don't switch the channel!

Saturday, April 19, 2008

How does one parse "XML" documents with multiple roots?

Ok, sure, title is bit of a trick question: after all, no xml document is allowed to have more (or less) than one root element. So the correct answer would appear to be "one does not". But there are ways to phrase this question more properly, for example by considering there to be implicit (and/or incomplete, insufficient, missing) framing, failure of which to handle would lead to what looks like a "forest of xml documents". Or, perhaps one just wants to parse an "xml fragment", which can consist of multiple main level elements. And sometimes business reasons dictate one just has to deal with broken stuff. Money talks and bullshit gets worked with.

With this background, it is nice to know that Woodstox xml parser can indeed deal with such non-standard xml constructs. For details of how to do this, one has to venture into using Woodstox-specific input properties: specifically, use com.ctc.wstx.api.WstxInputProperties#P_INPUT_PARSING_MODE, and set it (via inputFactoryInstance.setProperty(...)) to one of the non-default values (PARSING_MODE_DOCUMENTS or PARSING_MODE_FRAGMENT). Best of all, you can just read this nice article for actual code samples and more musing on why this sometimes needs to be done. The article is, I think, yet another example of how the user community is really what makes good things great in the Open Source ecosystem. Maybe I should figure out a way to more systematically link to such stories from Woodstox project page?
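For comparison, when a Woodstox-specific property is not an option, a common portable workaround is to wrap the fragment in a synthetic root element before handing it to any Stax parser. To be clear, this is not the Woodstox parsing mode described above, just a rough sketch of the alternative. FragmentWrapping, topLevelNames and the "wrapper" element name are all invented for the example, and it assumes the fragment has no xml declaration or DOCTYPE of its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only
final class FragmentWrapping {
    /** Returns local names of the fragment's top-level elements, in order. */
    static List<String> topLevelNames(String fragment) {
        // Synthetic root turns the "forest" into one well-formed document
        String wrapped = "<wrapper>" + fragment + "</wrapper>";
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wrapped));
            int depth = 0;
            while (sr.hasNext()) {
                int event = sr.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Depth 0 is the synthetic wrapper; depth 1 is a fragment root
                    if (depth == 1) {
                        names.add(sr.getLocalName());
                    }
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            sr.close();
        } catch (XMLStreamException e) {
            // Wrapped as unchecked to keep the sketch short
            throw new IllegalArgumentException("Fragment is not well-formed", e);
        }
        return names;
    }
}
```

The downside compared to a real fragment mode is that the caller has to remember to skip the synthetic wrapper, and content that brings its own xml declaration cannot be wrapped this way.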

Friday, March 14, 2008

XML 2.0!

No, version 2.0 of the xml standard has not been released (well, 1.0 update 5 tries to do something similar but that's another sad sad story...). However, Norm Walsh has some insightful and interesting musings on the subject here. I like to read Norm's suggestions in general, given his sensible and pragmatic approaches. It certainly beats reading through streams of nonsense at mailing lists like xml-dev. Granted, chances for getting xml version 2.0 (or, any sensible improvement beyond 1.0, it seems) can be estimated to be in the range between slim and none. But without good suggestions and proposals, chances are even weaker.

Friday, February 08, 2008

Woodstox pre-history

"So How Did It All Start?"

As an author of a highly publicized software package, one often has to answer questions like above. That -- in addition to the fame, money and chicks -- is one of fringe benefits of being an uber-geek programmer doG.

But, ask you: how does that relate to me? Sadly, not in any way, shape or form. But I just thought I'll start with such a claim to grab your attention.

Anyway: the other day I started thinking about how I would answer such a question if anyone ever asked it. Interestingly, I am already having hard time remembering exactly when did Woodstox project start, as well as important milestones there have been. This despite the fact that it hasn't been a life-long hobby (although countless hours have been spent on it). What is curious here is just the difference between new projects, where you can usually remember details well ("see, I added that feature on tuesday, this other one on thursday, and that thing will be done... say, tomorrow"). But it doesn't take all that long to start losing track of history.

In Woodstox' case, I am lucky enough to have left a trail of mini update notes via crude Woodstox news section at Codehaus. That log points to mid-2005 being one of highlights, release of version 2.0 with its full DTD-validation (full albeit not fully compliant, as I learn during 3.0 development...). That is a starting point, at least, outlining more recent, and somewhat less active, development history (although, granted, 2.0 -> 3.0 development cycle may have burnt more time than either getting 1.0 or getting 2.0 out the door).

But there was obviously some history before 2.0 release. I am no Microsoft, and my versioning scheme does not skip Important Release Numbers like 1.0!

Ok, but what REALLY happened?

Now, while I do not keep diary, I happen to have something similar at my disposal. My old linux desktop file system still has a copy of the earliest Woodstox home page (which, back in the day, was hosted here at cowtowncoder.com -- heck, I suspect I ordered the domain just to have a kewl domain for Woodstox, if my memory serves me), along with matching timestamps for downloadable files. So sequence of events, assuming timestamps are correct, is as follows:

  drwxr-xr-x  3 tatu tnt  4096 May 30  2004 0.7
  drwxr-xr-x  4 tatu tnt  4096 Jun 21  2004 0.8
  drwxr-xr-x  4 tatu tnt  4096 Aug 13  2004 0.8.8
  drwxr-xr-x  4 tatu tnt  4096 Aug 13  2004 0.9.0
  drwxr-xr-x  4 tatu tnt  4096 Aug 25  2004 0.9.1
  drwxr-xr-x  4 tatu tnt  4096 Oct 11  2004 1.0-final
  drwxr-xr-x  4 tatu tnt  4096 Oct 23  2004 1.0.1
  drwxr-xr-x  4 tatu tnt  4096 Nov 14  2004 1.0.2
  drwxrwxr-x  4 tatu tnt  4096 Mar  2  2005 1.0.3
  drwxrwxr-x  4 tatu tnt  4096 Mar 10  2005 1.0.4
  drwxrwxr-x  4 tatu tnt  4096 Mar 23  2005 1.0.5

So, given that I think version 0.5 must have been the first official release (ok ok, so my versioning scheme does have its quirks starting with "half versions"), which probably was cut somewhere in March 2004, Woodstox project could soon celebrate its 4th birthday. As such it has probably outlived commercial systems I have ever implemented...

Now, looking back in time is enlightening in many ways. Although exact release dates in this case are of historical curiosity value, if any, they can have surprisingly high indirect value: they can help one remember more important things that happened at around same time. For example, the thing that triggered my starting the project must have been the death of my-then-coolest project at Sun Microsystems (the [in]famous Voyager content management system!). Seeing it get killed due to politics, to be replaced by something more absurdly stupid than the Bad-News-powered rocket ship from the Hitchhiker's guide was just the kick in the balls I needed to spend less time at work, and focus on something cool outside working hours. That, and a very session by the Great Pragmatic Programmer mr. Dave Thomas, that I also listened to at about same time, explaining why one must have The Plan for skill development. Developing an xml parser may not seem like the optimal choice, but it "just seemed like a good idea at the time".

Woodstox project also was one of things that made it easier to stay at that job (although switching to a different team to avoid having to implement that idiotic replacement system) for almost a year, before I could move on to better things (despite having basically lost all my respect for my managers and company, not to mention motivation). And interviews for this Better Thing (which is still my job after this time) must have occurred on almost the exact day I released version 1.0! (or, rather, night after I flew back from Seattle). And, very very close to 9 months before birth of my younger daughter... :-)

Scary but true -- when it's raining, it's pouring. I better check timeline of Jackson and Aalto Xml Processor in a year or two. Or perhaps even my First Open Source project, JUG (Java Uuid Generator).

Wednesday, February 06, 2008

Maintenance releases: Woodstox 3.2.4, Jackson 0.8.2

Public service announcements for 2008

Quick update on state of projects: both Woodstox and Jackson projects released minor bug-patch versions to kick off the new year.

Releases are available from respective home pages, and here are quick links for change lists:

Friday, December 07, 2007

StaxMate 1.1 released

Another quick (if not timely) release note: version 1.1 of StaxMate was released a week ago. This release contained some minor bug fixes and improvements, and the most important changes (significant internal refactoring) should not have user visible effects. But it is an important intermediate release nonetheless, mostly because the refactoring should make cursor synchronization (handling access via multiple cursors correctly, independent of how one accesses them) much more robust.

I have also been thinking quite a bit about possible XInclude support, since XInclude is a rather useful feature when one has to modularize xml configuration files. It basically replaces one of the last things for which DTDs are useful, that is, ability to refer to external parsed entities. There are some interesting challenges in trying to make things work transparently, however, especially since StaxMate would need to be able to instantiate additional XMLStreamReader instances during traversal of document that does XInclude inclusion.

Friday, November 23, 2007

W3C Schema Validation with Woodstox

To All You Schema Lovers

(... yes, both of you)

Ok, so maybe not many software developers truly love W3C Schema, deep down in their cold cold hearts. But the fact is that it may be ugly, bulky and all (and unlikely to grow into a swan too!), but it also has its uses. It is used as data typing language for things like SOAP and such. Occasionally it may even be useful for its original raison d'etre, validation of xml documents. So if the earlier validation support in Woodstox (DTD, RelaxNG) was not enough, now you can finally also validate documents you read (and write!) against W3C Schemas. This is possible with Woodstox 4.0 version, including the first pre-4.0 preview release, 3.9.0 (fresh out of oven).

So how do I use it?

If you have been following this blog for a while, you may recall that this has already been covered -- given that same Stax2 API is used for all validation, be it for reader- or writer-side validation, and whichever supported schema language, all you have to do is indicate the correct type, and validate exactly the same way as you would validate against a RelaxNG schema. For others, here are some helpful pointers:

Essentially it all boils down to these simple steps:

  1. Get a schema factory that knows how to parse W3C Schema instances
  2. Ask factory nicely to parse a schema document and return you the resulting Stax2 validation schema object (be sure to ask very nicely, otherwise it'll insist you must provide something less daft, like RelaxNG schema instead! Sending a small bottle of Grand Marnier or PayPal donation to Woodstox author might help as well)
  3. Construct the stream reader/writer, and tell it to use schema object for validation
  4. Read/write xml content; this is needed as validator gets called when content is read or written.

Which might look something like:

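  // Needs Woodstox 4.x (with its W3C Schema validator) on the classpath, plus imports:
  //   java.io.FileInputStream, java.net.URL, javax.xml.stream.XMLInputFactory,
  //   org.codehaus.stax2.XMLStreamReader2, org.codehaus.stax2.validation.*

  // 1. Get a schema factory that knows how to parse W3C Schema instances
  XMLValidationSchemaFactory sf =
      XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA);
  // 2. Parse the schema document into a Stax2 validation schema object
  XMLValidationSchema vs = sf.createSchema(new URL("http://www.w3.org/schema/sample.xsd"));
  // 3. Construct the reader (cast works when the factory is Woodstox)
  //    and tell it to validate against the schema
  XMLStreamReader2 sr = (XMLStreamReader2)
      XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream("mydoc.xml"));
  sr.validateAgainst(vs);
  try {
    // 4. Read the content; the validator gets called as content is read
    while (sr.hasNext()) {
      sr.next();
    }
    System.out.println("Validated ok!");
  } catch (XMLValidationException ve) {
    System.err.println("Validation problem: " + ve);
  }
  sr.close();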

Or something.

And for something actually cool, try the same when you are writing xml content. Instead of just catching crap some other system sends you (by diligent validation of incoming content), how about do some more due diligence, validate your own output and avoid sending garbage to others!

