Thursday, December 04, 2008

How to alleviate the infamous "Xml Invalid Character" problem with Woodstox

1. The Problem

Have you ever hit a problem manifesting itself like so:

  Error: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character 
  ((CTRL-CHAR, code 12))

when parsing XML? There is a good chance you have, if you regularly process xml.

So how and why does it occur? For common use cases, where one controls both the data source (the stuff that gets written as xml) and the reading side, problems seldom occur. Few people feel the urge to add such weird characters in the first place. Trouble usually starts only when a legacy data source is used for populating/constructing xml content, or when using a data source with very loose validation. The former often comes as a (c)rusty old Oracle data dump, and the latter as a simplistic system where user data (from a web form or such) is shoved straight into crusty old Oracle, to create the data compost (which eventually becomes legacy data). Either way, the characters that cause problems are, more often than not, not even supposed to be there.

But just as importantly, while characters such as "vertical tab" or "form feed" are usually of little use nowadays (they are left-overs from days past, when one had to use crude in-band signaling mechanisms), they are also often non-problematic: web browsers, for example, mostly convert these to other harmless character codes (such as plain space) before displaying. So, they are colorless and tasteless. Except for xml parsers, which are mandated by law (well, ok not quite, just by xml specs...) to report such irregularities.

So here's the catch: the XML specification explicitly forbids such characters. They cannot be included in xml content, anywhere. Not in CDATA sections, not as attribute values, not in processing instructions (not with the mouse, not in the house, Sam... but I digress). With XML 1.1, you could actually use character references (like &#12;) to escape them. Too bad no one uses XML 1.1, and chances are few ever will (and this is due to, well, XML 1.1 sucking bad in many other respects -- one step fore, two back -- but I rantgress here).
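For the curious, the rule itself is short enough to write down. Here is a minimal sketch of the check, following the Char production of the XML 1.0 specification. The class and method names (XmlChars, isValidXml10, firstInvalidIndex) are made up for illustration and are not part of Woodstox or any other library.

```java
// Hypothetical helper (not part of Woodstox): checks text against the
// "Char" production of the XML 1.0 specification.
final class XmlChars {
    private XmlChars() { }

    /** True if the given code point may appear in XML 1.0 content. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, LF, CR
            || (cp >= 0x20 && cp <= 0xD7FF)          // excludes surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)        // excludes U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // supplementary planes
    }

    /** Index of the first invalid character, or -1 if the text is clean. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            // Step over both halves of a surrogate pair when needed
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

With this, XmlChars.firstInvalidIndex("Evil:\u000c!") returns 5, the position of the form feed, which is handy when you need to tell a data producer exactly where the offending character sits.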

2. Woodstox to the Rescue

So what is one to do? Most developers intuitively reach for the "how-do-I-disable-this-nasty-validation" button. Not so fast: while that is a possible work-around, it is not really a good solution. After all, broken "xml" content is still broken; you are just trying to sweep this inconvenient fact under the carpet.

Instead, one should try to rectify the problem at the source. Now: although sometimes the producer is not under your control (when you are being sent alleged "xml" content by someone not familiar with concepts like, say, xml...), quite often you do have control. If so, the first thing you should do is to verify that you never produce such pseudo-xml content with these evil characters. If not, you should pester the producer to read this blog entry. :-)

And this is where Woodstox 4.0.0 comes in [fanfare!]. Here is a new feature you might want to use to squash those pesky vertical tabs and their brethren:

  XMLOutputFactory f = new WstxOutputFactory();
  f.setProperty(WstxOutputProperties.P_OUTPUT_INVALID_CHAR_HANDLER,
      new InvalidCharHandler.ReplacingHandler(' '));
  XMLStreamWriter sw = f.createXMLStreamWriter(...);

So what does it do? In case you didn't guess, setting this property makes the stream writer silently replace all Java characters that are not valid xml characters with the given replacement character. This means that the following unit test should pass:

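  StringWriter w = new StringWriter();
  XMLStreamWriter sw = f.createXMLStreamWriter(w);
  sw.writeStartElement("a");
  sw.writeCharacters("Evil:\u000c!"); // form feed: not a valid xml 1.0 character
  sw.writeEndElement(); sw.close();
  assertEquals("<a>Evil: !</a>", w.toString());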

That works quite nicely: I just started using it myself, for a simple DB-to-xml data dumper (and yes, an address had a form feed in it).
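And if you are stuck with a writer that lacks this feature, roughly the same effect can be had by cleaning strings before handing them to the writer. The following is only a sketch of that idea, not how Woodstox does it internally; XmlSanitizer and replaceInvalid are hypothetical names used for illustration.

```java
// Hypothetical helper (not part of Woodstox): approximates the "replace
// invalid characters" behavior by cleaning a String before it is written.
final class XmlSanitizer {
    private XmlSanitizer() { }

    // XML 1.0 "Char" production
    private static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every invalid character replaced. */
    static String replaceInvalid(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValid(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            // Valid supplementary characters occupy two Java chars
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlSanitizer.replaceInvalid("Evil:\u000c!", ' ') gives "Evil: !". The downside compared to the factory property is that you have to remember to call it everywhere text enters the writer (character data, attribute values, comments and so on), which is exactly the kind of thing that gets forgotten.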

So if you are in the business of producing xml content, consider this a new tool for Greener data production. Woodstox to the rescue -- so that we can all breathe a little easier! (disclaimer: air pollution reduction not scientifically proven)

3. Small Print

Woodstox 4.0 is still in its pre-release phase, so while the latest release (3.9.9-1) has all the features detailed above functioning correctly, the official release has not yet been cut. Use at your own risk. D(r)ink responsibly. But most of all -- have fun!

About me

  • I am known as Cowtowncoder