Saturday, April 19, 2008

How does one parse "XML" documents with multiple roots?

OK, sure, the title is a bit of a trick question: after all, no XML document is allowed to have more (or less) than one root element. So the correct answer would appear to be "one does not". But there are ways to phrase the question more properly, for example by considering there to be implicit (and/or incomplete, insufficient, missing) framing; failing to handle that framing leads to what looks like a "forest of XML documents". Or perhaps one just wants to parse an "XML fragment", which can consist of multiple top-level elements. And sometimes business reasons dictate that one just has to deal with broken stuff. Money talks and bullshit gets worked with.
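To make the "missing framing" idea concrete, here is a minimal sketch that supplies the framing by hand, using nothing but the StAX API that ships with the JDK. The class name, the helper methods and the wrapper element name are all made up for illustration. The sketch also assumes that the input is UTF-8 and carries no XML declaration or DOCTYPE of its own, since neither would be legal inside the wrapper element.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SyntheticRootExample {
    // Hypothetical wrapper name; pick one that cannot clash with real content.
    static final String WRAPPER = "synthetic-root";

    // Surrounds the fragment with <synthetic-root> ... </synthetic-root>.
    // Assumption: the fragment is UTF-8 and has no XML declaration or DOCTYPE.
    static InputStream wrap(InputStream fragment) {
        InputStream open = new ByteArrayInputStream(
                ("<" + WRAPPER + ">").getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream(
                ("</" + WRAPPER + ">").getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    // Returns the names of the fragment's top-level elements, in document order.
    static List<String> topLevelElementNames(InputStream fragment) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(wrap(fragment));
        List<String> names = new ArrayList<>();
        int depth = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    // Depth 1 is the synthetic wrapper; depth 2 is a "root" of the fragment.
                    if (depth == 2) {
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] input = "<a>1</a><b><c/></b>".getBytes(StandardCharsets.UTF_8);
        System.out.println(topLevelElementNames(new ByteArrayInputStream(input)));
    }
}
```

This works with any StAX implementation, but it falls apart as soon as the "documents" in the forest bring their own XML declarations, which is exactly the kind of input where parser-level support is the better tool.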

With this background, it is nice to know that the Woodstox XML parser can indeed deal with such non-standard XML constructs. To do this, one has to venture into Woodstox-specific input properties: specifically, use com.ctc.wstx.api.WstxInputProperties#P_INPUT_PARSING_MODE, and set it (via inputFactoryInstance.setProperty(...)) to one of the non-default values (PARSING_MODE_DOCUMENTS or PARSING_MODE_FRAGMENT). Best of all, you can just read this nice article for actual code samples and more musing on why this sometimes needs to be done. The article is, I think, yet another example of how the user community is really what makes good things great in the Open Source ecosystem. Maybe I should figure out a way to more systematically link to such stories from the Woodstox project page?
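For the impatient, here is a minimal sketch of what setting that property might look like. It is not the code from the article; the class and method names are invented for illustration, and it assumes the Woodstox jar is on the classpath. It uses the fragment mode to read input with two top-level elements and collects their names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import com.ctc.wstx.api.WstxInputProperties;
import com.ctc.wstx.stax.WstxInputFactory;

public class MultiRootExample {
    // Hypothetical helper: returns the names of all top-level elements in the input.
    static List<String> topLevelElementNames(String xml) throws XMLStreamException {
        // Instantiate the Woodstox factory directly, so that the
        // Woodstox-specific property is guaranteed to be understood.
        XMLInputFactory factory = new WstxInputFactory();
        factory.setProperty(WstxInputProperties.P_INPUT_PARSING_MODE,
                WstxInputProperties.PARSING_MODE_FRAGMENT);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        int depth = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Depth 0 means we are not inside any element yet,
                    // so this start tag opens one of the "roots".
                    if (depth == 0) {
                        names.add(reader.getLocalName());
                    }
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(topLevelElementNames("<a>1</a><b><c/></b>"));
    }
}
```

As far as I understand the two modes, PARSING_MODE_FRAGMENT is the one to use for content with several top-level elements and no declarations of its own, whereas PARSING_MODE_DOCUMENTS is meant for a stream of complete documents written back to back.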

About me

  • I am known as Cowtowncoder