A Modest Proposal for Rewriting Woodstox and Jackson Using Protocol Buffers
(Update, April 2nd: Never mind: changed my mind -- revert back to the old course!)
After playing some more with Google's Protocol Buffers implementation, I have become more and more impressed by it. It is easy to love its debuggability, expressiveness and extensive tool support. But most of all, it is the performance that has caught my attention. Those propeller heads at the big G have certainly boosted performance to Infinity and Beyond (almost to 11, dare I say).
Given its superior performance, I figure it is pointless to continue working on direct approaches to parsing JSON and XML.
Instead, my new plan -- effective immediately! -- is to retool the Woodstox (XML) and Jackson (JSON) parsers to make use of some PB goodness. Here is how I think it can be done.
1. For main parsing and generation, use Protocol Buffers
The core reading and writing of content should be done using ProtoBuf, and consequently all content needs to be in the compact ProtoBuf binary data format.
While this is the obvious right way to go, it does create some problems, because existing legacy applications will expect "native" APIs to process content. And on the other hand, legacy content will still use the textual data formats in question.
So to make things work, there is a need for just a wee bit of "glue" both above and below the ProtoBuf layer.
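For readers who have not looked inside the "compact binary data format" yet, here is a minimal sketch of how a single integer field ends up on the wire: a key byte built from the field number and wire type, followed by the value as a base-128 varint. The class and method names are my own hypothetical helpers, not part of the Protocol Buffers library, and the sketch covers only unsigned varint fields.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of the Protocol Buffers library.
final class VarintSketch {

    // Writes an unsigned value as a base-128 varint: 7 payload bits per byte,
    // with the high bit set on every byte except the last one.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes one varint-typed field: key = (fieldNumber << 3) | wireType,
    // where wire type 0 means "varint", followed by the value itself.
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | 0);
        writeVarint(out, value);
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex, for display only.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field number 1 carrying the value 150.
        System.out.println(hex(encodeVarintField(1, 150))); // prints "08 96 01"
    }
}
```

Three bytes for a tagged integer is what "compact" means here; the textual formats need the glue layers precisely because they look nothing like this.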
2. Below ProtoBuf, use simple light-weight converters
For XML, the natural light-weight translation mechanism from textual XML into PB format is XSLT, possibly augmented with XSLT 2.0 type information derived from dynamically generated Schema types. If necessary (for example, if performance is not as good as expected), these can be converted to binary XML at runtime (and possibly re-parsed using EXI Binary XML), to minimize the amount of processing done using the inefficient textual format. And if nothing else works, it is always possible to add more layers to improve efficiency.
For JSON, the choices are more limited, but I am confident that some combination of JsonPath and YAML should do the trick. Another possibility would be to use something like the BadgerFish mapping convention (for binary data, I am thinking of defining a straightforward complementary mapping, code named "StinkySock", but more on that later on).
3. Above ProtoBuf, use some more converters
Above PB, some limited amount of glue is also needed, to produce the kinds of events the current crop of applications needs (Stax, Jackson API). The simplest mapping for XML seems to be to use the SAX API first (since that is easy to expose). But as Stax sources cannot use SAX (push vs pull), it will be necessary to use an intermediate structure; DOM seems like just the simple thing to use. And since DOM can be read via DOMSource, it is easy to produce Stax events from there (Woodstox can actually already do that, which makes it totally trivial).
I will leave the details of converting from Protocol Buffer tokens into JSON for interested readers -- suffice it to say that it should be possible to concoct a similarly simple and elegant solution as outlined above, without undue effort.
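The SAX-to-DOM-to-Stax leg of that chain can be sketched with nothing but the standard JAXP and Stax APIs. This is only a rough illustration under a few assumptions: the class and method names are hypothetical, the ProtoBuf layer is left out entirely, and because not every Stax implementation accepts a DOMSource (I believe the one bundled with the JDK does not), the sketch falls back to serializing the DOM and re-parsing it as text, which conveniently adds one more layer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Woodstox or the JDK.
final class SaxDomStaxSketch {

    // Step 1: push-style SAX parsing, captured into a DOM tree
    // by running an identity transform from a SAXSource to a DOMResult.
    static Node saxToDom(String xml) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult dom = new DOMResult();
        identity.transform(new SAXSource(new InputSource(new StringReader(xml))), dom);
        return dom.getNode();
    }

    // Step 2: pull-style Stax reader over the DOM tree.
    static XMLStreamReader domToStax(Node dom) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            // Works when the Stax implementation on the classpath supports DOMSource.
            return factory.createXMLStreamReader(new DOMSource(dom));
        } catch (RuntimeException | XMLStreamException e) {
            // Assumed fallback for implementations that reject DOMSource:
            // serialize the DOM back to text and parse that instead.
            StringWriter text = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(dom), new StreamResult(text));
            return factory.createXMLStreamReader(new StringReader(text.toString()));
        }
    }

    // Runs the whole chain and collects the local names of all start elements.
    static List<String> startElementNames(String xml) throws Exception {
        XMLStreamReader reader = domToStax(saxToDom(xml));
        List<String> names = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
        reader.close();
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startElementNames("<order><item>widget</item></order>"));
    }
}
```

With Woodstox as the Stax implementation, the first branch of domToStax should be all that is needed; the fallback exists only so the sketch runs on a bare JDK.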
4. Want to know more?
Although there are still some details open, there is much more to discuss -- instead of boring you here, feel free to read more on my plans.
As usual, please let me know what you think -- I am very excited about this new approach!