Saturday, March 12, 2011

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

  1. It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
  2. It is the only open-source Java parser that can do non-blocking parsing (aka asynchonous, or async, parsing)

Former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast for parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor real documentation for use of non-blocking part of Aalto; and also because this part of functionality has been only completed fairly recently (while some parts of functionality were written up to two years ago, last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. Non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

Basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of parser using a stream or reader to read content, and blocking the thread if none is available, content is rather "pushed" to parser; and parser will give out processed events if there is enough content available. This is similar to how many C parsers work; as well as operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater).

The main benefit of non-blocking operation is ability to process multiple XML input sources without having to allocate one thread per source, same benefit as that NIO has for basic web services. And in fact, having a non-blocking parser is something that could benefit non-blocking web services a lot: without such parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; exact number depends on platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, cand benefit from this mode of operation.
In addition, another related benefit is that memory usage of non-blocking parser can be more close bounded: since limited amount of input is buffered at any given point, amount of working memory can be more limited (at least when not forcing coalecing of XML text segments).

On downside, writing code to use non-blocking parsing can be slightly more complex to write: and given lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts doing initially. But it can be a very useful technique for subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into parser, single byte at a time. In actual production use one would typically feed content block at a time; either fully read blocks, or chunks of contents as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by input source (Netty), or be input source specific.


  byte[] XML = "<html>Very <b>simple</b> input document!</html>";
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(buf, inputPtr++, 1);
if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
feeder.endOfInput(); } } // and once we have full event, we just dump out event type (for now) System.out.println("Got event of type: "+type); // could also just copy event as is, using Stax, or do any other normal non-blocking handling: // xmlStreamWriter.copyEventFromReader(asyncReader, false); } while (type != END_DOCUMENT); asyncReader.close();

And that's it. There are actually just couple of additional things needed to do non-blocking parsing:

  1. Use of regular Stax API, with just a single extension, introduction of new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned if there isn't enough content buffered to fully construct a token to return
  2. Feeding of content using AsyncInputFeeder (instance of which is accessed via AsyncXMLStreamReader, extension of basic XMLStreamReader)
  3. Indicating end-of-content via feeder when all content has been read

Which makes operation bit more complicated than use of straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

  • Coalescing mode has not been implemented for non-blocking Stax. Since use of coalescing (of all adjacent text segments, as per Stax spec) is probably less important for non-blocking use cases than blocking ones (as it will increase need for buffering, possible increase latency), it was less as the last major piece to be completed.
  • There isn't yet non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to SAX API itself (one just has to call "XMLReader.parse()" multiple times; but as it is based on same parser core as Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, non-blocking mode is less well tested than blocking mode: blocking mode can use full basic StaxTest suite, used succesfully for years with Woodstox (and for Aalto for more than a year as well).

Monday, February 28, 2011

Jackson: not just for JSON, Smile or BSON any more -- Now With XML, too!

One of first significant new Jackson extension projects (result of Jackson 1.7 release which made it much easier to provide modular extensions) is jackson-xml-databind, hosted at GitHub. Although this extension is still in its pre-1.0 development phase, the latest released version is fully usable as is and is even in some limited production use by some brave developers (running on Google AppEngine, of all things!).

So it is probably a good idea to now give a brief overview of what this project is all about.

1. What is jackson-xml-databind?

Jackson-xml-databind comes in a small package (jar is only about 55 kB) , and is used with Jackson data binding functionality (jackson-mapper jar). It provides basic replacement for JsonFactory, JsonParser and JsonGenerator components of Jackson Streaming API, and allows reading and writing of XML instead of JSON, in context of generic Jackson data binding functionality. In addition, core ObjectMapper is also sub-classed to provide customized versions of couple of other provider types, so typically all usage is done by creating com.fasterxml.jackson.xml.XmlMapper instead of ObjectMapper, and using it for data binding.

2. What is it used for?

This package is used to read XML and convert it to POJOs, as well as to write POJOs as XML. In this respect it is very similar to JAXB (javax.xml.bind) package; and an alternative for many other Java XML data binding packages such as XStream and JibX. Given Jackson support for JAXB annotations, it can be especially conveniently used as a JAXB replacement in many cases.

Functionality supported is in some ways a subset of JAXB, and in other ways a superset: XML-specific functionality is more limited (no explicit support for XML Schema), but general data binding functionality is arguably more powerful (since it is full set of Jackson functionality).

Two obvious benefits of this package compared to JAXB or other existing XML data binding solutions (like XStream) are superior performance -- with fast Stax XML parser, this is likely the fastest data binding solution on Java platform (see jvm-serializers for results) -- and extensive and customizable data POJO conversion functionality, using all existing Jackson annotations and configuration options. The main downside currently is potential immaturity of the package; however, this only applies to interaction between mature XML packages (stax implementation) and Jackson data binder (which is also fairly mature at this point).

3. So how do I use it?

If you know how to use Jackson with JSON, you know almost everything you need to use this package. The only other thing you need to know is that there has to be a Stax XML parser/generator implementation available. While JDK 1.6 provides one implementation, your best best is using something bit more efficient, such as Woodstox or Aalto. Both should work fine; Aalto is faster of two, but Woodstox is a more mature choice. So you will probably want to include one of these Stax implementations when using jackson-xml-databind.

Other than this, all you need to do is to construct XmlMapper:

  XmlMapper mapper = new XmlMapper(); // can also specify XmlFactory to 
  use, to override Stax factories used

and use it like you would any other ObjectMapper, like so:

  User user = new User(); // from Jackson-in-five-minutes sample
String xml = mapper.writeValueAsString(user);

and what you would get is something like:

<User>
  <name>
    <first>Joe</first>
    <last>Sixpack</last>
  </name>
  <verified>true</verified>
  <gender>MALE</gender>
  <userImage>AQIDBAU=</userImage>
</User>

which is equivalent of JSON serialization that would look like:

{
  "name":{
    "first":"Joe",
    "last":"Sixpack"
  },
  "verified":true,
  "gender":"MALE",
  "userImage":"AQIDBAU="
}

Pretty neat eh?

Oh, and reverse direction obviously works similarly:

  User user = mapper.readValue(xml, User.class);

There is really nothing extra-ordinary in it usage; just another way to use Jackson for slicing and dicing your POJOs.

4. Limitations

While existing version works pretty well in general, there are some limitations. These mostly stem from the basic difference between XML and JSON logical models; and specifically affect handling of Lists/arrays. XmlMapper for example only allows so-called "wrapped" lists (for now); meaning that there is one wrapper XML element for each List or array property, and separate element for each List item.

Compared to JAXB (and related to JAXB annotation support), no DOM support is included; meaning, it is not possible to use converters that take or produce DOM Elements.

With respect to Jackson functionality, while polymorphic type information does work, some combinations of settings may not work as expected.

And given project's pre-1.0 status, testing is not yet as complete as it needs to be, so other rough edges may also be found. But with help of user community I am sure we can polish these up pretty quickly.

5. Feedback time!

So what is needed most at this point? Users, usage, and resulting bug (or, possibly, success) reports! Seriously, more usage there is, faster we can get the project up to 1.0 release.

Happy hacking!

Saturday, November 20, 2010

StaxMate 2.0.1 released; improved DOM-from-Stax, compatibility with default JDK 1.6 Stax implementation

Quick update from "XML world" -- in which I have spent much less time, due to explosive growth in JSON land: StaxMate 2.0.1 was just released.

1. StaxMate?

First question you might ask is "What the heck is StaxMate?". Fair enough -- given how little attention it has gotten, here is the main idea.

StaxMate is meant to offer "convenience of DOM with performance of Stax (or SAX)". Although Stax API was an improvement in usability for many use cases, it is still a rather low-level access API. StaxMate builds concept of "cursors" when reading content; and output context objects when writing content. Sample code and bit more in-depth explanation can be found from StaxMate Tutorial page; but basic idea is to offer better abstractions than simple flat event iterator. Sort of like how automatic transmission can simplify driving, compared to manual stick shift.

Working with cursors is typically similar to how DOM documents are traversed in simple top-down (recursive-descent) fashion: you start with root element, get child elements, locate more children, textual content and so forth. Same is done with StaxMate, with just one crucial limitation: all access must be done in document order (parent first, them children, in order they are in XML document). If you need to retain some information, you will do it explicitly (attribute values from parents need to be access before child elements, for example). StaxMate will take care to synchronize access when you use child cursors, so will never need to worry about skipping remaining siblings; you just can not access things in random order. Same is also true for output side; although there are ways to temporarily "freeze" output which does allow building content somewhat out-of-order, as necessary. This may be necessary for doing things like calculating parent attribute values based on content written for child elements.

The benefit of requiring access to be done in document order is that it means that there is no additional performance or memory overhead for keeping track of past content. Memory usage, therefore, is not very different from that of "raw" Stax parser or generator; same is true for performance. Overhead of DOM documents is often 3x - 5x that of streaming access; overhead of using StaxMate is typically in 10-20% range, sometimes even lower.

2. Fixes in 2.0.1

This patch release contains just 2 fixes, but both are quite important, so upgrade is strongly recommended.

First fix is to DOM-compatibility part (see "Reading DOM documents using Stax XML parser, StaxMate" for details on usage). It turns out that although building full DOM document worked fine with 2.0.0, there were issues if binding sub-trees; these issues should now be resolved.

Second fix is to interoperability with Stax parsers that do not implement Stax2 extension API (to date, Woodstox and Aalto do implement this, but not others; most notably, Sun Sjsxp which is the default Stax parser bundled with JDK 6). Although most operations work just fine, Typed Access accessors (getting XML element text as number, boolean value, enum) could cause state update to work incorrectly, leading to issues when accessing sequence of typed values. This has been resolved, by fixing the underlying problem in Stax2 API reference implementation library that StaxMate depends (version 3.0.4 of the library contains fixes).

Thursday, December 31, 2009

Upgrading from Woodstox 3.x to 4.0

It has now been almost one year since Woodstox 4.0 was released.
Given this, it would be interesting to know how many Woodstox users continue using older versions, and how many have upgraded.

My guess (somewhat educated, too, based on bug reports and some statistcs on Maven dependencies) is that adoption has been quite slow. I think this is primarily due to 3 things:

  1. Older versions work well, and fulfill all current needs of the user
  2. New functionality that 4.0 offers is not widely known, and/or is not (currently!) needed
  3. There are concerns that because this is a major version upgrade, upgrade might not go smoothly.

I can not argue against (1): Woodstox has been a rather solid product since first official releases; and 3.2 in particular is a well-rounded rock solid XML processor (if you are using an earlier version, however, at least upgrade to latest 3.2 patch version, 3.2.9!).
And with respect to (2), I have covered most important pieces of new functionality, Typed Access API and Schema Validation.

But so far I have not written anything about incompatible changes between 3.2 and 4.0 versions. So let's rectify that omission.

1. Why Upgrade?

But first: maybe it is worth iterating couple of reasons why you might want to upgrade at all:

  1. You might want to validate XML documents you read or write against W3C Schema (aka XML Schema). Earlier versions only allowed validating against DTDs and Relax NG schemas
  2. If you want to access typed content -- that is, numbers, XML qualified names, even binary content, contained as XML text -- new Typed Access API simplifies code a lot, and also makes it more efficient.
  3. Latest versions of useful helper libraries like StaxMate require Woodstox 4.0 (StaxMate 2.0 needs 4.x, for example)
  4. No new development will be done for 3.2 branch; and eventually not even bug fixes.

Assuming you might want to upgrade, what possible issues could you face?

2. Backwards incompatible changes since 3.2

Based on my own experiences, there are few issues with upgrade. Although the official list of incompatibilities has a few entries, I have only really noticed one class of things that tend to fail: Unit tests!

Sounds bad? Actually, yes and no: no, because these are not real failures (ones I have seen). And yes, since it means that you end up fixing broken test code (extra overhead without tangible benefits). But this is one of challenges with unit tests: fragility is often desireable, but not always so.

Specific problem that I have seen multiple times is related to one cosmetic aspect of XML: inclusion of white space with elements.

Woodstox 3.2 used to output empty elements with "extra" white space, like so:

<empty />

but 4.0 will not add this white space:

<empty/>

(this is a new feature as per WSTX-125 Jira entry)

and so some existing unit tests for systems I have worked on compare literal XML for output tests. This is not optimal, but it is bit less work than writing tests in more robust way, to check for logical (not physical) equality. So whereas they formerly assume existence of such white space, tests need to be modified not to expect it (or allow either way).

3. Other challenges?

Actually, I have not seen any actual problems, or other cosmetic problems. But here are other changes that are most likely to cause compatibility problems (refer to the full list mentioned earlier for couple of changes that are much less likely to do so):

  • "Default namespace" and "no prefix" are now consistently reported as empty Strings, not nulls (unless explicitly specified otherwise in relevant Stax/Stax2 Javadocs). Usually this does not cause problems, because Stax-dependant code has had to deal with inconsistencies with other Stax implementations; but could cause problems if code is expecting null.
  • "IS_COALESCING" was (accidentally) enabled for Woodstox versions prior to 4.0. This was fixed for 4.0 (as per Stax specification), but it is possible that some code was assuming on never getting partial text segments (if developer was not aware of Stax allowing such splitting of segment, similar to how SAX API does it.

4. Upgrade or not?

I would recommend investigating upgrade; if for nothing else, because of maintenance aspect. Pre-4.0 versions will not be actively maintained in future. But it is good to be aware of what has changed, and of course having good set of unit tests should guard against unexpected problems.

And hey, it's soon 2010 -- Woodstox 3.2 is soooo 2008. :-)

Wednesday, October 28, 2009

Data Format anti-patterns: converting between secondary artifacts (like xml to json)

One commonly asked but fundamentally flawed question is "how do I convert xml to json" (or vice versa).
Given frequency at which I have encountered it, it probably ranks high on list of data format anti-patterns.

And just to be clear: I don't mean that there is any problem in having (or wanting to have) systems that produce data using multiple alternative data formats (views, representations). Quite on contrary: ability to do so is at core of REST(-like) web services, which are one useful form of web services. Rather, I think it is wrong to convert between such representations.

1. Why is it Anti-pattern?

Simply put: you should never convert from secondary (non-authoritative) representation into another such representation. Rather, you should render your source data (which is usually in relational model, or objects) into such secondary formats. So: if you need xml, map your objects to xml (using JAXB or XStream or what you have); if you need JSON, map it using Jackson. And ditto for the reverse direction.

This of course implies that there are cases where such transformation might make sense: namely, when your data storage format is XML (Native Xml DBs) or Json (CouchDB). In those cases you just have to worry about the practical problem of model/format impedance, similar to what happens when doing Object-Relational Mapping (ORM).

2. Ok: simple case is simple, but how about multiple mappings?

Sometimes you do need multi-step processing; for example, if your data lives in the database. Following my earlier suggestion, it would seem like you should convert directly from relational model (storage format) into resulting transfer format (json or xml). Ideally, yes: if there are such conversions. But in practice it is more likely that a two-phase mapping (ORM from database to objects; and then from objects to xml or json) works better: mostly because there are good tools for separate phases, but fewer that would do the end-to-end rendition.

Is this wrong? No. To understand why, it is necessary to understand 3 classes of formats that are talking about:

  • Persistence (storage) format, used for storing your data: usually relational model but can be something else as well (objects for object DBs; XML for native XML databases)
  • Processing format: Objects or structs of your processing language (POJOs for Java) that you use for actual processing. Occasionally this can also be something more exotic; like XML when using XSLT (or relational data for complicated reporting queries)
  • Transfer format: Serialization format used to transfer data between end points (or sometimes time-shifting, saving state over restart); may be closely bound to processing format (as is the case for Java serialization)

So what I am really saying is that you should not transfer within a class of formats; in this case between 2 alternate transfer formats. It is acceptable (and often sensible) to do conversions between classes of formats; and sometimes doing 2 transforms is simpler than trying to one bigger one. Just not within a class.

3. Three Formats may be simpler than Just One

One more thing about above-mentioned three formats: there is also a related fallacy of thinking that there is a problem if you are using multiple formats/models (like relational model for storage, objects for processing and xml or json for transfer). Assumption is that additional transformations needed to convert between representations is wasteful enough to be a problem in and of itself. But it should be rather obvious why there are often distinct models and formats in use: because each is optimal for specific use case. Storage format is good for, gee, storing data; processing model good for efficiently massaging data, and transfer format good for piping it through the wire. As long as you don't add gratuitous conversions in-between, transforming on boundary is completely sensible; especially considering alternative of trying to find a single model that works for all cases. One only needs to consider case of "XML for everything" cluster (esp. XML for processing, aka XSLT) to see why this is an approach that should be avoided (or, Java serialization as transfer format -- that is another anti-pattern in and of itself).

Thursday, October 01, 2009

Critical updates: Woodstox 4.0.6 released

This just in: Woodstox 4.0.6 was released, and it contains just one fix; but that one to a critical problem (text content truncation for long CDATA sections, when using XMLStreamReader.getElementText()). Upgrade is highly recommended for anyone using earlier 4.0 releases.

One more potentially useful addition is that I uploaded "relocation" Maven pom, for non-existing artifact "wstx-asl" v4.0.6 (the real id is "woostox-core-asl", as of 4.0; "wstx-asl" was used with 3.2 and previous). This was suggested by a user, to make upgrade bit less painful -- problem is that Woodstox tends to be one of those ubiquitous transitive dependencies to anyone running a Soap service (or nowadays almost any server-side XML processing system).

Next big thing should then be Jackson-1.3, stay tuned!

Friday, September 18, 2009

Typed Access API tutorial, part III/b: binary data, server-side

(note: this is part B of "Typed Access API tutorial: binary data"; first part can be found here)

1. Server-side

After implementing the client, let's next implement matching sample service that simply reads all files from a directory and creates download message that contains all files along with checksums for verifying their correctness (in real use case, those would probably be pre-computed). Simplest way to deploy service is as a Servlet-based web application; a single class and matching web.xml will do the trick.

Resulting code is meant to just show how (relatively) simple handling of binary data is -- obviously a real client and service would have much more checking for error cases, as well as for authentication, authorization, namespacing to avoid collision and so on.

Full source code can be found from Woodstox source code repository (see 'src/samples/BinaryService.java') but here is the beef:


    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException
    {
        resp.setContentType("text/xml");
        try {
            writeFileContentsAsXML(resp.getOutputStream());
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    final static String DIGEST_TYPE = "SHA"; 

private void writeFileContentsAsXML(OutputStream out) throws IOException, XMLStreamException { XMLStreamWriter2 sw = (XMLStreamWriter2) _xmlOutputFactory.createXMLStreamWriter(out); sw.writeStartDocument(); sw.writeStartElement("files"); byte[] buffer = new byte[4000]; MessageDigest md; try { md = MessageDigest.getInstance(DIGEST_TYPE); } catch (Exception e) { // no such hash type? throw new IOException(e); } for (File f : _downloadableFiles.listFiles()) { sw.writeStartElement("file"); sw.writeAttribute("name", f.getName()); sw.writeAttribute("checksumType", DIGEST_TYPE); FileInputStream fis = new FileInputStream(f); int count; while ((count = fis.read(buffer)) != -1) { md.update(buffer, 0, count);
// note: can write separate chunks without problems sw.writeBinary(buffer, 0, count); } fis.close(); sw.writeEndElement(); // file sw.writeStartElement("checksum"); sw.writeBinaryAttribute("", "", "value", md.digest()); sw.writeEndElement(); // checksum } sw.writeEndElement(); // files sw.writeEndDocument(); sw.close(); }

As with the client, there really isn't anything too special here. Just the usual service, with bit of Stax2 Typed Access API usage.

I briefly tested this by bundling it up as a web app (if you want to do the same, run Ant target "war.samples" in Woodstox trunk), running web app under Jetty 6.1, and accessing from both web browser and via BinaryClient class. Worked as expected right away (which, granted, was somewhat unexpected... usually there are minor tweaks needed, but not today).

2. Output

Just to give an idea of what results should look like, here's what I can see when download a single file (run.sh):


<?xml version='1.0' encoding='UTF-8'?>
<files><file name="run.sh" checksumType="SHA">IyEvYmluL3NoCgojIExldCdzIGxpbWl0IG1lbW9yeSwgZm9yIHBlcmZvcm1hbmNlIHRlc3RzIHRv IGFjY3VyYXRlbHkgY2FwdHVyZSBHQyBvdmVyaGVhZAoKIyAtRGphdmEuY29tcGlsZXI9IC1jbGll bnQgXApqYXZhIC1YWDpDb21waWxlVGhyZXNob2xkPTEwMDAgLVhteDQ4bSAtWG1zMTZtIC1zZXJ2 ZXJcCiAtY3AgbGliL3N0YXgtYXBpLTEuMC4xLmphcjpsaWIvc3RheF9yaS5qYXJcCjpsaWIvbXN2 L1wqXAo6bGliL2p1bml0L2p1bml0LTMuOC4xLmphclwKOmJ1aWxkL2NsYXNzZXMvd29vZHN0b3g6 YnVpbGQvY2xhc3Nlcy9zdGF4MlwKOnRlc3QvY2xhc3NlczpidWlsZC9jbGFzc2VzL3Rvb2w6YnVp bGQvY2xhc3Nlcy9zYW1wbGVzXAogJCoK</file><checksum value="qAZIQ6GDUJYRgiubW/H+5GZaWg0="/></files>

3. More to known about Base64 variants

One more thing to note is the existence of multiple slightly incompatible Base64 variants (see "URL Applications" section). So which one does Typed Access API use?

The one you define it to use, of course! Stax2 API actually allows caller to specify the variant to use -- sample code just happens to use the default variant (i.e. uses methods that just call alternatives that do take a Base64Variant argument). Stax2-defined Base64 variants (from class 'org.codehaus.stax2.typed.Base64Variants') are:

  • MIME: this is what is usually considered "the base64" variant: uses default alphabet, requires padding, and uses 76-character lines with linefeed for content. This is the default variant used for element content.
  • MIME_NO_LINEFEEDS is similar to MIME, but does not split output in lines -- this is the default variant used for attribute values (due to verbosiveness caused by encoding linefeeds in XML attribute values)
  • PEM is similar to MIME, but mandates shorter (60 character) line length
  • MODIFIED_FOR_URL: uses alternate alphabet (hyphen and underscore instead of plus and slash), does not use padding or line splitting.

And these are all implemented by Woodstox. In addition, one can use custom encodings by implementing custom Base64Variant object and passing that explicitly to base64-binary read- and write-methods.

4. Performance?

Beyond simple usage shown so far, what more is there to know about handling binary data?

One open question is performance: how much faster is Typed Access API, compared to using alternatives like XMLStreamReader.getElementText() followed by decode using, say, JakartaCommons' base64 codec. There are no numbers yet, but producing some will be one of high priority items on my "things to research for Blog" list.

Tuesday, September 08, 2009

Typed Access API tutorial, part III/a: binary data, client-side

(author's note: oh boy, this last piece of the "Typed Access API series" has been long coming -- apologies, and "better late than never")

Now that we have tackled most of the Stax2 Typed Access API (reading and writing simple values, arrays), let's consider the last remaining part: that of reading and writing base64-encoded binary data. For this installment, let's implement a simple web service that can be used for downloading files, as well as client to use that service.

Use of XML for such purpose may seem bit contrived, but there are other valid use cases for binary-in-xml (even if the example wasn't): for example, it may well make sense to embed small images (like icons), digital signatures, encryption keys and other non-textual data within documents. Sometimes convenience of inlining binary content within message is worth the modest overhead (base64 imposes +33% storage overhead, and similar processing overhead).
For example, in our example, we can embed multiple files with associated metadata quite easily without having to split the logical document. But both client and server can still handle files one-by-one with streaming interfaces, meaning that memory usage need not grow without bounds.

Finally, unlike many other xml processing packages, Woodstox does not cut corners when it comes to processing efficiency: base64 processing implementation is a significant improvement over using existing third-party base64 codes on other processing APIs (regular SAX, Stax or DOM).

So much for the philosophic part of why to use (or not to use) xml. Let's have look at a simple implementation to show binary content handling pieces that we need, along with a bit of glue to make example code work.
(note: source code is also accessible)

1. Message format

Here is the simple xml message format we will be using:

  <files>
<file name="test.jpg" checksumType="SHA">... base64 encoded content ...</file>
<checksum value="...base64 encoded hash of content..." />
<!-- ... and more files, if need be... -->
</files>

That is, a single message contains one or more files, each with associated checksum. Checksym is used to verify that contents were passed unmodified (as opposed to being corrupted by transfer). Simple but functional.

2. Client-side

So let's start with sample client code; code downloads bunch of files from the service (for now assuming URL determines set of files we'll get with some criteria).

For this example we will just use the regular http client that JDK comes equipped with (which actually works pretty well for many use cases -- for others, Jakarta httpclient is the cat's meow).
Full source code can be found at Woodstox SVN repository (under 'src/samples') but here's the interesting Client method:


public List<File> fetchFiles(URL serviceURL) throws Exception
{
  List<File> files = new ArrayList<File>();
URLConnection conn = serviceURL.openConnection(); conn.setDoOutput(false); // only true when POSTing conn.connect(); // note, should check 'if (conn.getResponseCode() != 200) ...' // Ok, let's read it then... (note: StaxMate could simplify a lot!) InputStream in = conn.getInputStream(); XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory.newInstance().createXMLStreamReader(in); sr.nextTag(); // to "files" File dir = new File("/tmp"); // for linux... byte[] buffer = new byte[4000]; while (sr.nextTag() != XMLStreamConstants.END_ELEMENT) { // one more 'file' String filename = sr.getAttributeValue("", "name"); String csumType = sr.getAttributeValue("", "checksumType"); File outputFile = new File(dir, filename); FileOutputStream out = new FileOutputStream(outputFile); files.add(outputFile); MessageDigest md = MessageDigest.getInstance(csumType); int count; // Read binary contents of the file, calc checksum and write while ((count = sr.readElementAsBinary(buffer, 0, buffer.length)) != -1) { md.update(buffer, 0, count); out.write(buffer, 0, count); } out.close(); // Then verify checksum sr.nextTag(); byte[] expectedCsum = sr.getAttributeAsBinary(sr.getAttributeIndex("", "value")); byte[] actualCsum = md.digest(); if (!Arrays.equals(expectedCsum, actualCsum)) { throw new IllegalArgumentException("File '"+filename+"' corrupt: content checksum does not match expected"); } sr.nextTag(); // to match closing "checksum" } return files; }

Much of the code deals with connecting to the service; actual access is rather simple; only complexity comes from streamability of API (i.e. you read chunks of binary data, instead of reading the whole thing).

What is left, then, is the server side... which will follow shortly (I swear, won't take months this time)

Thursday, July 09, 2009

Are GAE developers a bunch of

ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: API is a mess, concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shook my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right second time around. No such luck: Google App Engine was a stillbirth; plagued by very similar problem as Android. Most specifically, significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to GAE team). But this is just very ridiculous (dare I say, retarded?) argument, because:

  1. Being but an API package, there is no functionality that could possibly have security implications (yes, l know exactly what is within those few classes -- the only actual code is for implementation discover, which was copied from SAX), and
  2. If there are problems with implementations of the API (which should be irrelevant, but humor me here), same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about package in question. I could as well ask about regular expression or image processing packages it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is little bit more pressure in fixing these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both doing "their own things", although those being very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc), but not managed runtime execution service.

Saturday, June 27, 2009

Woodstox, high impact factor & being #32 on Top Open Source Java libs list

Another interesting data point, this time from analysing Maven Dependency paths: "Most Referenced" list. Looks like Woodstox is quite widely used by projects that use or at least declare their dependencies using Maven: I assume magic number 1838 (which gives rank #32) could mean number of other projects depending on Woodstox. Not too shabby for an xml parser. Getting on the first result page is quite remarkable; especially considering that Woodstox ranks higher than many other worthy Java open source libraries like XStream, Hibernate, Quartz, Xalan and Velocity. And only slightly (by about 50% :-) ) trailing such ubiquitous thingy as Spring.

Although this is just one of way of estimating popularity of various (Java) OS libs, it is still interesting, because it has similarities to how scientific articles are ranked (impact factor; although here weights are uniform). And also since it could lend itself to Google PageRank style extensions as well... let's see.

Related Blogs

(by Author (topics))

Powered By

Powered by Thingamablog,
Blogger Templates and Discus comments.

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.