Tuesday, January 20, 2009

Json processing with Jackson: Method #1/3: Reading and Writing Event Streams

(for background, refer to the earlier "Three Ways to Process Json" entry)

To continue with the thesis of "exactly 3 methods to process structured data formats (including Json)", let's have a look at the first alleged method, "Iterating over Event Streams" (for reading; and "Writing to an Event Stream" for writing).
I must have already written a bit about this approach, given that it is the approach that Jackson has used from the very beginning. But, as the Romans put it: "Repetitio est mater studiorum". So let's have (yet another) look at how Jackson allows applications to process Json content via a Stream-of-Events (SoE?) abstraction.

1. Reading from Stream-of-Events

Since Stream-of-Events is just a logical abstraction, not a concrete thing, the first thing to decide is how to expose it. There are multiple possibilities; and here too there are 3 commonly used alternatives:

  1. As an iterable stream of Event Objects. This is the approach taken by the Stax Event API. Benefits include simplicity of access, and object encapsulation, which allows holding onto Event objects during processing.
  2. As callbacks that denote Events as they happen, passing all data as callback arguments. This is the approach the SAX API uses. It is highly performant and type-safe (each callback method, one per event type, can have distinct arguments) but may be cumbersome to use from the application's perspective.
  3. As a logical cursor that allows accessing concrete data regarding one event at a time. This is the approach taken by the Stax Cursor API. The main benefit over the event object approach is performance (similar to that of the callback approach): no additional objects are constructed by the framework, and the application only creates objects if it needs them. And the main benefit over the callback approach is simplicity of access by the application: no need to register callback handlers, no "Hollywood principle" ("don't call us, we'll call you"), just simple iteration over events using the cursor.

Jackson uses the third approach, exposing a logical cursor as the "JsonParser" object. This choice was made to combine convenience and efficiency (the other choices would offer one but not both). The entity used as the cursor is named "parser" (instead of something like "reader") to align closely with the Json specification; the same principle is followed by the rest of the API (so a structured set of key/value fields is called an "Object", and a sequence of values an "Array" -- alternate names might make sense, but it seemed like a good idea to try to be compatible with the data format specification first!).

To iterate the stream, the application advances the cursor by calling "JsonParser.nextToken()" (Jackson prefers the term "token" over "event"). And to access data and properties of the token the cursor points to, it calls one of the accessors that refer to properties of the currently pointed-to token. This design was inspired by the Stax API (which is used for processing XML content), but modified to better reflect specific features of Json.
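To make the cursor idiom concrete, here is a minimal sketch (a hypothetical TokenDump class; package names assume Jackson's org.codehaus.jackson stream API) that just collects the token sequence of a document:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;

public class TokenDump
{
    // Collects the token sequence for a Json document: nextToken()
    // advances the cursor, and returns null once the input ends.
    static List<JsonToken> tokens(String json) throws IOException
    {
        JsonParser jp = new JsonFactory().createJsonParser(new StringReader(json));
        List<JsonToken> result = new ArrayList<JsonToken>();
        JsonToken t;
        while ((t = jp.nextToken()) != null) {
            result.add(t);
        }
        jp.close();
        return result;
    }

    public static void main(String[] args) throws IOException
    {
        // {"id":3} tokenizes as: START_OBJECT, FIELD_NAME, VALUE_NUMBER_INT, END_OBJECT
        System.out.println(tokens("{\"id\":3}"));
    }
}
```

Note that the cursor visits every structural boundary (START_OBJECT, END_OBJECT) as well as every field name and value, which is exactly what the loop in the full example below relies on.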

So the basic idea is pretty simple. But to give a better idea of the details, let's make up an example. This one is based on the Json data format described at http://apiwiki.twitter.com/Search+API+Documentation (and uses the first record entry of the sample document too), but with some simplifications (omitting fields, renaming).

  "text":"@stroughtonsmith You need to add a \"Favourites\" tab to TC/iPhone. Like what TwitterFon did. I can't WAIT for your Twitter App!! :) Any ETA?",

And to contain data parsed from this Json content, let's use a container Bean like this:

public class TwitterEntry
{
  long _id;
  String _text;
  int _fromUserId, _toUserId;
  String _languageCode;

  public TwitterEntry() { }

  public void setId(long id) { _id = id; }
  public void setText(String text) { _text = text; }
  public void setFromUserId(int id) { _fromUserId = id; }
  public void setToUserId(int id) { _toUserId = id; }
  public void setLanguageCode(String languageCode) { _languageCode = languageCode; }

  public long getId() { return _id; }
  public String getText() { return _text; }
  public int getFromUserId() { return _fromUserId; }
  public int getToUserId() { return _toUserId; }
  public String getLanguageCode() { return _languageCode; }

  public String toString() {
    return "[Tweet, id: "+_id+", text='"+_text+"', from: "+_fromUserId+", to: "+_toUserId+", lang: "+_languageCode+"]";
  }
}

With this setup let's try creating an instance of this Bean from sample data above.

First, here is a method that can read Json content via event stream and populate the bean:

 TwitterEntry read(JsonParser jp) throws IOException
 {
  // Sanity check: verify that we got "Json Object":
  if (jp.nextToken() != JsonToken.START_OBJECT) {
    throw new IOException("Expected data to start with an Object");
  }
  TwitterEntry result = new TwitterEntry();
  // Iterate over object fields:
  while (jp.nextToken() != JsonToken.END_OBJECT) {
   String fieldName = jp.getCurrentName();
   // Let's move to value
   jp.nextToken();
   if (fieldName.equals("id")) {
    result.setId(jp.getLongValue());
   } else if (fieldName.equals("text")) {
    result.setText(jp.getText());
   } else if (fieldName.equals("fromUserId")) {
    result.setFromUserId(jp.getIntValue());
   } else if (fieldName.equals("toUserId")) {
    result.setToUserId(jp.getIntValue());
   } else if (fieldName.equals("languageCode")) {
    result.setLanguageCode(jp.getText());
   } else { // ignore, or signal error?
    throw new IOException("Unrecognized field '"+fieldName+"'");
   }
  }
  jp.close(); // important to close both parser and underlying File reader
  return result;
 }

And can be invoked as follows:

  JsonFactory jsonF = new JsonFactory();
  JsonParser jp = jsonF.createJsonParser(new File("input.json"));
  TwitterEntry entry = read(jp);

Ok, now that's quite a bit of code for a relatively simple operation. On the plus side, it is simple to follow: even if you have never worked with Jackson or the Json format (or maybe even Java) it should be easy to grasp what is going on and modify the code as necessary. So basically it is "monkey code" -- easy to read, write and modify, but tedious, boring and in its own way error-prone (because it is boring).
Another, and perhaps more important, benefit is that this is actually very fast: the framework adds very little overhead, as benchmarking will confirm. And finally, processing is fully streaming: the parser (and the generator too) only keeps track of the data that the logical cursor currently points to (plus just a little bit of context information for nesting, input line numbers and such).
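The "fully streaming" point can be made concrete with a sketch (a hypothetical countLongTexts helper, again assuming the org.codehaus.jackson package names) that scans an arbitrarily long Json array of entries while holding only one value at a time in memory:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;

public class StreamingCount
{
    // Counts entries whose "text" field exceeds maxLen characters. Memory
    // use is constant regardless of how many entries the array contains,
    // because no object hierarchy is ever built.
    // (Assumes flat entries: all field values are scalars.)
    static int countLongTexts(Reader src, int maxLen) throws IOException
    {
        JsonParser jp = new JsonFactory().createJsonParser(src);
        if (jp.nextToken() != JsonToken.START_ARRAY) {
            throw new IOException("Expected data to start with an Array");
        }
        int count = 0;
        while (jp.nextToken() == JsonToken.START_OBJECT) { // one entry at a time
            while (jp.nextToken() != JsonToken.END_OBJECT) {
                String fieldName = jp.getCurrentName();
                jp.nextToken(); // move to the value
                if ("text".equals(fieldName) && jp.getText().length() > maxLen) {
                    ++count;
                }
            }
        }
        jp.close();
        return count;
    }

    public static void main(String[] args) throws IOException
    {
        String json = "[{\"text\":\"short\"},{\"text\":\"a rather longer tweet\"}]";
        System.out.println(countLongTexts(new StringReader(json), 10)); // prints 1
    }
}
```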

The example above hints at one possible use case for "raw" streaming access to Json: places where performance really matters. Another case may be where the structure of the content is highly irregular, so that more automated approaches would not work (why this is the case will become clearer in the follow-up articles; for now I just make the claim), or where there is a high impedance mismatch between the structure of the data and that of the objects.

2. Writing to Stream-of-Events

Ok, so reading content using Stream-of-Events is a simple but laborious process. It should be no surprise that writing content is about the same, albeit with maybe just a little bit less unnecessary work. Given that we now have a Bean, constructed from Json content, we might as well try writing it back (after being, perhaps, modified in between). So here is the method for writing a Bean as Json:

private void write(JsonGenerator jg, TwitterEntry entry) throws IOException
{
  jg.writeStartObject();
  // can either do "jg.writeFieldName(...) + jg.writeNumber()", or this:
  jg.writeNumberField("id", entry.getId());
  jg.writeStringField("text", entry.getText());
  jg.writeNumberField("fromUserId", entry.getFromUserId());
  jg.writeNumberField("toUserId", entry.getToUserId());
  jg.writeStringField("languageCode", entry.getLanguageCode());
  jg.writeEndObject();
  jg.close();
}
And here is the code to call the method:
  // let's write to a file, using UTF-8 encoding (only sensible one)
  JsonGenerator jg = jsonF.createJsonGenerator(new File("result.json"), JsonEncoding.UTF8);
  jg.useDefaultPrettyPrinter(); // enable indentation just to make debug/testing easier
  write(jg, entry);

Pretty simple eh? Neither challenging nor particularly tricky to write.
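As a variation on the same theme, here is a small self-contained sketch (a hypothetical WriteArrayDemo class, org.codehaus.jackson package names assumed) that writes a Json array of Objects into a String; it also shows that output can go to any Writer, not just a File:

```java
import java.io.IOException;
import java.io.StringWriter;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonGenerator;

public class WriteArrayDemo
{
    // Writes [{"id":...},{"id":...},...] into a String; the generator
    // tracks nesting, so matching writeStartArray()/writeEndArray() and
    // writeStartObject()/writeEndObject() calls is all we need to do.
    static String writeIds(long[] ids) throws IOException
    {
        StringWriter sw = new StringWriter();
        JsonGenerator jg = new JsonFactory().createJsonGenerator(sw);
        jg.writeStartArray();
        for (long id : ids) {
            jg.writeStartObject();
            jg.writeNumberField("id", id);
            jg.writeEndObject();
        }
        jg.writeEndArray();
        jg.close(); // flushes buffered output into the writer
        return sw.toString();
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println(writeIds(new long[] { 1, 2 }));
        // prints [{"id":1},{"id":2}]
    }
}
```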

3. Conclusions

So as can be seen from the above, using the basic Stream-of-Events abstraction is quite a primitive way to process Json content. This brings both benefits (very fast; fully streaming, with no need to build or keep an object hierarchy in memory; easy to see exactly what is going on) and drawbacks (verbose, repetitive code).

But regardless of whether you will ever use this API, it is good to at least be aware of how it works, because it is what the other interfaces build on: data mapping and tree building both internally use the raw streaming API to read and write Json content.

And next: let's have a look at a more refined method to process Json: Data Binding... stay tuned!
