Thursday, January 29, 2009

Tatu (better known as Che Guevara)

Ok, this may be news to some of you. To be honest, it was a bit of news to me too. But it must be true, I found it on Wikipedia!

Apparently "Tatu" was the codename of mr Che Guevara during his stint in Congo, before embarking on the faithful mission in south America. Wow. I had no idea that he was a fan of mine! It is somewhat less surprising that my namesake is a world-renowned "Kinbaku performer".

And yes, oh yes, there is also a Russian pop band that took my name too (why couldn't they just name themselves like, I don't know, "Pop tarts" or something? -- ba-da-boom, thank you thank you, I'll be here the whole week! Tip the cows or something!).

Wednesday, January 28, 2009

Ecology: Don't have a cow, man (seriously!)

From the February issue of Scientific American (yes, they must have a time machine to send this issue from the future), here's another interesting environmental factoid: red meat produces biggie-size environmental problems in addition to clogging your arteries:

The Greenhouse Hamburger

(I so wish there was a way to deep-link into the article, but I guess that's reserved for subscribers; or maybe I just haven't yet found the right click path?)

Or rather, what is surprising is the scale of the thing: I was well aware of the methane production part of bovine-based agriculture, which does contribute a significant punch to the greenhouse effect, along with rice paddies (cows also contribute urban legend pearls such as "Cows produce enough methane to fill a zeppelin each day [or week?]" (patently untrue!), but I digress).

But I did not know that there are estimates that put beef production (or is it the whole food production chain? the article is ambiguous on this point) at 14-22% of TOTAL greenhouse gas generation, ranking alongside transportation and industry. Another way to put things into perspective is to consider that pound-for-pound cow meat produces 13x as much greenhouse impact as chicken meat; and the ratio to "Idaho beef" (aka the spud) is 57x.

This seems like one good reason to further develop my latest gastrological masterpiece, "Spamofy Half-and-Half pasta". Try it, it actually is pretty good: a pasta sauce where the protein comes evenly split between Spam and Tofu. Not only healthier for you & the planet, but also tasty, and less salty (sodium's dangers are vastly exaggerated, but it does cause bloating if nothing else). :-)

What, me Fast?

Ok, let's have a look at one sighting of the "Linux and Java run circles around .Net" phenomenon:

http://technotes.blogs.sapo.pt/1391.html

Looks like the combination of Linux, Mina (or Grizzly), JiBX, and our favorite "new kid on the block", the lightning-fast xml parser Aalto, can haul some serious xml data.

It would of course be nice to know how much of this throughput can be attributed to Aalto. My bet is, "quite a lot", but I might not be quite objective enough here. At any rate, cool results, I hope these can be verified by others!

This is actually the second sighting of Aalto being used for something; the first one was also a benchmark. Not quite production use yet (and I wouldn't recommend production use before the 1.0 release), but some serious evaluation at least.

ps. while Aalto may be faster, Woodstox is still the King of Open Source Java Xml parsing, and can also pack a punch when there's Need for Speed.

Tuesday, January 27, 2009

Egology aka "Google and I"

Something I used to do occasionally, back in the day, was to track what I had been up to lately. Being a lazy bum, I did this using Google as a free tracking device. A lazy guy's method of doing this is to google with one's name as the search phrase, and see what surfaces. This obviously only works for those of us with funny or weird names (sorry Paul, you are so out of luck!).
This way you will see a glimpse of your whereabouts as seen by the online world.

I have not been doing this for a while (a year or two?) now. Not only is it tacky, just an electronic means of navel-gazing, but worse, I apparently wasn't doing a whole lot, based on the results. Or maybe it's just that Google wasn't paying attention (ha!). It was like watching paint dry, or perhaps grass grow. Or me running out of analogies to use. The top hits returned too often pointed to my even-then-obsolete old home page (at my Alma Mater, which I had left years prior) that was "still in progress", as well as to mailing lists of dead projects.

But over the weekend I decided to do one more peek via Google-o-matic, to get a retrospective of the Year 2008 by and according to Tatu.

Lo and behold! Things had changed since the last check (whenever it was). No more stale entries within top pages (bye bye Niksula home page!) -- whether that's due to Google getting better, or Helsinki University of Technology finally reclaiming wasted disk space. Fewer high-ranking links to mailing lists from late 90s. Life is good.

But beyond this, there were all kinds of interesting (... to me, anyways) as-of-yet-unknown-to-me tidbits that I can now start dropping in casual conversations (and especially non-casual ones!). Plus some other factoids I had briefly seen and promptly forgotten about.

Here are some of the least unnoteworthy nuggets I ended up with:

  • Once upon a time, I contributed a minor patch to Lucene (query parser refactoring). Ditto for Kaffe, JDBM, TagSoup and XStream. The neat thing is that these are all cool projects; with the possible exception of Kaffe, which is (or used to be? hey, is it alive again?) a dead if neat project.
  • I was (am?) considered an ActiveMQ contributor (I am flattered by this, but not quite sure what I did -- maybe they use Woodstox? -- thanks anyway James!)
  • Rock star references: while I am not yet considered a Rock Star Programmer (is there a Hall of Fame for coders? haven't gotten a call so far! And does one have to be dead to get inducted there? hope not?), I have racked up multiple (as in, two!) quotes from certified RSPs: thank you Dan and Kohsuke, you do Rock! (I suddenly feel "Almost Famous" -- kind of like those bands that no one ever listened to, except a few guys from bands like The Beatles and the Rolling Stones -- but in a way it's much better as I have a nice job, and don't starve to death or overdose on drugs, so I shouldn't really whine)
  • I am a lucky GAP winner (ok ok, I already blogged about this one earlier)
  • I am referenced by 2 actual physical books, sold by Amazon: "Secrets of the Rock Star Programmers" (see reference 2 bullets up), and "Professional XML (Programmer to Programmer)" (this wrt Woodstox). These could come in handy during job interviews: as Jason Hunter likes to say, there's no better answer to the question "what have you been up to" than pointing to a book on the manager's bookshelf and asking him/her to compare the author name with the candidate name. Well, I'll accept the lesser fame of being mentioned by an author as the consolation prize. Maybe "let's search Amazon to see where my name pops up" could come in handy one of these days.

There is of course lots of other miscellaneous business-as-usual stuff out there:

  • links to the Open Source projects that I'm most involved with (Woodstox, Jackson, StaxMate, Java UUID Generator)
  • a reference or two to my Master's Thesis (in publication archives; I don't think any actual publication refers to it, very low impact factor... I never was much of an academic! :-)
  • links to mailing list archives of n+1 other open source projects, and such.

I also have to say I am quite impressed by what Google can gather, and especially what it can weed out as duplicate entries. I only had to waste half an hour of my life to gather the list above. :-)

At any rate, it does appear that the year 2008 was an eventful one for me after all.

May we live in interesting times during 2009 as well!

Monday, January 26, 2009

Eco^2 (Economy + Ecology) Rulez Ok?

Let's start with the money shot: here's the link that prompted me to write this particular entry:

http://www.energystar.gov/index.cfm?c=sb_success.sb_successstories2008_johnsonbraund

The reason I really like things like this is that they combine two important but often conflicting aspects: the economy of the project, and its ecological impact. This is just one random link, but one can't read a respectable magazine like, say, Fortune, without spotting one or two similar stories each time. That's awesome.

I have been a closet environmentalist for years, especially after moving from western Europe to the US in the late 90s. That is rather typical: most people who have grown up in a Lutheran, reason- and reasonability-loving ("everything in moderation") society should feel similar nausea over this macabre era of McMansions and big fugly cars (SUVs).

But lately -- as in the past two years or so -- there has been a remarkable change in the air... a wind of change, to use a cliche. From around the time of the release of that "Al Gore movie", things have finally started moving in the right direction here on the left side of the Pond. Finally! I truly believe that Churchill was right with his witty and insightful quote: "The Americans will always do the right thing . . . After they've exhausted all the alternatives". That was within the context of WW2, but it applies equally well to US handling of global warming, or more generally to pollution as a global problem. And while some see this comment as pessimistic ("do these blubbering idiots always have to try every wrong approach first?"), I view it as optimistic. After all, not everyone does the right thing in the end, no matter what. Plus, Americans as a nation tend to follow through; or at least have with major undertakings such as, well, world wars. And then, as now, the strongest engine around these parts is the industro-economic one. Let's hope the big wheel will start turning for good.

About the only major remaining obstacle now is the excuse-inducing "we must get all countries to agree to act on this" attitude -- screw that, let's get to work! The rest can follow us, and we can follow, say, Germany, Denmark and Spain.

Sunday, January 25, 2009

Json processing with Jackson: Method #3/3: Tree Traversal

Update, 06-Mar-2009: Alas, code example will not work with Jackson 0.9.9 or above due to API changes; check out javadocs for replacements until I get a chance to rewrite the example

(for background, refer to the earlier "Three Ways to Process Json" entry)

Now that we have both the low-level (event streams) and high-level (data binding) approaches covered, let's consider the third and last alternative: that of using a tree model for traversing over Json content.

So what is the Tree Model that is traversed? It is a tree built from Json content. The tree consists of parent-child linked nodes that represent Json constructs such as Arrays ("[ ... ]"), Objects ("{ ... }") and values (true, false, Strings, numbers, nulls). This is similar to the case of xml, where DOM is the "standard" tree model and there are many alternative tree models (such as JDOM, dom4j, XOM) available as well.
This tree can then be traversed, data within accessed, possibly modified and written back out as Json.

Before discussing the approach in more detail, let's have a look at some sample code.

1. Sample Usage

Since it is difficult to demonstrate the actual benefits of the approach with simple structures (like the Twitter search entry shown earlier), let's consider something more complicated. The following made-up example of a collection of customer records will have to do:


[
 {
  "name" : {
    "first : "Mortimer",
    "middleInitial : "m",
    "last" : "Moneybags" 
  },
  "address" : {
    "street" : "1729 Opulent Street",
    "zipcode" : 98040,
    "state" : "WA"
  },
  "contactMethods" : [
    { "type" : "phone/home", "ref" : "206-232-1234" },
    { "type" : "phone/work", "ref" : "303-123-4567" }
  ]
 }
// (rest of entries omitted to save space)
]

Let us consider a case where we want to go through all customer entries, and extract some data out of each. Additionally we will add an "email" contact method for each entry, assuming none exist before changes (to simplify code).


TreeMapper mapper = new TreeMapper();
JsonNode root = mapper.readTree(new File("customers.json"));
// we'll get a "org.codehaus.jackson.map.node.ArrayNode" instance for json array, but no need for casts
for (JsonNode customerNode : root) {
  // we know "first" always exists if "name" exists, and is a TextNode (if not, could use 'getValueAsText')
  // (note: could use 'getElementValue' instead of 'getPath', but it's good practice to use getPath())
  String firstName = customerNode.getPath("name").getFieldValue("first").getTextValue();
  // has an address? (could also just use 'getPath()' which returns 'missing' node)
  int zip = -1;
  if (customerNode.getFieldValue("address") != null) {
    zip = customerNode.getFieldValue("address").getFieldValue("zipcode").getIntValue();
  }
  // either way, let's add email contact (that is assumed to be missing)
  ObjectNode email = mapper.objectNode();
  email.setElement("type", mapper.textNode("email"));
  email.setElement("ref", mapper.textNode(firstName+"_"+zip+"@foobar.com"));
  customerNode.getPath("address").appendElement(email);
}

So here we have something to give an idea of what tree traversal code may look like. Let's go back to conceptual musing for a while, before returning to practical concerns.

2. Differences between Tree Model and Data Binding

At first this approach may appear quite similar to data binding: after all, a bunch of interconnected objects is created from Json, to be traversed, accessed, modified and possibly written out as Json again. But whereas data binding converts Json into Java objects (and vice versa), the Tree Model represents the Json content itself. Tree models are a true native representation of Json content, and somewhat removed from "real" Java objects: their only purpose is to allow more convenient access to Json than event streams. There is no business functionality involved with the generic node objects. Also, the types available are limited to the ones that Json natively supports. One important benefit is that there is a simple, efficient and reliable one-to-one mapping between the tree model and Json, which means that there is no loss of information when reading Json into the tree model or writing the tree model out as Json; and that such a transformation is always possible. This is different from data binding, where some conversions may not be possible, or need extra configuration and coding to happen.

Rather than the regular java objects that data binding operates on, the tree model is quite similar to the "Poor Man's Object", the plain old HashMap. HashMaps are often used by developers when they don't think they need a "real" object (or don't want to define Yet Another Class, etc). The same benefits and challenges apply to tree models as to using HashMaps as flexible and sometimes convenient alternatives to specific Java classes.

3. Benefits

Given the above description of what the tree model is, what could be the reasons to use it over data binding? Here are some common ones (the second and third points are illustrated by the sketch after the list):

  • Since we do not need specific Java objects to bind to, there may be less code to write. Although access may not be as convenient, for simple tasks (especially for "throw-away" code) it is nice not to have to implement boring bean setter/getter code.
  • If the structure of Json content is highly irregular, it may be difficult (or impossible) to find or create equivalent Java object structure. Tree model may be the only practical choice.
  • For displaying arbitrary Json content (in, say, a Json editor) no typing is generally available, but it is quite easy to render a tree. The tree model is a natural choice for internal access and manipulation.
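
To illustrate the last two points, here is a minimal sketch of handling content whose structure is not known up front, without defining any Java classes. It uses the same pre-0.9.9 TreeMapper/JsonNode accessors as the example above, plus a made-up "mixed.json" input:

 TreeMapper mapper = new TreeMapper();
 JsonNode root = mapper.readTree(new File("mixed.json"));
 for (JsonNode node : root) {
   // getPath() returns a "missing" node (not null) when a field does not exist,
   // so chained access is safe even for entries that lack a "name" object
   String firstName = node.getPath("name").getPath("first").getTextValue();
   if (firstName == null) {
     System.out.println("entry without a name: " + node);
   } else {
     System.out.println("entry for " + firstName);
   }
 }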

One analogy is that of contrasting dynamic scripting languages (like Ruby, Python or Javascript) and statically typed languages such as Java: Tree Model would be similar to scripting languages, whereas data binding would be similar to Java.

4. Drawbacks

There are also drawbacks, including:

  • Since access is mostly untyped, many problems that would be found with the typed alternative (data binding) may go unnoticed during development
  • Memory usage is proportional to content mapped (similar to data binding), so tree models can not be used with huge Json content, unless mapping is done chunk at a time. This is the same problem that data binding encounters; and sometimes the solution is to use Stream-of-Events instead.
  • For some uses, the additional memory usage and processing overhead is unnecessary: specifically, when only generating (writing) Json, there is often no need to build an in-memory tree (or objects with data binding) if only Json output is needed. Instead, the Stream-of-Events approach is the best choice.
  • Using Tree Model often leads to either procedural (non-object-oriented) code, or having to wrap pieces of Tree Model in specific Java classes; at which point more code gets written for little gain (compared to regular objects used with data binding)

In general it is good to be clear on why tree model is used over other alternatives: experience with xml processing often leads developers to be too eager to use tree-based processing for all tasks, even when it is not the best choice.

5. Future Plans

Since Jackson API is still evolving, there are many things within TreeMapper and JsonNode APIs that could and will be improved. More convenience methods will be added to simplify construction of new nodes, and to support common access patterns.

One specifically promising un-implemented idea is that of defining a Path or Query language (think of XPath and XQuery). There is a good chance that something like this gets implemented. There have been proposals (such as JsonPath); these may form the basis of the access language.
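
To make the idea a bit more concrete, here is a purely hypothetical contrast between today's chained navigation and what expression-based access might look like (the method name and expression syntax below are invented for illustration; nothing like this exists in the API yet):

 // today: explicit chained navigation
 int zip = customerNode.getPath("address").getPath("zipcode").getIntValue();

 // maybe some day: an XPath-like expression evaluated against the tree
 // (hypothetical method and syntax, shown only to illustrate the idea)
 int zip2 = customerNode.queryPath("/address/zipcode").getIntValue();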

6. Next?

After reviewing the 3 canonical approaches, it is time to suggest guidelines for choosing between them.
Stay tuned!

Tuesday, January 20, 2009

Json processing with Jackson: Method #2/3: Data Binding

(for background, refer to the earlier "Three Ways to Process Json" entry)

After reviewing the first "canonical" Json processing method (reading/writing a Stream of Events), let's go up the abstraction level, and consider the second approach: that of binding (aka mapping) data between Java Objects and Json content. That is, given Json content in some form (like, say, a Stream-of-Events...), the library can construct equivalent Java objects; and conversely it can write Json content given Java objects. And do this without explicit instructions for low-level read/write operations (such as the code from the preceding blog entry). This is often the most "natural" approach for Java programmers, since it is Object-centric. The approach is often referred to as "code-first" in related contexts, such as when discussing methods to process xml content.

Jackson's Data Binding support comes through a single mapper object, org.codehaus.jackson.map.ObjectMapper. It can be used to read Json content and construct Java object(s); or conversely to write Json content that describes a given Object. The design is quite similar to what XStream or JAXB does with xml. The main differences (beyond the data format used) are conceptual -- XStream focuses on Object serialization, Jackson on data binding; and JAXB2 supports both "schema-first" and "code-first" approaches (and maybe emphasizes the former more) whereas Jackson does not use schemas of any kind. But the similarities are still more striking than the differences.

So much for the background: let's have a look at how things work, by using the Data Binding interface to do the same work as was done in the first entry using the Stream-of-Events abstraction.

1. Reading Objects from Json

Ok. Given that our first example needed about two dozen lines of code, how much code might we need here? It should be less, to support the claim of being more convenient. How about:

  ObjectMapper mapper = new ObjectMapper();
  TwitterEntry entry = mapper.readValue(new File("input.json"), TwitterEntry.class);

... two? I guess you could make it a one-liner too; or, if you want to separate the pieces out more, half a dozen. But definitely much less than the manual approach. And the difference only grows when considering more complex objects and object graphs: whereas manual serialization needs more and more code, data binding code may not grow at all. Sometimes you may need to configure the mapper more to deal with edge cases, or add annotations to support non-standard naming; but even then it is just a fraction of the code to write.

Here are some more examples, just to show how to do simple things:

  Boolean yesOrNo = mapper.readValue("true", Boolean.class); // returns Boolean.TRUE
  int[] ids = mapper.readValue("[ 1, 3, 98 ]", int[].class); // new int[] { 1, 3, 98 }
  // trickier, due to Type Erasure: full generic type has to be passed via a TypeReference
  Map<String, List<String>> dictionary = mapper.readValue("{ \"word\" : [ \"synonym1\", \"synonym2\" ] }",
      new TypeReference<Map<String, List<String>>>() { });
  // will return a List with Integer(1), Boolean.TRUE and null as its elements
  Object misc = mapper.readValue("[ 1, true, null ]", Object.class);
  // and here's something different: instead of TwitterEntry, let's claim the content is a Map!
  Map<String,Object> entryAsMap = mapper.readValue(new File("input.json"),
      new TypeReference<Map<String,Object>>() { }); // works!
  Map<String,Object> entryAsMap2 = (Map<String,Object>) mapper.readValue(new File("input.json"), Object.class); // as does this

Of these, only the last two examples may seem surprising: didn't we actually serialize a bean... so how can it become a Map? Because there is no such thing as a type (java class) in Json content: ObjectMapper does its best to map Json content to the specified Java type, and in general, Objects can be viewed as a sort of "static Map". Hence it is perfectly fine to "map to a Map" here. And finally, ObjectMapper also has a sort of special handling for the base type "Object.class": it signals that the mapper is to use whatever Objects best match the Json content in question. For Strings this means Strings, for booleans java.lang.Boolean, for Json arrays java.util.List and for Json object structures java.util.Map. In this case it works similarly to explicitly specifying the result to be of type Map.

2. Writing Objects as Json

Given how simple it was to read Java objects from Json, how hard can it be to write them?
Not very:

  mapper.writeValue(new File("result.json"), entry);

In fact, I claim it is pathetically easy. So much for job security!

3. Where's the Catch?

Given how much simpler data binding appears compared to writing equivalent code by hand, why should anyone ever again write code to read or write Json (or xml) by hand? There are some legitimate reasons:

  • The primary problem is that data binding introduces tight coupling between the data format and Java objects: if one changes, the other must change too. Sometimes this is ok: both can be modified. In other cases it is problematic: you may not be in a position to control such changes. And while there are ways to configure the binding, override functionality and add handlers, there are diminishing returns: at some point it might be better to just bite the bullet and handle it all programmatically.
  • Efficiency may be problematic too: some data binding packages introduce significant overhead (speed, memory usage). Fortunately Jackson is not "one of those packages": the additional overhead is modest, often in the 15-20% range.
  • Data binding is fundamentally non-streaming: so this approach does not work for huge data streams, without some modifications.

Of these, the second and third can usually be resolved: performance may not be a problem to begin with, and partial streaming (chunking) can be achieved by binding sub-sections of content at a time, not the whole document, as long as there are suitable sections that can be processed independently.
For example:

 String doc = "[ 1, 2, 3, 4 ]";
 JsonParser jp = new JsonFactory().createJsonParser(doc);
 ObjectMapper mapper = new ObjectMapper();

 jp.nextToken(); // START_ARRAY
 while (jp.nextToken() != JsonToken.END_ARRAY) {
   Integer value = mapper.readValue(jp, Integer.class);
   // after the call, the parser points to the last token used for the Object, i.e. VALUE_NUMBER_INT itself
 }
 // and would work equally well with beans

would map each integer value one by one, separately. The same approach would obviously work with individual beans, Lists and Maps as well (see the sketch below).
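
For example, binding the entries of a large Json array one bean at a time might look like this (a minimal sketch, assuming a made-up "tweets.json" file that contains a Json array of entries like the one used in the Stream-of-Events entry):

 JsonParser jp = new JsonFactory().createJsonParser(new File("tweets.json"));
 ObjectMapper mapper = new ObjectMapper();

 jp.nextToken(); // START_ARRAY
 while (jp.nextToken() != JsonToken.END_ARRAY) {
   // bind one entry at a time: only a single TwitterEntry is in memory at any point
   TwitterEntry entry = mapper.readValue(jp, TwitterEntry.class);
   System.out.println("Read a tweet from user " + entry.getFromUserId());
 }
 jp.close();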

So this leaves the main problem: that of highly dynamic, non-structured or dynamically typed content. This is where the last processing approach may come in handy.... and that will be the subject of my next sermon. Drive safely!

Json processing with Jackson: Method #1/3: Reading and Writing Event Streams

(for background, refer to the earlier "Three Ways to Process Json" entry)

To continue with the thesis of "exactly 3 methods to process structured data formats (including Json)", let's have a look at the first alleged method, "Iterating over Event Streams" (for reading; and "Writing to an Event Stream" for writing).
I must have already written a bit about this approach, given that it is the approach Jackson has used from the very beginning. But, as the Romans put it: "Repetitio est mater studiorum". So let's have a (yet another) look at how Jackson allows applications to process Json content via the Stream-of-Events (SoE?) abstraction.

1. Reading from Stream-of-Events

Since Stream-of-Events is just a logical abstraction, not a concrete thing, the first thing to decide is how to expose it. There are multiple possibilities; and here too there are 3 commonly used alternatives:

  1. As an iterable stream of Event Objects. This is the approach taken by the Stax Event API. Benefits include simplicity of access, and object encapsulation which allows holding onto Event objects during processing.
  2. As callbacks that denote Events as they happen, passing all data as callback arguments. This is the approach the SAX API uses. It is highly performant and type-safe (each callback method, one per event type, can have distinct arguments) but may be cumbersome to use from the application perspective.
  3. As a logical cursor that allows accessing concrete data regarding one event at a time. This is the approach taken by the Stax Cursor API. The main benefit over the event objects approach is performance (similar to that of the callback approach): no additional objects are constructed by the framework, and the application only creates objects if it needs any. And the main benefit over the callback approach is simplicity of access by the application: no need to register callback handlers, no "Hollywood principle" (don't call us, we'll call you), just simple iteration over events using the cursor.

Jackson uses the third approach, exposing a logical cursor as the "JsonParser" object. This choice was made to combine convenience and efficiency (the other approaches would offer one but not both). The entity used as the cursor is named "parser" (instead of something like "reader") to closely align with the Json specification; the same principle is followed by the rest of the API (so a structured set of key/value fields is called an "Object", and a sequence of values an "Array" -- alternate names might make sense, but it seemed like a good idea to try to be compatible with the data format specification first!).

To iterate the stream, the application advances the cursor by calling "JsonParser.nextToken()" (Jackson prefers the term "token" over "event"). And to access data and properties of the token the cursor points to, it calls one of the accessors, which refer to properties of the currently pointed-to token. This design was inspired by the Stax API (which is used for processing XML content), but modified to better reflect specific features of Json.

So the basic idea is pretty simple. But to give a better idea of the details, let's make up an example. This one will be based on the Json-based data format described at http://apiwiki.twitter.com/Search+API+Documentation (and using the first record entry of the sample document too), but with some simplifications (omitting fields, renaming).

{
  "id":1125687077,
  "text":"@stroughtonsmith You need to add a \"Favourites\" tab to TC/iPhone. Like what TwitterFon did. I can't WAIT for your Twitter App!! :) Any ETA?",
  "fromUserId":855523, 
  "toUserId":815309,
  "languageCode":"en"
}

And to contain data parsed from this Json content, let's use a container Bean like this:

public class TwitterEntry
{
  long _id;  
  String _text;
  int _fromUserId, _toUserId;
  String _languageCode;

  public TwitterEntry() { }

  public void setId(long id) { _id = id; }
  public void setText(String text) { _text = text; }
  public void setFromUserId(int id) { _fromUserId = id; }
  public void setToUserId(int id) { _toUserId = id; }
  public void setLanguageCode(String languageCode) { _languageCode = languageCode; }

  public long getId() { return _id; }
  public String getText() { return _text; }
  public int getFromUserId() { return _fromUserId; }
  public int getToUserId() { return _toUserId; }
  public String getLanguageCode() { return _languageCode; }

  public String toString() {
    return "[Tweet, id: "+_id+", text='";+_text+"', from: "+_fromUserId+", to: "+_toUserId+", lang: "+_languageCode+"]";
  }
}

With this setup let's try creating an instance of this Bean from sample data above.

First, here is a method that can read Json content via event stream and populate the bean:

 TwitterEntry read(JsonParser jp) throws IOException
 {
  // Sanity check: verify that we got "Json Object":
  if (jp.nextToken() != JsonToken.START_OBJECT) {
    throw new IOException("Expected data to start with an Object");
  }
  TwitterEntry result = new TwitterEntry();
  // Iterate over object fields:
  while (jp.nextToken() != JsonToken.END_OBJECT) {
   String fieldName = jp.getCurrentName();
   // Let's move to value
   jp.nextToken();
   if (fieldName.equals("id")) {
    result.setId(jp.getLongValue());
   } else if (fieldName.equals("text")) {
    result.setText(jp.getText());
   } else if (fieldName.equals("fromUserId")) {
    result.setFromUserId(jp.getIntValue());
   } else if (fieldName.equals("toUserId")) {
    result.setToUserId(jp.getIntValue());
   } else if (fieldName.equals("languageCode")) {
    result.setLanguageCode(jp.getText());
   } else { // ignore, or signal error?
    throw new IOException("Unrecognized field '"+fieldName+"'");
   }
  }
  jp.close(); // important to close both parser and underlying File reader
  return result;
 }

And can be invoked as follows:

  JsonFactory jsonF = new JsonFactory();
  JsonParser jp = jsonF.createJsonParser(new File("input.json"));
  TwitterEntry entry = read(jp);

Ok, now that's quite a bit of code for a relatively simple operation. On the plus side, it is simple to follow: even if you have never worked with Jackson or the json format (or maybe even Java) it should be easy to grasp what is going on and modify the code as necessary. So basically it is "monkey code" -- easy to read, write, modify, but tedious, boring and in its own way error-prone (because of being boring).
Another and perhaps more important benefit is that this is actually very fast: there is very little overhead and it does run fast if you bother to benchmark it. And finally, processing is fully streaming: parser (and generator too) only keeps track of the data that the logical cursor currently points to (and just a little bit of context information for nesting, input line numbers and such).

The example above hints at a possible use case for "raw" streaming access to Json: places where performance really matters. Another case may be where the structure of the content is highly irregular, and more automated approaches would not work (why this is the case becomes clearer with follow-up articles: for now I just make the claim), or where the structure of data and objects has high impedance.

2. Writing to Stream-of-Events

Ok, so reading content using Stream-of-Events is a simple but laborious process. It should be no surprise that writing content is about the same; albeit with maybe just a little bit less unnecessary work. Given that we now have a Bean, constructed from Json content, we might as well try writing it back (after being, perhaps, modified in between). So here's the method for writing a Bean as Json:


private void write(JsonGenerator jg, TwitterEntry entry) throws IOException
{
  jg.writeStartObject();
  // can either do "jg.writeFieldName(...) + jg.writeNumber()", or this:
  jg.writeNumberField("id", entry.getId());
  jg.writeStringField("text", entry.getText());
  jg.writeNumberField("fromUserId", entry.getFromUserId());
  jg.writeNumberField("toUserId", entry.getToUserId());
  jg.writeStringField("languageCode", entry.getLanguageCode());
  jg.writeEndObject();
  jg.close();
}

And here is the code to call the method:
  // let's write to a file, using UTF-8 encoding (only sensible one)
  JsonGenerator jg = jsonF.createJsonGenerator(new File("result.json"), JsonEncoding.UTF8);
  jg.useDefaultPrettyPrinter(); // enable indentation just to make debug/testing easier
  write(jg, entry);

Pretty simple eh? Neither challenging nor particularly tricky to write.

3. Conclusions

So as can be seen from the above, using the basic Stream-of-Events approach is a quite primitive way to process Json content. This results in both benefits (very fast; fully streaming [no need to build or keep an object hierarchy in memory]; easy to see exactly what is going on) and drawbacks (verbose, repetitive code).

But regardless of whether you will ever use this API, it is good to at least be aware of how it works, because this is what the other interfaces build on: data mapping and tree building both internally use the raw streaming API to read and write Json content.

And next: let's have a look at a more refined method to process Json: Data Binding... stay tuned!

Saturday, January 17, 2009

There are Three -- and Only Three -- Ways to Process Json!

With the proliferation of Json processing packages on the Java platform, there seem to be n+1 ways to slice and dice Json. And each library seems to consider its way (however obscure) to be the One True Way to do things, without even acknowledging that there might be other ways, or bothering to offer alternative methods itself. This is particularly odd for people with an xml background, who are used to standardizing on a limited set of APIs, even though xml as a data format is much more versatile (and complicated) than Json. So what's with this nonsense about myriad ways to slice and dice a dead simple data format?

I'll let you in on a secret: there is not just one sensible way to process Json. There also aren't dozens of sensible alternatives. There are exactly 3 methods to this madness.

  1. Iteration: Iterating over Event (or, token) stream
  2. Data Binding: Binding Json data into Objects (of your favorite language)
  3. Tree Traversal: Building a tree structure (from Json) and traversing it using suitable methods

To give a better idea of what these mean, let us consider Java Standard APIs for these canonical processing methods:

  1. SAX and Stax. These are APIs that essentially allow iterating over events: with SAX it's the parser that spams you with events, and with Stax, you traverse them at your leisure. Push versus pull, but an event stream all the same. Ditto regarding how events are expressed: as callbacks (SAX), event objects (Stax Event API) or just logical cursor state (Stax Cursor API); all just variants of the same approach.
    1. (*) (It is also possible to build more elaborate and convenient facades around this approach: witness StaxMate ("your Stax parser's perfect mate") with its smooth mellow (and yet surprisingly sophisticated!) approach to efficient xml processing -- but I digress!)
  2. JAXB is the standard for data binding; and while there are n+1 alternatives (JiBX, XMLBeans, Castor etc. etc. etc.), they all do the same thing: convert (Java) Objects to xml and vice versa, some conveniently and efficiently, others less so.
  3. DOM is the "most standard" API that defines a tree structure and machinery around it; but as with data binding, there are multiple (better) alternatives as well (XOM, JDOM, DOM4j). And you traverse it either node-by-node, or using XPath.

But these are for xml. How does that relate to Json? Well, it turns out this is one area where the format doesn't matter all that much: all three approaches are valid, useful and relevant with Json as well. And probably with other structured data formats as well. But I cannot think of a fourth one that would immediately make sense (feel free to prove my ignorance by pointing out something!).

And here is the good thing: Your Favorite Json Processor (quick! say Jackson!) implements all said 3 methods (see the sketch after the list). Yay!

  1. Core package (jackson-core) contains JsonParser and JsonGenerator, which allow iterating over tokens
  2. ObjectMapper implements data binding functionality: Objects in, Json out, Json in, Objects out.
  3. TreeMapper can grow trees (expressed as JsonNodes) from Json, and print Json out of a JsonNode (and its children).
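
To put the three side by side, here is a compressed sketch using the pre-1.0 API covered in the follow-up entries (a made-up "input.json" file and "MyBean" class are assumed):

  JsonFactory jsonF = new JsonFactory();

  // 1. Iteration: advance a logical cursor over tokens, pulling data as you go
  JsonParser jp = jsonF.createJsonParser(new File("input.json"));
  while (jp.nextToken() != null) {
    // inspect jp.getCurrentName(), jp.getText() etc. for the current token
  }
  jp.close();

  // 2. Data Binding: let ObjectMapper construct (and write) your own Java objects
  MyBean bean = new ObjectMapper().readValue(new File("input.json"), MyBean.class);

  // 3. Tree Traversal: build a generic JsonNode tree and navigate it
  JsonNode root = new TreeMapper().readTree(new File("input.json"));
  System.out.println("name = " + root.getPath("name").getTextValue());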

Hence, I have proven that the number is Three; that Three is a Good Number; and that Jackson Does Three. So Jackson is All Good. QED.

Friday, January 16, 2009

More Action, Jackson! (0.9.6) Thaw Me Objects!

So here's another Jackson release, this time version 0.9.6. Although the version difference is minuscule (just a patch level increase -- we are running out of pre-1.0 version numbers here!), this is another rather significant upgrade. Why? Because now the Number One feature request -- full Bean/POJO deserialization (reading Java objects from Json) -- is finally implemented.

So how do I do it? It is actually rather simple in most cases: try this:

  MyBean bean = new ObjectMapper().readValue(jsonFile, MyBean.class);
  // maybe something that you earlier wrote using "mapper.writeValue(jsonFile, bean);"?

The only requirement is that MyBean has suitable access methods:

  • Getters (String getName() etc) are needed to serialize (write Objects as Json)
  • Setters (setName(String)) are needed to deserialize (read Objects from Json)
  • for deserialization, a no-arg default constructor is also needed

In the future there may be additional serialization mechanisms (perhaps field-based introspection; and certainly annotations to rename properties and indicate methods with less regular names), but for now the above will (have to) suffice.
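
To make the requirements above concrete, here is a minimal (made-up) bean that satisfies them, and so can be both read from and written as Json:

  public class MyBean
  {
    private String name;
    private int count;

    public MyBean() { } // no-arg constructor, needed for deserialization

    // getters are used when writing the Object out as Json
    public String getName() { return name; }
    public int getCount() { return count; }

    // setters are used when reading the Object back from Json
    public void setName(String name) { this.name = name; }
    public void setCount(int count) { this.count = count; }
  }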

In addition to handling beans well, support for Generics containers works as expected: the only caveats are that:

  • For properties of beans, full Generic type must be included in the set method for deserialization (not needed for serialization)
  • For root-level object type, basic class is not enough (do not try 'List l = mapper.readValue(src, List.class);', won't work): instead you must do: 'List<String> l = mapper.readValue(src, new TypeReference<List<String>>() { } );'. This complexity is due to Java Type Erasure (see earlier blog entries for more details)

So what else is new with 0.9.6? Packaging for one: Jackson now comes in 2 jars:

  • "Core" contains JsonParser and JsonGenerator APIs, implementations
  • "Mapper" contains 2 mappers: "ObjectMapper" that does data binding mentioned above, and "TreeMapper" that can construct DOM-like trees that consist of JsonNode objects. Latter are convenient for more dynamic (scripting-like) access and traversal.

As usual: if you have already used Jackson, do us all a favour and download & use the new version! And if not, well, perhaps download and try out the new version! You will like it.

Sunday, January 11, 2009

Viva Tequila! Woodstox 4.0.0 released

By now this is a bit of yesterday's news, but better late than never: Woodstox version 4.0.0, known as "Tequila", was released first thing this year (January first). Check out the Download page for artifacts (Woodstox is now composed of 2 mandatory jars, see below; and optional ones for RelaxNG/XML Schema validation support), and the release notes for details of what's new.

For a very high-level overview, here are the things I consider the highlights compared to 3.2:

  • Typed Access API: read and write native data conveniently and efficiently (see the sketch after this list). The types supported are a subset of standard XML Schema datatypes, including primitives (numbers like ints, booleans, qualified xml names), arrays (of numbers) and binary (base64). I think the last may be the most important one, since it finally allows efficient, reliable and convenient transfer of inline binary data within xml. I hope to benchmark the benefits in the near future.
  • XML Schema Validation: in addition to the existing DTD and RelaxNG validation, one can now validate content (using the same Stax2 Validation API), both when reading and writing xml. The implementation uses the Sun Multi-Schema Validator as the underlying validation engine.
  • Improved interoperability:
    • Improved support for reading/writing DOM documents and elements: namespace-repairing mode is implemented for writing, and the Typed Access API works completely there too (if not as efficiently as with the Stax API)
    • OSGi: Woodstox jars (and included MSV jars too) are OSGi bundles
    • Maven: better structuring of jars (now Woodstox core is one jar, Stax2 API a separate and required jar)
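
To give an idea of what the Typed Access API looks like in practice, here is a minimal sketch of reading typed element content via the Stax2 extension interfaces (a made-up "point.xml" input is assumed, and the typed accessors shown, like getElementAsInt(), are only a sample: check the Stax2 javadocs for the full set and exact signatures):

  // reading <point><x>15</x><y>-4</y></point> using typed accessors
  XMLInputFactory2 f = (XMLInputFactory2) XMLInputFactory.newInstance();
  XMLStreamReader2 sr = (XMLStreamReader2) f.createXMLStreamReader(new FileInputStream("point.xml"));

  sr.nextTag(); // to <point>
  sr.nextTag(); // to <x>
  int x = sr.getElementAsInt(); // reads "15", leaves cursor at </x>
  sr.nextTag(); // to <y>
  int y = sr.getElementAsInt();
  sr.close();
  System.out.println("x = " + x + ", y = " + y);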

As always: Download Responsibly! (simple rule of thumb: a Tequila Sunrise, sweet -- Tequila Sunset, one too many!)

Saturday, January 03, 2009

You're not my type? Super Type Tokens to Rescue!

Ok, so accessing real type information regarding Java Generic classes and methods is problematic, as per my previous blog entry.

All possible workarounds listed require creating dummy classes (3 mentioned, plus one I forgot about, that of class member/static variables); but the most promising approach is probably that of using sub-classing. The first thought I had was to perhaps require sub-classing the specific class or interface and providing type information that way. For example, to specify type "HashMap<String,String>", you would define a class like:

  public class MyDummyMap extends HashMap<String,String> { }

And then pass "MyDummyMap.class" as the type argument to whatever methods needs this type information. This would work in many cases; for abstract classes and interfaces you could just define class as abstract. After all, we don't need instances of the class, just the class to provide type information for the "Real Type" we want.

However, while this does work, it is not optimal from a robustness perspective: after all, the type of the argument itself is just Class<?>. Further, the caller has to know that the class has to sub-class something and provide full generic typing. It sure would be nice to have a more descriptive type for the argument; as well as some static (and if need be, dynamic) error checking to ensure that type information is properly passed.

This is where "Super Type Tokens" (courtesy of Java Guru mr. Gafter) comes in -- and I'm glad I happened to find it when googling for solutions to the problem. The idea is obvious once you know it: instead of sub-classing the parametrized class, define an abstract reference class and sub-class that reference (which, then, is statically typed and explicitly known to be a form of type reference). It allows for ensuring that typing is provided, and with some additional tricks (such as a reader's suggestion of making TypeReference implement a dummy interface such as Comparable) can be explicit, easy-to-understand, concise and reliable way to pass exact type information.

Even better, an implementation based on these ideas is quite simple.

A possible implementation is as follows (this one adapted from Jackson codebase):


import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

public abstract class TypeReference<T>
  implements Comparable<TypeReference<T>>
{
 final Type _type;

 protected TypeReference()
 {
  Type superClass = getClass().getGenericSuperclass();
  _type = ((ParameterizedType) superClass).getActualTypeArguments()[0];
 }

 public Type getType() { return _type; }

 // We define this method (and require implementation of Comparable)
 // to prevent constructing a reference without type information.
 public int compareTo(TypeReference<T> o) {
  return 0; // dummy, never used
 }
}

And usage would be like:

Decoder d = DecoderFactory.forTypeReference(new TypeReference<HashMap<String,String>>() { } );

which seems clear enough -- even if ideally Java language would have explicit support for proper type information passing.

The only minor annoyance I have left with this is the proliferation of these disposable dummy classes, whose only purpose is to contain a little bit of type information. But I'll take this solution over not having any. :-)


