Thursday, August 08, 2013

Brief History of Jackson the JSON processor

(Disclaimer: this article talks about the Jackson JSON processor -- not other Jacksons, like American cities or presidents; those can be found on Wikipedia)

0. Background

It occurred to me that although it has been almost six years since I released the first public version of Jackson, I have not actually written much about the events surrounding Jackson's development -- I have written about its features, usage, and other important things, but not that much about how it came about.

Since I still remember fairly well how things worked out, and have secondary archives (like this blog, and Maven/SVN/Github repositories) available for fact-checking the timeline, it seems like high time to write a short(ish) historical document on the most popular OSS project I have authored.

1. Beginning: first there was Streaming

Sometime in early 2007, I was working at Amazon.com, and had successfully used XML as the underlying data format for a couple of web services. This was partly due to having written Woodstox, a high-performance Java XML parser. I was actually relatively content with the way things worked with XML, and had learnt to appreciate the benefits of an open, standard, text-based data format (including developer-debuggability, interoperability and -- when done properly -- even simplicity).
But I had also been bitten a few times by XML data-binding solutions like JAXB; and was frustrated both by the complexity of some tools, and by the direction that XML-centric developers were taking, focusing unnecessarily on the format (XML) itself, instead of on how to solve actual development problems.

So when I happened to read about the JSON data format, I immediately saw potential benefits: the main one being that since it was a Data Format -- and not a (Textual) Markup Format (like XML) -- it should be much easier to convert between JSON and (Java) objects. And if that was simpler, perhaps tools could actually do more; offer more intuitive and powerful functionality, instead of fighting with complex monsters like XML Schema or (heaven forbid) leading devs to XSLT.
Other features of JSON that were claimed as benefits, like slightly more compact size (marginally so) or better readability (subjective), I didn't really consider particularly impressive.
Beyond appreciating the good fit of JSON for the web service use case, I figured that writing a simple streaming tokenizer and generator should be easy: after all, I had spent lots of time writing the low-level components necessary for tokenizing content (I started writing Woodstox in late 2003, around the time the Stax API was finalized).

Turns out I was right: I got a streaming parser working in about two weeks (and a generator in less than a week). In a month I had things working well enough that the library could be used for something. And then it was ready to be released ("release early, release often"); and the rest is history, as they say.

Another reason for writing Jackson, which I have occasionally mentioned, was what I saw as the sorry state of JSON tools -- my personal pet peeve being the use of org.json's reference implementation. While it was fine as a proof-of-concept, I consider(ed) it a toy library: too simplistic, an underpowered thing for "real" work. Other alternatives just seemed to short-change one aspect or another: I was especially surprised to find a total lack of modularity (streaming vs higher levels) and scant support for true data-binding -- solutions tended to either assume unusual conventions or require lots of seemingly unnecessary code to be written. If I am to write code, I'd rather do it via an efficient streaming interface; or if not, get powerful and convenient data-binding. Not a half-assed XML-influenced tree model, which was en vogue (and sadly, often still is).

And the last thing regarding ancient history: the name. I actually do not remember the story behind it -- obviously it is a play on JSON. I vaguely recall toying with the idea of calling the library "Jason", but deciding that might sound too creepy (I knew a few Jasons, and didn't want confusion). Compared to Woodstox -- where I actually remember that my friend Kirk P gave me the idea (related to Snoopy's friend, a bird named Woodstock!) -- I don't really know whom to credit for the idea, or what the inspiration was.

2. With a FAST Streaming library...

Having written (and quickly published, in August 2007) the streaming-only version of Jackson, I spent some time optimizing and measuring things, as well as writing some code to see how convenient the library was to use. But my initial thinking was to wrap things up relatively soon, and "let Someone Else write the Important Pieces". And by "important pieces" I mostly meant a data-binding layer; something like what JAXB and XMLBeans are to XML streaming components (SAX/Stax).

The main reasons for my hesitation were two-fold: I thought that

  1. writing a data-binding library would be lots of work, even if JSON lends itself much more easily to it; and
  2. to do binding efficiently, I would have to use code generation; the Reflection API was "known" to be unbearably slow

Turns out that I was 50% right: data-binding has consumed the vast majority of the time I have spent on Jackson. But I was largely wrong with respect to Reflection -- more on that in a bit.

In the short term (during the summer and autumn of 2008) I did write "simple" data-binding, to bind Java Lists and Maps to/from token streams; and I also wrote a simple Tree Model, the latter of which has been rewritten since then.

3. ... but No One Built It, So I did

Jackson the library got a relatively high level of publicity from early on. This was mostly due to my earlier work on Woodstox, and its adoption by all major second-generation Java SOAP stacks (CXF née XFire; Axis 2). Given my reputation for producing fast parsers and generators, there was interest in using what I had written for JSON. But early adopters used things as-is; and no one (to my knowledge) tried to build the higher-level abstractions that I eagerly wanted to be written.

But that alone might not have been enough to push me to try my luck at writing data-binding. What was needed was a development that made me irritated enough to dive in deep... and sure enough, something did emerge.

So what was the trigger? It was the idea of using XML APIs to process JSON (that is, using adapters to expose JSON content as if it were XML). While most developers who wrote such tools considered them a stop-gap solution to ease transition, many developers did not seem to know this.
I thought (and still think) that this is an OBVIOUSLY bad idea; and initially did not spend much time refuting the idea's supposed merits -- why bother, as anyone should see the problem? I assumed that any sane Java developer would obviously see that the "Format Impedance" -- the difference between JSON's Object (or Frame) structure and XML's hierarchic model -- is a major obstacle, and would render the use of JSON even MORE CUMBERSOME than using XML.

And yet I saw people suggesting the use of tools like Jettison (JSON via the Stax API), even integrating it into otherwise good frameworks (JAX-RS implementations like Jersey). Madness!

Given that developers appeared intent on ruining a good thing, I figured I needed to show the Better Way; just talking about it would not be enough.
So, late in 2008, around the time I moved on from Amazon, I started working on a first-class Java/JSON data-binding solution. This can be thought of as the "real" start of Jackson as we know it today; a bit over one year after the first release.

4. Start data-binding by writing Serialization side

The first Jackson version to contain real data-binding was 0.9.5, released in December 2008. Realizing that this was going to be a big undertaking, I first focused on the simpler problem of serializing POJOs as JSON (that is, taking values of Java objects and writing equivalent JSON output).
Also, to make it more likely that I would actually complete the task, I decided to simply use Reflection "at first"; performance should really matter only once the thing actually works. Besides, this way I would get some idea of the magnitude of the overhead: having written a fair bit of manual JSON handling code, it would be easy to compare the performance of hand-written code against the fully automated data-binder.

I think the serializer took about a month to get working to some degree, and a week or two to weed out bugs. The biggest surprise to me was that the Reflection overhead actually was NOT all that big -- it seemed to add maybe 30-40% to execution time; some of which might be due to overhead other than the Reflection access itself (Reflection is just used for dynamically calling get-methods or accessing field values). This was such a non-issue for the longest time that it took multiple years for me to go back to the idea of generating accessor code (for the curious, the Afterburner module is the extension that finally does this).
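
To make "using Reflection" concrete, here is a minimal sketch (not Jackson's actual implementation) of the kind of dynamic accessor access involved: discover the get-methods of a class, then invoke them per instance.

import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReflectionSketch {
    // Collects "property name -> value" entries by dynamically calling zero-argument getters.
    public static Map<String, Object> extractProperties(Object bean) throws Exception {
        Map<String, Object> props = new LinkedHashMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            if (name.startsWith("get") && name.length() > 3
                    && m.getParameterCount() == 0 && !name.equals("getClass")) {
                String prop = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                props.put(prop, m.invoke(bean)); // the dynamic Reflection call
            }
        }
        return props;
    }
}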

My decision to start with serialization (without considering the other direction, deserialization) was a good one for the project, I believe, but it did have one longer-term downside: much of the code between the two parts was disjoint. Partly this was due to my then-view that there are many use cases where only one side is needed -- for example, a Java service only ever writing JSON output, but not necessarily reading it (simple query parameters and URL paths go a long way). But a big part was that I did not want to slow down writing of serialization by having to also consider the challenges of deserialization.
And finally, I had some bad memories from JAXB, where the requirement to have both getters AND setters was occasionally a pain-in-the-buttocks for write-only use cases. I did not want to repeat the mistakes of others.

Perhaps the biggest practical result of the almost complete isolation between the serialization and deserialization sides was that sometimes annotations needed to be added in multiple places; like indicating on both the setter and the getter what the JSON property name should be. Over time I realized that this was not a good thing; but the problem itself was only resolved in Jackson 1.9, much later.
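
As an illustration (with a made-up class, and the Jackson 2.x annotation package), this is the kind of duplication the pre-1.9 split could lead to: renaming a property meant telling both accessors about it.

import com.fasterxml.jackson.annotation.JsonProperty; // org.codehaus.jackson.annotate.JsonProperty in 1.x

public class User {
    private String userName;

    @JsonProperty("name")            // needed on the getter for serialization
    public String getUserName() { return userName; }

    @JsonProperty("name")            // and again on the setter for deserialization
    public void setUserName(String userName) { this.userName = userName; }
}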

5. And wrap it up with Deserialization

After the serialization (and resulting 0.9.5) release, I continued work on deserialization, and perhaps surprisingly finished it slightly faster than serialization. Or perhaps it is not that surprising; even without working on deserialization concepts earlier, I had nonetheless tackled many of the issues I would need to solve, including that of using Reflection efficiently and conveniently; and that of resolving generic types (which is a hideously tricky problem in Java, as readers of my blog should know by now).

The result of this was the 0.9.6 release, in January 2009.

6. And then on to Writing Documentation

After managing to get the first fully functional version of data-binding available, I realized that the next blocker would be lack of documentation. So far I had blogged occasionally about Jackson usage; but for the most part I had relied on the resourcefulness of the early adopters, those hard-working, hardy pioneers of development. But if Jackson was to become the King of JSON on the Java platform, I would need to do more for its users.

Looking at my blog archive, I can see that some of the most important and most read articles on the site are from January 2009. Beyond the obvious introductions to the various operating modes (like "Method 2, Data Binding"), I am especially proud of "There are Three Ways to Process Json!" -- an article that I think is still relevant. And something I wish every Java JSON developer would read, even if they didn't necessarily agree with all of it. I am surprised how many developers blindly assume that one particular view -- often the Tree Model -- is the only mode in existence.

7. Trailblazing: finally getting to add Advanced Features

Up until version 1.0 (released in May 2009), I don't consider my work to have been particularly new or innovative: I was using good ideas from past implementations and my experience in building better parsers, generators, tree models and data binders. I felt Jackson was ahead of the competition in both the XML and JSON space; but perhaps the only truly advanced thing was generic type resolution, and even there, I had more to learn yet (eventually I wrote Java ClassMate, which I consider the first Java library to actually get generic type resolution right -- more so than Jackson itself).

This lack of truly new, advanced (from my point of view) features was mostly because there was so much to do: all the foundational code, implementing all the basic and intermediate things that were (or should have been) expected from a Java data-binding library. I did have ideas, but in many cases had postponed them until I felt I had time to spare on "nice-to-have" things, or features that were more speculative and might not even work; either functionally, or with respect to developers finding them useful.

So at this point, I figured I would have the luxury of aiming higher; not just making a slightly Better Mousetrap, but something that is... Something Else altogether. And with the following 1.x versions, I started implementing things that I consider somewhat advanced, pushing the envelope a bit. I could talk or write for hours on various features; what follows is just a sampling. For a slightly longer take, read my earlier "7 Killer Features of Jackson".

7.1 Support for JAXB annotations

With Jackson 1.1, I also started considering interoperability. And although I thought that compatibility with XML is a Bad Idea when done at the API level, I thought that certain aspects could be useful: specifically, the ability to use (a subset of) JAXB annotations for customizing data-binding.

Since I did not think that JAXB annotations alone could suffice to cover all configuration needs, I had to figure out a way for JAXB and Jackson annotations to co-exist. The result is the concept of the "Annotation Introspector", and it is something I am actually proud of: even if supporting JAXB annotations has been lots of work, and caused various frustrations (mostly because JAXB is XML-specific, and some concepts do not translate well), I think the mechanism used for isolating annotation access from the rest of the code has worked very well. It is one area that I managed to design right the first time.
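
A minimal sketch of how this looks with Jackson 2.x classes (the JAXB introspector lives in the jackson-module-jaxb-annotations extension; the exact setup here is an assumption, not taken from this entry): two introspectors are chained, with Jackson annotations taking precedence and JAXB ones used as a fallback.

import com.fasterxml.jackson.databind.AnnotationIntrospector;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector;
import com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector;

public class JaxbSetup {
    public static ObjectMapper createMapper() {
        ObjectMapper mapper = new ObjectMapper();
        AnnotationIntrospector primary = new JacksonAnnotationIntrospector();
        AnnotationIntrospector secondary = new JaxbAnnotationIntrospector(mapper.getTypeFactory());
        // Chain the two: Jackson annotations win, JAXB annotations are consulted as a fallback
        mapper.setAnnotationIntrospector(AnnotationIntrospector.pair(primary, secondary));
        return mapper;
    }
}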

It is also worth mentioning that beyond the ability to use alternative "annotation sets", Jackson's annotation handling logic has always been relatively advanced: for example, whereas standard JDK annotation handling does not support overriding (that is, annotations are not "inherited" from overridden methods), Jackson supports inheritance of Class, Method and even Constructor annotations. This has proven to be a good decision, even if implementing it for 1.0 was lots of work.

7.2 Mix-in annotations

One of the challenges with Java annotations is the fact that one has to be able to modify the classes being annotated. Beyond requiring actual access to sources, this can also add unnecessary and unwanted dependencies from value classes to annotations; and in the case of Jackson, these dependencies point in the wrong direction, from a design perspective.

But what if one could just loosely associate annotations, instead of having to forcibly add them to classes? This was the thought exercise I had; and it led to what I think was the first Java implementation of "mix-in annotations". I am happy that 4 years since their introduction (they were added in Jackson 1.2), mix-in annotations are one of the most loved Jackson features; and something that I still consider innovative.
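
A minimal sketch of the idea (class names made up; addMixIn is the Jackson 2.x method name): annotations live on a separate "mix-in" class, which is associated with the target type at mapper configuration time, so the target class itself stays untouched.

import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MixInExample {
    // Imagine this comes from a third-party jar; we cannot annotate it directly.
    public static class ThirdPartyPoint {
        public int x, y;
        public int internalFlags;
    }

    // Mix-in: only its annotations matter, the class is never instantiated.
    abstract static class PointMixIn {
        @JsonProperty("xCoord") public int x;
        @JsonIgnore public int internalFlags;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        mapper.addMixIn(ThirdPartyPoint.class, PointMixIn.class);
        ThirdPartyPoint p = new ThirdPartyPoint();
        p.x = 1; p.y = 2;
        // typically prints {"xCoord":1,"y":2} -- internalFlags is omitted
        System.out.println(mapper.writeValueAsString(p));
    }
}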

7.3 Polymorphic type support

One feature that I was hoping to avoid having to implement (kind of similar, in that sense, to data-binding itself) was support for one of the core Object Serialization concepts (though not necessarily a data-binding concept; data is not polymorphic, classes are): type metadata.
What I mean here is that given a single static (declared) type, one should still be able to deserialize instances of multiple types. When serializing there is no problem -- the type is available from the instance being serialized -- but to deserialize properly, additional information is needed.

There are multiple problems in trying to support this with JSON: starting with the obvious problem of JSON not having a separation of data and metadata (with XML, for example, it is easy to "hide" metadata as attributes). But beyond this question, there are various alternatives for type identifiers (logical name or physical Java class?), as well as alternative inclusion mechanisms (an additional property? with what name? or a wrapper Array or Object?).

I spent lots of time trying to figure out a system that would satisfy all the constraints I had set: keep things easy to use and simple, yet powerful and configurable enough.
It took multiple months to figure it all out; but in the end I was satisfied with my design. Polymorphic type handling was included in Jackson 1.5, less than one year after the release of 1.0. And still most Java JSON libraries have no support at all for polymorphic types, or at most support fixed use of the Java class name -- I know how much work it can be, but at least one could learn from existing implementations (which is more than I had).
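
For a flavor of what the resulting configuration looks like in practice, here is a sketch using Jackson's standard annotations, with a logical type name included as an extra property (the class and property names are illustrative only):

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PolymorphicExample {
    @JsonTypeInfo(use = JsonTypeInfo.Id.NAME,          // logical name, not Java class
                  include = JsonTypeInfo.As.PROPERTY,  // included as an extra property
                  property = "type")
    @JsonSubTypes({
        @JsonSubTypes.Type(value = Cat.class, name = "cat"),
        @JsonSubTypes.Type(value = Dog.class, name = "dog")
    })
    public abstract static class Animal { public String name; }
    public static class Cat extends Animal { public boolean indoor; }
    public static class Dog extends Animal { public double weight; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String json = mapper.writeValueAsString(new Dog());
        // json contains "type":"dog", which lets deserialization pick the right subtype
        Animal back = mapper.readValue(json, Animal.class);
        System.out.println(back.getClass().getSimpleName()); // Dog
    }
}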

7.4 No more monkey code -- Mr Bean can implement your classes

Of all the advanced features Jackson offers, this is my personal favorite: and something I had actually hoped to tackle even before the 1.0 release.

For a full description, go ahead and read "Mr Bean aka Abstract Type Materialization"; but the basic idea is, once again, simple: why is it that even if you can define your data type as a simple interface, you still need to write monkey code around it? Other languages have solutions there; and some later Java tools like Lombok have presented alternatives. But I am still not aware of a general-purpose Java library for doing what Mr Bean does (NOTE: you CAN actually use Mr Bean outside of Jackson too!).

Mr Bean was included in Jackson 1.6 -- which was a release FULL of good, innovative new stuff. The reason it took such a long time for me to build was hesitation -- it was the first time I used Java bytecode generation. But after starting to write the code I learnt that it was surprisingly easy to do; and I just wished I had started earlier.
Part of the simplicity was due to the fact that literally the only things to generate were accessors (setters and/or getters): everything else is handled by Jackson itself, by introspecting the resulting class, without having to even know there is anything special about the dynamically generated implementation class.
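
A minimal usage sketch (module and class names from the jackson-module-mrbean extension; the data type here is made up): the caller binds JSON straight to an interface, and the module materializes the implementation class behind the scenes.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.mrbean.MrBeanModule;

public class MrBeanExample {
    // No hand-written implementation class anywhere:
    public interface Point {
        int getX();
        int getY();
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new MrBeanModule());
        Point p = mapper.readValue("{\"x\":1,\"y\":2}", Point.class);
        System.out.println(p.getX() + "," + p.getY()); // 1,2
    }
}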

7.5 Binary JSON (Smile format)

Another important milestone of Jackson 1.6 was the introduction of a (then-)new binary data format called Smile.

Smile was born out of my frustration with all the hype surrounding Google's protobuf format: there were tons of hyperbole caused by the fact that Google was opening up the data format they were using internally. Protobuf itself is a simple and very reasonable binary data format, suitable for encoding datagrams used for RPC. I call it "the best of 80s datagram technology"; not as an insult, but as a nod to the maturity of the idea -- it automates things that back in the 80s (and perhaps earlier) were hand-coded whenever data communication was needed. Nothing wrong with that.

But my frustration had more to do with creeping premature optimization, and the myopic view that binary formats were the only way to achieve acceptable performance for high-volume communication. I maintain that this is not true in the general case.

At the same time, there are valid benefits from proper use of efficient binary encodings. And one approach that seemed attractive to me was that of using an alternative physical encoding to represent an existing logical data model. This idea is hardly new; it had been demonstrated with XML, with BNUX, Fast Infoset and other approaches (all of which predate the later sad effort known as EXI). But so far this had not been tried with JSON -- sure, there is BSON, but it is not 1-to-1 mappable to JSON (despite what its name suggests), it is just another odd (and very verbose) binary format.
So I thought that I should be able to come up with a decent binary serialization format for JSON.

The timing of this effort was rather good, as I had joined Ning earlier that year, and had an actual use case for Smile. At Ning, Smile was used for some high-volume systems, such as log aggregation (think of systems like Kafka or Splunk). Smile turns out to work particularly well when coupled with ultra-fast compression like LZF (implemented at and for Ning as well!).

And beyond Ning, I had the fortune of working with the creative genius(es) behind ElasticSearch; this was a match made in heaven, as they were just looking for an efficient binary format to complement their use of JSON as the external data format.

And what about the name? I think I need to credit Mr. Sunny Gleason on this; we brainstormed the idea, and it came about directly when we considered what "magic cookie" (the first 4 bytes used to identify the format) to use -- using a smiley seemed like a crazy enough idea to work. So Smile-encoded data literally "Starts With a Smile!" (check it out!)
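
A small sketch of what using Smile looks like with the 2.x format module (jackson-dataformat-smile; details assumed): the only change from JSON usage is plugging in a different streaming factory, and the output really does start with ':' and ')'.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import java.util.Collections;
import java.util.Map;

public class SmileExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
        byte[] encoded = smileMapper.writeValueAsBytes(Collections.singletonMap("message", "Hi!"));
        // Header is ':', ')', '\n' plus a version/flags byte -- data literally starts with a smile
        System.out.printf("%c%c%n", encoded[0], encoded[1]);
        Map<?, ?> decoded = smileMapper.readValue(encoded, Map.class);
        System.out.println(decoded); // {message=Hi!}
    }
}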

7.6 Modularity via Jackson Modules

One more major area of innovation in the Jackson 1.x series was the introduction of the "Module" concept in Jackson 1.7. From a design/architectural perspective, it is the most important change during Jackson's development.

The background to modules was my realization that I neither could nor wanted to be the person providing Jackson support for all useful Java libraries; for datatypes like Joda, or the Collection types of Guava. But neither should users be left on their own, having to write handlers for things that do not (and often, can not) work out of the box.

But if not me or users, who would do it? The answer of "someone else" does not sound great, until you actually think about it a bit. While I think the ideal case is that the library maintainers (of Joda, Guava, etc.) would do it, the most likely case is that "someone with an itch" -- a developer who happens to need JSON serialization of, say, Joda datetime types -- is the person who can add this support. The challenge, then, is one of co-operation: how could this work be turned into something reusable, modular... something that could essentially be released as a "mini-library" of its own?

This is where the simple interface known as Module comes in: it is simply a way to package the necessary implementations of Jackson handlers (serializers, deserializers, and the other components they rely on for interfacing with Jackson), and to register them with Jackson, without Jackson having any a priori knowledge of the extension in question. You can think of them as the Jackson equivalent of plug-ins.
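
A minimal sketch of what a module can look like, using the SimpleModule convenience class to package a custom serializer (the LocalDate handling here is just an illustration; real date/time modules exist separately):

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;
import java.io.IOException;
import java.time.LocalDate;

public class DateModuleExample {
    static class LocalDateSerializer extends StdSerializer<LocalDate> {
        LocalDateSerializer() { super(LocalDate.class); }
        @Override
        public void serialize(LocalDate value, JsonGenerator gen, SerializerProvider provider)
                throws IOException {
            gen.writeString(value.toString()); // ISO-8601, e.g. "2013-08-08"
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleModule module = new SimpleModule("my-localdate-module");
        module.addSerializer(new LocalDateSerializer());
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(module); // mapper needs no a priori knowledge of LocalDate handling
        System.out.println(mapper.writeValueAsString(LocalDate.of(2013, 8, 8)));
    }
}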

8. Jackson 2.x

Although there were more 1.x releases after 1.6, all introducing important and interesting new features, the focus during those releases started to move towards bigger challenges regarding development. It was also challenging to try to keep things backwards-compatible, as some earlier API design (and occasionally implementation) decisions had proven to be sub-optimal. With this in mind, I started thinking about the possibility of making a bigger, somewhat backwards-incompatible major change.

The idea of 2.0 started maturing around the time of releasing Jackson 1.8; and so version 1.9 was designed with the upcoming "bigger change" in mind. It turns out that future-proofing is hard, and I don't know how much all the planning helped. But I am glad that I thought through multiple possible scenarios regarding how versioning could be handled.

The most important decision -- and one I think I did get right -- was to change the Java and Maven packages Jackson 2.x uses: it should be (and is!) possible to have both Jackson 1.x and Jackson 2.x implementations on the classpath without conflicts. I have to thank my friend Brian McCallister for this insight -- he convinced me that this is the only sane way to go. And he is right. The alternative of just keeping the same package name is akin to playing Russian Roulette: things MIGHT work, or might not. But you are actually playing with other people's code; and they can't really be sure whether it will work for them without trying... and often find out too late if it doesn't.
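
For reference, the two generations use different root Java packages (and Maven group ids), which is exactly why they can coexist; for example, the main mapper class is reachable under both names on a single classpath:

public class SideBySide {
    // Both resolve without conflict because the root packages differ:
    org.codehaus.jackson.map.ObjectMapper oldMapper =
            new org.codehaus.jackson.map.ObjectMapper();          // Jackson 1.x
    com.fasterxml.jackson.databind.ObjectMapper newMapper =
            new com.fasterxml.jackson.databind.ObjectMapper();    // Jackson 2.x
}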

So although it is more work all around for cases where things would have worked, it is definitely much, much less work and pain for cases where you would have had problems with backwards compatibility. In fact, the amount of work is quite constant; and most changes are mechanical.

Jackson 2.0 took its time to complete, and was released in February 2012.

9. Jackson goes XML, CSV, YAML... and more

One of the biggest changes with Jackson 2.x has been the huge increase in the number of Modules. Many of these handle specific datatype libraries, which was the original use case. Some modules implement new functionality; Mr Bean, for example, which was introduced in 1.6, was re-packaged as a Module in later releases.

But one of those Crazy Ideas ("what if...") that I had somewhere during 1.x development was to consider the possibility of supporting data formats other than JSON.
It started with the obvious question of how to support the Smile format; but that was relatively trivial (although it did need some changes to the underlying system, to reduce deep coupling with physical JSON content). Adding Smile support led me to realize that the only JSON-specific handling occurs at the streaming API level: everything above that level only deals with token streams. So what if... we simply implemented alternative backends that can produce/consume token streams? Wouldn't this allow data-binding to be used with data formats like YAML, BSON and perhaps even XML?

Turns out it can, indeed -- and at this point, Jackson supports half a dozen data formats beyond JSON (see here); and more will be added over time.
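
A small sketch of what this looks like in practice with a couple of the 2.x format backends (jackson-dataformat-xml and jackson-dataformat-yaml; module names assumed): the data-binding calls stay the same, only the backend changes.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiFormatExample {
    public static void main(String[] args) throws Exception {
        Map<String, Object> value = new LinkedHashMap<>();
        value.put("id", 1234);
        value.put("name", "Jackson");
        // Same data-binding calls, different token-stream backends:
        System.out.println(new ObjectMapper().writeValueAsString(value));                  // JSON
        System.out.println(new XmlMapper().writeValueAsString(value));                     // XML
        System.out.println(new ObjectMapper(new YAMLFactory()).writeValueAsString(value)); // YAML
    }
}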

10. What Next?

As of writing this entry I am working on Jackson 2.3; and the list of possible things to work on is as long as ever. Once upon a time (around finalizing 1.0) I was under the false impression that maybe I would be able to wrap up the work in a release or two, and move on. But given how many feature-laden versions I have released since then, I no longer think that Jackson will be "complete" any time soon.

I hope to write more about Jackson's future ... in the (near, I hope) future. I hope the above gave you more perspective on "where's Jackson been?"; and perhaps it hints at where it is going as well.

Tuesday, February 15, 2011

Basic flaw with most binary formats: missing identifiable prefix (protobuf, Thrift, BSON, Avro, MsgPack)

Ok: I admit that I have many reservations regarding many existing binary data formats; and this is a major reason why I worked on the Smile format specification -- to develop a format that tries to address various deficiencies I have observed.

But while the full list of grievances would be long, I realized today that there is one basic design problem common to pretty much all of these formats -- at least Thrift, protobuf, BSON and MsgPack -- and that is: the lack of any kind of reliable, identifiable prefix. The commonly used technique of a "magic number", used to allow reliable type detection for things like image formats, appears to be unknown to binary data format designers. This is a shame.

1. The Problem

Given a piece of data (a file, a web resource), one important piece of metadata is its structure. While this is often available explicitly from the context, that is not always the case; and even if it could be added, there are benefits to being able to automatically detect the type: this can significantly simplify systems, or extend functionality by accepting multiple kinds of formats. Various graphics programs, for example, can operate on different image storage formats, without necessarily having any metadata available beyond the actual data.

So why does this matter? It helps in verifying basic correctness of interaction in many cases: if you can detect what is and what is not a valid piece of data in a format, life is much easier: you have a chance to know immediately when a piece of data is completely corrupt, or when you are being fed data in some format other than the one you expect. Or, if you support multiple formats, you can add automatic handling of the differences.

2. Textual formats do it well

But let's go back to the commonly used textual data formats: XML and JSON. Of these, XML specifies the "xml declaration", which can be used not only to determine the text encoding (UTF-8, etc.) used but also the fact that the data is XML. It is cleanly designed and simple to implement. As if it was designed by people who knew what they were doing.

JSON does not define such a prefix, but the specification does give exact rules for detecting valid JSON, as well as the encodings that can be used; so in practice JSON auto-detection is as easy to implement as that for XML.

3. But most new binary formats don't

Now, the task of defining a unique (enough) header for a binary format would be even easier than for textual formats, because structurally there is less variance: no need to allow variable text encodings, arbitrary white space, or other lexical sugar. It took me very little time to figure out the simple scheme used by Smile to indicate its type (which in itself was inspired by the design of the PNG image format, an example of very good data format design).
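
To make the point concrete, here is the kind of trivial check a fixed prefix makes possible (shown for Smile, whose header is ':', ')', '\n' plus a version/flags byte; a sketch, not library code):

public class FormatSniffer {
    // Returns true if the given bytes start with the Smile "magic cookie"
    public static boolean looksLikeSmile(byte[] data) {
        return data != null && data.length >= 4
                && data[0] == ':' && data[1] == ')' && data[2] == '\n';
    }
}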

So you might think that binary formats would excel in this area. Unfortunately, you would be wrong.

As far as I can see, the following binary data formats have little or no support for type detection:

  • Thrift does not seem to have a type identifier at its format layer. There is actually a small amount of metadata at the RPC level (there is a message-start structure of some kind), but this only helps if you want/need to use Thrift's RPC layer. Another odd thing is that the internal API actually exposes hooks that could be used to handle type identifiers; it is as if the designers were at least aware of the possibility of using some markers to enclose main-level data entities.
  • protobuf does not seem to have anything to allow type detection of a given blob of protobuf data. I guess protobuf never claimed to be useful for anything beyond tightly coupled low-level system integration (although some clueless companies are apparently using it for data storage... which is just a plain old Bad Idea), so maybe I could buy the argument that this is just not needed, that there is never any "arbitrary protobuf data" around. Still... adding a tiny bit of redundancy would make sense for diagnostic purposes; and given that protobuf already has some redundancy (field ids, instead of relying on ordering) it would seem acceptable to use the first 2 or 4 bytes for this.
  • MsgPack and BSON both just define a "raw" encoding, without any format identifier that I can see. This is especially puzzling since, unlike protobuf and Thrift, they do not require a schema to be used; that is, they carry plenty of other metadata (types, names of struct members, even length prefixes). So why make these data formats completely unidentifiable?

4. But what about Avro?

There is one exception aside from Smile, however. Avro seems to do the right thing (as far as I can read the specification) -- at least when explicitly storing Avro data in a file (I assume this includes map/reduce use cases, stored in HDFS): there is a simple prefix to use, as well as a requirement to store the schema used. This makes sense, since my biggest concern with formats like protobuf and Thrift is that being "schema-ridden", data without a schema is all but useless. Requiring that the two are bundled -- when stored -- makes sense; optimizations can be used for transfer.

So in this respect Avro definitely seems like a better design than the 4 other binary data formats listed above.

5. Why do I care?

As part of my on-going expansion of Jackson ("the universal data processor"), I am thinking of adding many more backends (to support reading and writing data in alternate data formats), to allow clean and efficient data binding to/from most any commonly used data format. Ideally this would include binary data formats. Current plans are to include format detection functionality in such a way that new codecs can detect data they are capable of reading and writing; and this will work just fine for most existing formats that Jackson can handle (JSON, Smile, XML). I also assumed that since it would be very easy to design data formats that can be reliably detected, existing formats should be a piece of cake to detect. It was only when I started digging into the details of binary data formats that the sad reality sank in...
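
As a sketch of the kind of API this could lead to (the class names below are from the detection helper that jackson-core later gained; treat the details as an assumption rather than a promise from this entry): candidate factories are registered, and the one whose format matches the given bytes wins.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.format.DataFormatDetector;
import com.fasterxml.jackson.core.format.DataFormatMatcher;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

public class DetectExample {
    public static String detectFormat(byte[] data) throws Exception {
        DataFormatDetector detector = new DataFormatDetector(new JsonFactory(), new SmileFactory());
        DataFormatMatcher match = detector.findFormat(data);
        return match.hasMatch() ? match.getMatchedFormatName() : "unknown";
    }
}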

On the plus side, this makes it easier to focus on adding first-rate support for data formats that are easy to detect. So I will probably prioritize Avro compatibility significantly higher than the others; and I will unfortunately have to downgrade my work on adding Thrift support, which would otherwise be the most important "alien" format to support (due to existing use by infrastructure I am working on).

Sunday, February 06, 2011

On prioritizing my Open Source projects, retrospect #2

(note: related to the original "on prioritizing OS projects" entry, as well as the first retrospect entry)

1. What was the plan again?

Ok, it has been almost 4 months since my last medium-term, high-level prioritization overview. The planned list back then had these entries:

  1. Woodstox 4.1
  2. Aalto 1.0 (complete async API, impl)
  3. Jackson 1.7: focus on extensibility
  4. ClassMate 1.0
  5. Externalized Mr Bean (not dependant on Jackson)
  6. StaxMate 2.1
  7. Tr13 1.0

2. And how have we done?

Looks like we got about half of it done. Point by point:

  1. DONE: Woodstox 4.1 (with 4.1.1 patch release)
  2. Almost: Aalto 1.0 -- half-done; but significant progress, API is defined, about half of implementation work done
  3. DONE: Jackson 1.7 (with 1.7.1 and 1.7.2 patch releases)
  4. Almost: ClassMate 1.0 not completed; version 0.5.2 released, javadocs published, minor work remains
  5. Deferred: Externalized Mr Bean -- no work done (only some preliminary scoping)
  6. DONE? StaxMate 2.1 -- released a 2.0.1 patch instead, which contains fixes to found issues but no new features, which would have defined 2.1.
  7. Some work done: Tr13: incremental work, but no definite 1.0 release (did release 0.2.5 patch version with cleanup)

I guess it is less than half, since only 2 things were fully completed (or 3 if StaxMate 2.0.1 counts). But then again, of the remaining tasks only one did not progress at all; and many are close to being completed (in fact, I was hoping to wrap up Aalto before doing this update). And the ones deferred were lower entries on the list.

On the other hand, I did work on a few things that were not on the list. For example:

  • Started "jackson-xml-databinding" project (after Jackson 1.7.0), got first working version (0.5.0)
  • Started multiple other Jackson extension projects (jackson-module-hibernate, jackson-module-scala), with working builds and somewhat usable code; these based on code contributed by other Jackson developers
  • Started "java-cachemate" project, designed concept and implemented in-memory size-limited-LRU-cache (used already in a production system)

This just underlines how non-linear open source development can be; it is often opportunistic -- but not necessarily in a negative way -- and heavily influenced by feedback, as well as newly discovered inter-dependencies and opportunities.

3. Updated list

Let's try guesstimating what to do going forward, then, shall we? Starting with the leftovers, we could get something like:

  • Aalto 1.0: complete async implementation; do some marketing
  • ClassMate 1.0: relatively small amount of work (expose class annotations)
  • Java CacheMate: complete functionality, ideally release 1.0 version
  • Tr13: either complete 1.0, or augment with persistence options from cachemate (above)
  • Externalized Mr Bean? This is heavily dependent on external interest
  • Jackson 1.8: target most-wanted features (maybe external type id, multi-arg setters)
  • Jackson-xml-databinding 1.0: more testing, fix a couple of known issues
  • Work on Smile format; try to help with libsmile (C impl), maybe more formal specification; performance measurements, other advocacy; maybe even write a javascript codec

Other potential work could include:

  • StaxMate 2.1 with some new functionality
  • Woodstox 5.0, if there is interest (raise JDK minimum to 1.5, maybe convert to Maven build)
  • Jackson-module-scala: help drive 1.0 version, due to amount of interest in full Scala support
  • Jackson-module-csv: support data-binding to/from CSV -- perhaps surprisingly, much of "big data" exists as plain old CSV files...

But chances are that the above lists are also incomplete... let's check back in May, on our first "anniversary" retrospect.

Thursday, February 03, 2011

Why do modularity, extensibility, matter?

After writing about the Jackson 1.7 release, I realized that while I described what was done, and how, to significantly improve the modularity and extensibility of Jackson, I did not talk much about why I felt both were desperately needed. So let's augment that entry with a bit more background, and fill in the blanks.

The two things actually go together: while modularity in itself is somewhat useful, it becomes extremely important when coupled with extensibility (and conversely, it is hard to be extensible without being modular). So I will consider them together, as "modular extensibility", in what follows.

1. Distributed development

The most obvious short-term benefit of better modularization and extensibility is that it allows a simple form of distributed development, as additional extension modules (and the projects under which they are created) can be built independently of the core project. There are dependencies, of course -- modules may need certain features of the core library -- but this is much looser coupling than having to actually work within the same codebase, coordinating changes. This alone would be worth the effort.

But the need for distribution stems from the obvious challenge with Jackson's (or any similar project's) status quo: the core project, and its author (me), can easily become a bottleneck. This is due to the coordination needed, such as code reviews and patch integration; much of which is most efficiently done with a simple stop-and-wait'ish approach. While it is possible to increase concurrency within one project and codebase (with lots of additional coordination and communication, both of which are hard if the activity levels of participants fluctuate), it is much easier and more efficient to do this via separate projects.

Not all projects can take the route we are taking, since one reason such modularity is possible is the expansion of the project's scope: extensions for new datatypes are "naturally modular" (conceptually at least; implementation-wise this is only now becoming true), and similarly support for non-Java JVM languages (Scala, Clojure, JRuby) and non-JSON data formats (BSON, XML, Smile). But there are many projects that could benefit from more focus on modular extensibility.

2. Reduced coupling leads to more efficient development

Reduced coupling between pieces of functionality in turn allows for much more efficient development. This is due to multiple factors: less need for coordination; the efficiency of working on smaller pieces (bigger projects, like bigger companies, have much more inherent overhead and lower productivity); shorter release cycles. Or, instead of uniformly shorter development and release cycles, it is more accurate to talk about more optimal cycles: new, active projects can have shorter cycles and release more often, while more mature, slower-moving ones (or ones with a more established user base and hence bigger risks from regression) can choose a slower pace. The key point is that each project can choose the most optimal rate of releases, and only synchronize when some fundamental "platform" functionality is needed.

As an example, the core Jackson project has released a significant new version every 3 to 6 months. While this is a pretty respectable rate in itself, it is a glacial pace compared to releases for, say, the "jackson-xml-databinding" module, which might release new versions on a weekly basis before reaching its 1.0 version.

3. Extending and expanding community

This improved efficiency is good in itself, but I think it will actually make it easier to extend and expand the community. Why? Because starting new projects and getting releases out faster should make it easier to join, get started and become productive, and thereby lower the threshold for participation. In fact I think that we are going to quickly double and quadruple the number of active contributors quite soon, once everyone realizes the potential for change: how easy it is to expand functionality in a way that lets everyone share the fruits of the labor. Previously the best methods have been to write a blog entry about using a feature, or maybe report a bug; but now it will be trivially easy to start playing with new kinds of reusable extension functionality.

4. Modules are the new core

Given all the benefits of the increased modularity, I am even thinking of further splitting much of the existing "core" (meaning all components under the main Jackson project: core, mapper, xc, jax-rs, mrbean, smile) into modules. All jars except for core and mapper would themselves work as modules (or similar extensions); and many features of the mapper jar could be extracted out. The main reason for doing this would actually be to allow different release cycles: the jax-rs component, for example, has changed relatively little since 1.0, so there is no real need to release a new version of it every time there is a new mapper version. In fact, of the 6 jars, mapper is the only one that is constantly changing; the others have evolved at a much slower pace.

But even if the core components were to stay within the core Jackson project, most new extension functionality will be written as new modules.

Saturday, January 08, 2011

On Perception of Java Verbosity

Today many software developers consider Java to be the modern-day equivalent of Cobol. This is evident from comments comparing the amount of Java code needed for tasks that can be written as one-liners in more dynamic and expressive scripting languages such as Python or Ruby. Funny how time flies -- it wasn't all THAT long ago that Java was seen as a relatively concise language compared to C, due to its built-in support for things like garbage collection and a standard library that contained implementations for a host of things that in C were DIY (note that I did not say "due to simplicity of the language itself").

1. Java verbose?

But while it is true that Java syntax can lead to code much more verbose than seems prudent (especially when traversing and modifying data structures), sometimes its reputation exceeds reality. I was reminded of this by a tweet I came across. The tweet asked "and how many lines would this be in Java", regarding the task of downloading JSON from a URL and parsing the contents to extract data; something that can be done with a single line of Python (or Ruby or Perl). The implied assumption being that it would take many more lines of Java code.

2. Ain't necessarily so

This assumption is not completely baseless: if this were done as part of a service, a typical Java developer might well end up with code that exceeded ten lines; and this even without the code itself being badly written. I will come back to the question of "why" in a minute.

But the assumption is also off base, for the simple reason that it can be a one-liner even in Java; for example:


Response resp = new ObjectMapper().readValue(new URL("http://dot.com/api/?customerId=1234").openStream(),Response.class);
// or if you prefer, bind similarly as "Map<String,Object>"

(and in fact, ".openConnection()" is actually unnecesary, as ObjectMapper can just take URL -- but if it didn't, one can open InputStream directly from URL, which sends request, takes response and so forth).

The code snippet just uses the standard JDK URLConnection (via URL), and a JSON library (Jackson in this case, but it might as well be GSON, Flexjson, whatever); and results in the request being made, and the contents read, parsed and bound to an object of the caller's choosing: either a Plain Old Java Object, or a simple Map.

Given that it IS that simple, why was there an assumption that something more was needed?

3. But often is

The above use case happens to be doable in quite concise form; but there are other tasks where the Java equivalent ends up being either a call to a very specific library tailored to condense usage, or much fluffier than the equivalents in modern scripting languages. But I don't think this is the main reason for the universal perception of Java's bloatedness, i.e. it is not just a case of choosing the wrong example.

I think it is because most Java developers would actually write a piece of code that spanned more than a dozen lines. Why? Either because:

  1. They didn't know the JDK or libraries, and used much more cumbersome methods (the case for less experienced developers), or
  2. They actually understood the complexities of the task, within the context where the task needs to be done.

The first one is easy to understand: if you don't know your tools, you can't expect a good outcome. But the second point needs more explanation.

Let's consider the same task of sending a request to a service that returns a JSON response which we need to return as an object. What possible additional things should we cover, beyond what the one-liner did? Here's a sampling of possible issues:

  • There is no error handling in the code snippet: if there are transient problems with the connection, it will just fail for good, regardless of the type of problem
  • How about problems with the service itself? Requesting an unknown customer? Do we get an HTTP error response, different JSON, or what?
  • Do we really want to wait for an unspecified amount of time if the request can not be made? (TCP will try its damnedest to connect, so if there is an outage it'll be minutes before anything fails)
  • The URL to connect to is fixed (and hard-coded), including the parameters to send; should they really be hard-coded?
  • How is caching handled? What are the connection details?
  • When there are failures, who is notified and how?
  • Are we happy with the default JDK URLConnection? It may not work all that well for some use cases (i.e. shouldn't we be using Apache HttpClient or something?)

To cover such concerns for production systems, one would probably want much more complicated handling: possible retries for transient errors; definitely logging to indicate hard failures; a way to handle error responses and indicate them to the caller. Due to testing, the end points being used are typically dynamically determined and passed in; connection settings may need to be changed, and sometimes different parameters need to be sent. And for production systems we probably need more caching; whereas during testing we may want to disable any and all caching.

Since there are often many more aspects to cover, there is then a tendency to wrap all calls within helper objects or functions; and if we did define something like "fetchJSONDataFromURL()", it surely would end up being more than a dozen lines of code (a sketch follows below). Yet the calling code might still be no longer than a single Java statement.
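
Here is a rough sketch of what such a hypothetical fetchJSONDataFromURL() helper might cover (timeouts, HTTP status checking, a crude retry); the names and policies are made up for illustration.

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class JsonFetcher {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Fetches JSON from the given URL and binds it to the requested type, with crude
    // retry and timeout handling; a real helper would also distinguish retriable from
    // permanent failures, log them, and make endpoints/settings configurable.
    public static <T> T fetchJSONDataFromURL(String url, Class<T> type, int maxAttempts)
            throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000);  // don't wait minutes for a dead host
            conn.setReadTimeout(5000);
            try {
                int status = conn.getResponseCode();
                if (status != HttpURLConnection.HTTP_OK) {
                    throw new IOException("HTTP " + status + " from " + url);
                }
                try (InputStream in = conn.getInputStream()) {
                    return MAPPER.readValue(in, type);
                }
            } catch (IOException e) {
                lastFailure = e;           // remember and retry
            } finally {
                conn.disconnect();
            }
        }
        throw (lastFailure != null) ? lastFailure : new IOException("no attempts made: " + url);
    }
}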

So which one should we focus on? The helper method that is, say, 50 lines long; or the call to use it, which may be a one-liner? The former is what can be used to "prove" how bloated Java code is; yet it is written just once, whereas the one-liners that use it are ideally written much more often.

By the way, the above is not meant to say that it is ALWAYS necessary to handle all kinds of obscure error modes, or to create a perfect system that is as efficient as possible. It clearly is not, and Java developers seem especially prone to over-complicating and over-engineering solutions. But in other cases, the happy-go-lucky approach (which I would claim is more common with "perl scripters") won't do. This is just a long way of saying that the complexity of the code should be based on actual requirements; and that those requirements vary widely.

4. Concise Java by Composition

I think my insight (if any) here is this: since Java, the language, offers relatively little in the way of writing compact code, economical source code must come from proper use of libraries, as well as the design of those libraries. Furthermore, I think many Java developers have started wrongly believing that Java code must be verbose; and this makes the perception more of a self-fulfilling prophecy. Which means that to write compact Java code one absolutely MUST be familiar with the libraries to use for things that the JDK does not support well (or at all).

Wednesday, December 22, 2010

Experiments with advertising, Adsense vs Adbrite, experience so far

It has been a while -- almost 6 months, to be precise -- since I decided to see if there is more to on-line advertising than the venerable Google AdSense. So it is time to see if I have learnt anything.

1. Summary: gain some, lose some

The overall verdict is pretty much inconclusive: I like some aspects (more control, less fluctuation in revenue); but from a strictly monetary viewpoint, the change is a mixed bag. Fortunately the revenue we are talking about is in the "trivial" range -- enough that it does not round down to zero, but not enough to pay for hosting at current rates. So I can freely do whatever I want without risking losing any "real money". I might as well just gain StackOverflow credits.

2. Positive

Overall the main positive aspect for me is the feeling of empowerment: AdBrite gives more control to the publisher, from controlling what to display to defining minimum bids, and even allowing fallbacks (typically to, what else, AdSense!). I like this a lot, and would assume it is a non-trivial competitive advantage as well: whereas for me control is more of a nice-to-have, I know for a fact that bigger "serious" publishers REALLY want to have more control. This is most important for publishers with a valuable brand to take care of.

Another smallish positive thing is that since most advertisements are cost-per-display (aka cost-per-thousand == CPM), and NOT cost-per-click (CPC), the revenue stream is steadier. With AdSense, your revenue typically fluctuates wildly, unless you get lots of direct placements.

3. Negative

On downside, "guaranteed" revenue from CPM is not particularly high. In fact, little CPM that I have seen from AdSense for others sites (not this blog) is typically in the same range as what I can get from AdBrite for majority of views (AB does have wider range of CPMs, based on viewer profiling); and even if readers are often skimpy clickers, whenever there are clicks it is typically worth more than two or three thousand CPM views. So overall it is possible that AdSense might actually pay more, over time (with caveat from above that either way, very little money will change hands :-) ).

4. Other

Oh. One thing I was hoping to see was a wider selection of interesting ads to display; not just the same old, same old. This may or may not be true: I think that the overall selection may be wider (just from looking at all the ads that get displayed, via the publisher management console), but the selection for an individual profile is still rather limited. So I don't know if it's very different from what Google would give. I guess it makes sense, in a way, that the algorithms tend to over-fit ads with (IMO) too little randomness. But personally, I really would like some more variation.

5. Conclusions

It has been a quite interesting ride, if nothing more. Perhaps I should check out other potential choices? Too bad most alternatives seem to be just obnoxious, irritating block-the-whole-page scams, or things that try to take over links and images; things that I would personally hate to see. I have no plans to introduce anything like that. But as usual, I am open to things that fit in well enough; something similar to the AdSense or AdBrite ad systems, I guess.

Monday, December 13, 2010

Amazon Web Service (AWS), WikiLeaks: series of unfortunate events

As a current Amazon Web Services customer (as well as an ex-employee of Amazon), I was sad to see reports of AWS's mishandling of its WikiLeaks hosting.
My main objection is not about whether AWS should host the content or not, and I understand that due to the self-service nature of the offering, termination sometimes needs to occur after the customer relationship has been established. But the way the termination came about was a complete cluster, and it really makes me wonder whether I want to continue using AWS or even recommend it to others.

As far as I understand, the basic facts are that:

  1. WikiLeaks started hosting content with AWS
  2. AWS was contacted by posturing and angry US politician(s) who want to fight WikiLeaks using intimidation tactics ("you are either with us, or you are with... terrorists!"). Sort of like, you know, people who use terror as a weapon to further their agenda.
  3. Shortly afterwards AWS terminated hosting of said content, citing "probable cause for copyright infringement", without any actual request to do so (i.e. pro-actively) -- essentially claiming WikiLeaks was "guilty until proven innocent", without giving them a chance to present any proof.

Now, the way I see it, one of two things happened to effect step 3: either Amazon agreed to do what Lieber"man" et al asked (but lied about not having done so); or Amazon wanted to pro-actively tackle an issue they knew would become problematic (opportunistically), using some suitable weasel-word section of the contract.

What should have happened is simple: AWS should have done nothing before officials presented them with a court order or a valid cease-and-desist letter (or whatever the equivalent is for Patriot Act requests); and if that happened, publicly announce what they did and why. This is what other companies have done (Google, Yahoo). Or, in cases of copyright infringement, a similar demand by the (alleged) copyright holder, accompanied by a court order or whatever the DMCA requires. One would think this would be easy for the government to do for content it has produced itself.

So why did this not happen? Since I have no idea what sort of backchannel communication resulted in what happened, the best I can do is speculate. My two favourite theories are that either someone called in favors, or that some mid-level manager made a panic decision.

Painful, very painful to watch. It's as if someone gave themselves a wedgie just to prevent bullies from doing it...

Tuesday, October 12, 2010

Look back on "prioritizing Open Source projects" (from May 2010)

It has been more than 4 months since I wrote about my experiences with prioritization for Open Source projects, so it seems like a good time to see how things have been moving.

Looks like there are two ways to look at things -- whether the glass is half full or half empty -- as I have pretty much completed 50% of the tasks; but not necessarily in order of priority. And this even though I publicly outlined the priorities.

One positive thing is that the top entry (Java UUID Generator 3.0) was just completed; and the second entry (Woodstox 4.1) is nicely in progress, to be completed within a month or two. On the other hand, the other two completed tasks (both related to Jackson 1.6, which was completed a month ago) were entries listed as having the lowest priority. Some entries not on the list were also completed; specifically work with Async HTTP Client and OAuth signature calculation.

I guess I think this is a reasonable outcome, as priority lists for my "hobby" development are there to help and assist, not to drive specific business goals or to rein in my creativity. So even more important than getting things done in the "right order" is that things do get done. As long as more important or more urgent things are more likely to get worked on than less important or urgent things, overall efficiency remains brutally high, which is the way I like it. Finally, part of the reason for the fluctuating order of execution is that some tasks are more interesting than others; and working on the "most interesting" things tends to maximize the amount of progress (in contrast to working on less interesting but more highly prioritized things).

But to get some closure on this entry, let's consider this a completed 4-month Scrum and create an updated priority list. Here's what it might look like:

  1. Complete Woodstox 4.1 (XML Schema, other user requested features) -- carry-over from the original list
  2. Aalto 1.0: finalize async API, implementation
  3. Jackson 1.7: focus on extensibility (module registration, contextual serializers)
  4. ClassMate (1.0?) -- library for fully resolving generic types; based on Jackson code
  5. External version of Mr Bean (from Jackson 1.6)
  6. StaxMate 2.1? (from the original list)
  7. Tr13 1.0? (from the original list)
  8. ... and then re-consider

This is an incomplete list and I expect a roughly similar completion rate if I were to look back again in 4 months. Maybe I should start doing quarterly project reviews just for fun. :-)

Tuesday, June 29, 2010

Experiments in advertising, here goes nothing (aka Welcome, AdBrite!)

Ok, let's talk about something that is quite visible to you, dear readers, but something that you have probably managed to ignore automatically. Yes, I am talking about those commercial decorations on the margins of these pages. But please, don't change the channel quite yet. :-)

1. Advertising Changes... yay!

So what's up there? After being a very small AdSense publisher for a few years, I figured that I might well retire before ever seeing another check for ads displayed on this blog; so it might be time to explore options: if not to get higher yields, then at least maybe to get more interesting ads. I also generally root for underdogs, and at this point Google is the ultimate uber-dog if there ever was one. So why not partner up with some other advertising puppies.

Given these loose goals, about the only criterion for finding a replacement was that it is not Google. And, well, ideally it should not be Apple, and preferably not Microsoft. But the latter two are negotiable constraints (in fact, I am tempted to check out M$'s PubCenter; if for nothing else, due to its catchy name!).

2. So... ?

But enough background discussion: in the end, I decided to change my ad provider from the big G to an unknown-before-about-a-week-ago company called AdBrite. Mostly because they topped this Handy List of Google Adsense Alternatives. And finally, as of today, I bothered to change the blog templates for the change to take effect.

3. Can hardly contain my excitement <yawn>

At this point I am curious to see what kind of ads they might be pushing to my blog. I sort of wish it was something that lots of people found totally repugnant yet completely fascinating... but the chances for that are probably low. We'll see -- maybe I need to cycle through a variety of ad sales networks before choosing my poison.

4. Commercial Proposal by Author

By the way, if anyone actually wants to advertise here -- buy a section for month-by-month advertising, selling something that actually relates to something I have written about -- let me know. I am open to bids and can show Google Analytics statistics for pricing, so you have a fair idea of what you'd get.
The only limit I will put is that the monthly ad space rental fee has to be a non-zero positive number of full US dollars. :-)
(you can consider that the auction starting price)

Friday, April 09, 2010

Rock on Kohsuke!

Term "Rock start programmer" is thrown around casually when discussing best software developers. But as with music, true stars are few and far between. While knowing the lifestyle can help, you got to have the chops, be able to influence and inspire others, and obviously deliver the goods to fill the stadiums, and data centers.

In the Java enterprise programming world there are few more worthy of being called a rock star than Kohsuke Kawaguchi. The list of projects he has single-handedly built is vast; the list of projects he has contributed to immense; and his coding speed mighty fast (as confirmed by his use of the term POTD, Project of the Day -- very, very few individuals write sizable systems literally in a day!). It all makes you wonder whether he is actually a mere human being at all (maybe he's the twin brother of Jon Skeet?!). For those not in the know, the list of things he has authored or contributed to contains programming pearls such as Multi-Schema Validator, Sun's JAXB (v2) and JAX-WS implementations, Hudson, Maven, Glassfish, Xerces, Args4j, Com4j, and so on and on (for a more complete list, check out his profile at Ohloh; read and weep).

But to the point: it seems that Mr. Kawaguchi is now moving on from the sinking ship formerly known as Sun. This is not a sad thing per se (we all gotta move on at some point), nor unexpected -- the steady stream of Sun people leaving Oracle has been and will be going on for a while -- but it still feels strange. The end of an era, in a way; a gradual shutting down of the Sun brand. The image of a lonely cowboy riding off into the setting Sun (pun intended) comes to mind.

Anyway: rock on Kohsuke, onnea & lycka till! I look forward to seeing exactly what awesomeness you will come up with next!
