Sunday, November 17, 2013

Jackson 2.3.0 released -- quick feature overview

Now that Jackson 2.3.0 is finally finalized and released (official release date: 14th November, 2013), it is time for a quick sampling of new features. Note that this is a very limited sampling -- across all core components and modules, there are close to 100 closed issues; some are fixes, but most are improvements of one kind or another.

So here's my list of 6 notable features.

1. JsonPointer support for Tree Model

One of the most often requested features for Jackson has been the ability to support a path language for traversing JSON. So with 2.3 we chose the simplest standardized alternative, JSON Pointer (version 3), and made the Tree Model (JsonNode) navigable with it.

Usage is simple: for JSON like


{
  "address" : { "street" : "2940 5th Ave", "zip" : 980021 },
  "dimensions" : [ 10.0, 20.0, 15.0 ]
}

you could use expressions like:


JsonNode root = mapper.readTree(src);
int zip = root.at("/address/zip").asInt();
double height = root.at("/dimensions/1").asDouble(); // "1" points to the second number in the array

Also note that you can pre-compile JSON Pointer expressions with "JsonPointer.compile(...)", instead of passing Strings; however, pointer expressions are not particularly expensive to tokenize. JsonPointer instances also have full serialization and deserialization support, so you can conveniently use them as part of configuration data for things like, say, DropWizard Configuration objects.
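A pre-compiled pointer can then be used with the same at() method; a minimal sketch, reusing the root node from above:

JsonPointer zipPointer = JsonPointer.compile("/address/zip");
int zip = root.at(zipPointer).asInt();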

2. @JsonFilter improvements

With earlier versions, it was only possible to define the ids of filters to apply using @JsonFilter on classes. With 2.3.0 you can apply this annotation (as well as @JsonView) on properties as well, to use different filters for different properties of the same type:


public class POJO {
  @JsonFilter("filterA") public Value value1;
  @JsonFilter("filterB") public Value value2;

  // similarly with @JsonView (which was added in 2.2)
  @JsonView(AlternateView.class) public AnotherValue property;
}
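At serialization time, the filter ids still get mapped to actual filter implementations via a FilterProvider; here is a minimal sketch of that wiring (the property names are made up for illustration):

FilterProvider filters = new SimpleFilterProvider()
  .addFilter("filterA", SimpleBeanPropertyFilter.filterOutAllExcept("id", "name"))
  .addFilter("filterB", SimpleBeanPropertyFilter.serializeAllExcept("secret"));
String json = mapper.writer(filters).writeValueAsString(pojo);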

But this is not all! The applicability of JSON Filters has also been expanded, so that in addition to regular POJOs and their properties, they will now apply to:

  1. Any getters (they will get filtered just like regular properties)
  2. Java Maps

and in the future we may also add the ability to filter out List or array elements; this will be possible since the filtering interfaces were extended to allow alternate calls that specify a value index instead of a property name.

3. Uniqueness checks

Although the JSON specification does not mandate enforcing uniqueness of Object property names (and use of the databinder on serialization should prevent generation of duplicates), there are situations where one would want to be extra careful, and have the parser check uniqueness.

With 2.3.0, there are two new features you can turn on to cause an exception to be thrown if duplicate property names are encountered:

  1. DeserializationFeature.FAIL_ON_READING_DUP_TREE_KEY: when reading JSON Trees (JsonNode), and encountering a duplicate name, throw a JsonMappingException
  2. JsonParser.Feature.STRICT_DUPLICATE_DETECTION and JsonGenerator.Feature.STRICT_DUPLICATE_DETECTION: when reading or writing JSON using the Streaming API (either directly, or via data-binding or building/serializing Tree instances), duplicates will be reported via a JsonParseException (or, when writing, a JsonGenerationException)

The main difference (beyond applicability; the first feature only affects building a JsonNode out of JSON input) is that duplicate detection at the Streaming API level incurs some overhead (up to 30-40% more time spent), whereas duplicate detection at the Tree Model level has little if any overhead. The difference comes from the additional storage and checking requirements: the Streaming API does not otherwise need to keep track of the set of property names encountered, whereas the Tree Model has to keep track of properties anyway. As a consequence, tree-level checks are basically close to free to add.
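Enabling the checks is straightforward; a minimal sketch:

ObjectMapper mapper = new ObjectMapper();
// Tree Model level check:
mapper.enable(DeserializationFeature.FAIL_ON_READING_DUP_TREE_KEY);
// Streaming API level checks (affect all reading and writing through this factory):
mapper.getFactory().enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
mapper.getFactory().enable(JsonGenerator.Feature.STRICT_DUPLICATE_DETECTION);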

4. Object Id handling

There are two improvements to Object Identity handling.

First, by enabling SerializationFeature.USE_EQUALITY_FOR_OBJECT_ID you can loosen checks so that all values deemed equal (by calling Object.equals()) are considered the "same object"; basically this allows canonicalization of objects. It mostly makes sense when using ORM libraries, or other data sources that may not use exact object instances; or if you want to reconstruct shared references from sources that do not support them.

Second, when using the YAML data format module, all Object Id references are now written using YAML native constructs called anchors, and references are handled as native references. The same is also true for Type Ids; the YAML module now uses tags for this purpose.
This change should not only make the result more compact and more "YAML-looking", but should also improve interoperability with native YAML tools. The latter should be most useful for a common YAML use case, configuration files, where you can more easily share common configuration blocks via anchors and references.

5. Contextual Attributes for custom (de)serialization

One thing that has been missing so far from both SerializerProvider and DeserializationContext objects (both of which extend the DatabindContext base class) has been the ability to assign and access temporary values during serialization/deserialization. In the absence of something like this, custom JsonSerializer and JsonDeserializer implementations have had to use ThreadLocal to retain such values, adding more complexity.

But not any more: starting with 2.3.0, there is a concept of "databind attributes" (similar to, say, Servlet attributes), managed and accessed using two simple methods -- getAttribute(Object key) and setAttribute(Object key, Object value). These can be used for keeping track of per-call (serialization or deserialization) state, passing data between (de)serializers and so on. All values are cleared when the context is created (for a readValue() or writeValue() call), and likewise at the end, so no explicit clean-up is required.
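For example, a custom serializer could use an attribute as a per-call counter; a minimal sketch (the attribute key and the counting logic are just for illustration):

public class CountingStringSerializer extends JsonSerializer<String> {
  @Override
  public void serialize(String value, JsonGenerator gen, SerializerProvider provider)
      throws IOException {
    // per-call state: automatically reset for each writeValue() call
    Integer count = (Integer) provider.getAttribute("stringCount");
    provider.setAttribute("stringCount", (count == null) ? 1 : count + 1);
    gen.writeString(value);
  }
}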

This may not sound like a huge feature in itself, but it actually opens up interesting possibilities for the future: specifically, it may make sense to add new "standard attributes" that are set by the databind module itself when a specific feature is enabled. For example, perhaps it would make sense to keep track of the POJO currently being serialized/deserialized, to be accessible to the actual (de)serializers.

6. Null handling for serialization

Another relatively small, but often requested feature is the ability to control how Java nulls are serialized, with more granular control than global rules like "all nulls are to be serialized as empty Strings" (which is already possible). This is supported by adding a new property to the @JsonSerialize annotation:


public class POJO {
  @JsonSerialize(nullsUsing=MyNullSerializer.class)
  public Value value;
}

in which case an instance of MyNullSerializer would be used to write the JSON value for property "value" whenever the property value is null.
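MyNullSerializer itself is just a regular JsonSerializer; a minimal sketch that writes an empty String for nulls could look like this:

public class MyNullSerializer extends JsonSerializer<Object> {
  @Override
  public void serialize(Object value, JsonGenerator gen, SerializerProvider provider)
      throws IOException {
    // invoked for null values of the annotated property
    gen.writeString("");
  }
}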

7. Other misc improvements

The JAX-RS module has a new mechanism for fully customizing ObjectReader and ObjectWriter instances, above and beyond what the module itself can do. You can find more details in Issue #33.

The XML module has been improved as well; it is finally possible to properly serialize root-level arrays and java.util.Collection values, and ObjectMapper.convertValue() now works properly.

The CSV module now supports filtering using JSON Views and JSON Filters; this did not work correctly with earlier versions.

Tuesday, September 10, 2013

Java Classmate 0.9 released -- getting ready for 1.0

UPDATE: (October 2013) -- Version 1.0 is now out!

----

Getting close to Java Classmate 1.0: version 0.9 now out!

Java Classmate is a highly specialized library that can be used to properly and completely resolve type information for Java generic types used in field and method signatures. This information is, oddly enough, not easily accessible via the JDK -- if you try to figure out, for example, what the actual return type of method 'IntegerProcessor.foo()' is below:

  public interface MyStringKeyMap<V> extends Map<String,V> { }
  public abstract class Processor<T> {
    public abstract T foo();
  }
  public abstract class IntegerProcessor extends Processor<MyStringKeyMap<Integer>> { }

you are in for quite a ride. As a programmer, you should be able to figure out that it is equivalent to Map<String,Integer>. But dynamically, from your code, how would you figure it out? I won't bother you with all the complications: for more background feel free to read "Why 'java.lang.reflect.Type' does not cut it".
And my suggested solution is Java Classmate, as per "Use Java Classmate for resolving generic signatures".
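With Java Classmate, resolving the return type of 'foo()' looks roughly like this (a minimal sketch using the library's TypeResolver and MemberResolver):

  TypeResolver typeResolver = new TypeResolver();
  MemberResolver memberResolver = new MemberResolver(typeResolver);

  ResolvedType type = typeResolver.resolve(IntegerProcessor.class);
  ResolvedTypeWithMembers members = memberResolver.resolve(type, null, null);
  for (ResolvedMethod m : members.getMemberMethods()) {
    if ("foo".equals(m.getName())) {
      // prints a fully resolved type, equivalent to Map<String,Integer>
      System.out.println(m.getReturnType());
    }
  }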

Last chance to test

The Java Classmate library has been used by quite a few frameworks, such as Hibernate and JBoss, and considered for inclusion by others (Netty uses code adapted from JCM, to avoid adding external dependencies; Jackson uses its "predecessor", the code from which JCM was originally built). There hasn't been much activity for the past 12 months; no new bug reports, and things appear to Just Work, which is great.

So at this point version 0.9 is released, and I am hoping to get some last pieces of feedback before releasing the "official" 1.0 version. Please let me know of any issues; the best way is via the project's GitHub issue tracker.

Tuesday, August 13, 2013

On Jackson 2.2

Here's another thing I need to write about from my "todo bloklog" (blog backlog): an overview of the Jackson 2.2 release.
As usual, official 2.2 release notes are worth checking out for more detailed listing.

1. Overview

Jackson 2.2.0 was released in April, 2013 -- that is, four months ago -- and the latest patch version currently available is 2.2.2. It has proven a nice, stable release, and is currently used by frameworks such as DropWizard (my current favorite Java service platform). This is also the current stable version, as the development for 2.3 is not yet complete.

As opposed to earlier releases (2.0 major, 2.1 minor), which overflowed with new functionality, the focus with 2.2 was to really stabilize functionality and close as many open bugs as possible, especially ones related to new 2.x functionality.
Related to this, we wanted to improve parity, meaning coverage of features across the different parts of the library: that both serialization and deserialization support the same things, and that Map/Node/POJO handling would be as similar as possible.

2. Enhancements to serializer, deserializer processing

One problem area, with respect to writing custom handlers for structured non-POJO types (esp. container types: arrays, Collections, Maps), was that BeanSerializerModifier and BeanDeserializerModifier handlers could only be used for POJO types.

But custom handling is needed for container types too; and especially so when adding support for third-party libraries like Trove, Guava and HPPC. 2.2 extended these interfaces to allow post-processing serializers/deserializers for all types (also including scalar types).

The ability to post-process (de)serializers of all types should reduce the need for writing custom (de)serializers from scratch: it is possible -- for example -- to take the default (de)serializer, and use a post-processor to create a (de)serializer that delegates to the standard version for certain cases, or for certain parts of processing.
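For example, a module could post-process the default Collection serializer roughly along these lines (a minimal sketch; the modifier here just returns the default serializer unchanged, which is where wrapping or delegation would go):

public class MyModule extends SimpleModule {
  @Override
  public void setupModule(SetupContext context) {
    super.setupModule(context);
    context.addBeanSerializerModifier(new BeanSerializerModifier() {
      @Override
      public JsonSerializer<?> modifyCollectionSerializer(SerializationConfig config,
          CollectionType valueType, BeanDescription beanDesc, JsonSerializer<?> serializer) {
        // wrap, replace or delegate to the default Collection serializer here
        return serializer;
      }
    });
  }
}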

3. Converters

The biggest new feature was annotation-based support for things called "Converters". It can be seen as a further extension of the idea that one should be able to refine handling with small(er) components, instead of having to write custom handlers from scratch.

The basic idea is simple: to serialize custom types (that Jackson would not know how to handle correctly) one can write converters that know how to take a custom type and convert it into an intermediate object that Jackson already knows how to serialize. This intermediate form could be simple java.util.Map or JsonNode (tree model), or even just another more traditional POJO.

And for deserialization, do the reverse: let Jackson deserialize JSON into this intermediate type; and call converter to get to the custom type.

Typically you will write one or two converters (one if you only need a converter for serialization or for deserialization; two if both); and then either annotate the type that needs the converter(s), or the property of that type that needs them:


@JsonSerialize(converter=SerializationConverter.class)
@JsonDeserialize(converter=DeserializationConverter.class)
public class Point {
  private int x, y;

  public Point(int x, int y) {
    this.x = x;
    this.y = y;
  }
  public int x() { return x; }
  public int y() { return y; }
}

class SerializationConverter extends StdConverter<Point, int[]> {
  @Override
  public int[] convert(Point p) {
    return new int[] { p.x(), p.y() };
  }
}
// similarly for DeserializationConverter: StdConverter is a convenient base class

This feature was inspired by similar concept in JAXB, and should have been included a long time ago (actually, 2.1 already added internal support for converters; 2.2 just added annotation and connected the dots).
One thing worth noting about the above is that use of StdConverter is strongly recommended; although you may implement Converter directly, there is usually little need. Also note that although the example associates converters directly with the type, you can also add them to a property definition; this can be useful when dealing with third-party types (although you can also use mix-in annotations for those cases).

4. Android

One "unusual" area for improvements was work to try to make Jackson run better on Android platform. Android has its set of quirks; and although Jackson was already working well from functionality perspective, there were obvious performance problems. This was especially true for data-binding, where initial startup overhead has been problematic.

One simple improvement was the elimination of the VERSION.txt file. While it seemed a harmless enough thing for J2SE, Android's package loader has surprising overhead when loading resources from within a jar -- at least on some versions, the contents of the jar are retained in memory, basically DOUBLING the amount of memory needed. 2.2 replaced text-file based version discovery with simple class generation (as part of the build, that is, static source generation).

Version 2.2 also contained a significant amount of internal refactoring, to try to reduce startup overhead, both by simplifying the set-up of (de)serializers and by improving lazy-loading aspects.
One challenge, however, is that we still do not have a good set of benchmarks to actually verify the effects of these changes. So while the intent was to improve startup performance, we do not have solid numbers to report yet.

On plus side, there is some on-going work to do more performance measurements; and I hope to write more about these efforts once related work is made public (it is not yet; I am not driving these efforts, but have helped).

5. JAX-RS: additional caching

Another area of performance improvements was that of JAX-RS provider. Earlier versions did reuse internal `ObjectMapper`, but had to do more per-call annotation processing. 2.2 added simple caching of results of annotation introspection, and should help reduce overhead.

One other important change was structural: before 2.2, there were multiple separate github projects (three; one for JSON, another for Smile, third for XML). With 2.2 we now have a single Github project, jackson-jaxrs-providers, with multiple Maven sub-projects that share code via a base package. This should simplify development, and reduce likelihood of getting cut'n paste errors.

6. AfterBurner becomes Production Ready

One more big milestone concerned Afterburner module (what is it? Check out this earlier entry). With a little help from my friends (special thanks to Steven Schlansker for his significant contributions!), all known issues were addressed and new checks added, such that we can now consider Afterburner production ready.

Given that use of Afterburner can give you a 30-50% boost in throughput when using data-binding, it might be a good time to check it out.

Thursday, August 08, 2013

Brief History of Jackson the JSON processor

(Disclaimer: this article talks about Jackson JSON processor -- not other Jacksons, like American cities or presidents -- those others can be found from Wikipedia)

0. Background

It occurred to me that although it is almost six years since I released the first public version of Jackson, I have not actually written much about events surrounding Jackson development -- I have written about its features, usage, and other important things. But not that much about how it came about.

Since I still remember fairly well how things worked out, and have secondary archives (like this blog, and Maven/SVN/Github repositories) available for fact-checking the timeline, it seems like high time to write a short(ish) historical document on the most popular OSS project I have authored.

1. Beginning: first there was Streaming

Sometime in early 2007, I was working at Amazon.com, and had successfully used XML as the underlying data format for a couple of web services. This was partly due to having written Woodstox, a high-performance Java XML parser. I was actually relatively content with the way things worked with XML, and had learnt to appreciate the benefits of an open, standard, text-based data format (including developer-debuggability, interoperability and -- when done properly -- even simplicity).
But I had also been bitten a few times by XML data-binding solutions like JAXB; and was frustrated both by the complexities of some tools, and by the direction that XML-centric developers were taking, focusing unnecessarily on the format (XML) itself, instead of on how to solve actual development problems.

So when I happened to read about the JSON data format, I immediately saw potential benefits: the main one being that since it was a Data Format -- and not a (Textual) Markup Format (like XML) -- it should be much easier to convert between JSON and (Java) objects. And if that was simpler, perhaps tools could actually do more; offer more intuitive and powerful functionality, instead of fighting with complex monsters like XML Schema or (heaven forbid) leading devs to XSLT.
Other features of JSON that were claimed as benefits, like slightly more compact size (marginally so) or better readability (subjective), I didn't really consider particularly impressive.
Beyond appreciating the good fit of JSON for the web service use case, I figured that writing a simple streaming tokenizer and generator should be easy: after all, I had spent lots of time writing the low-level components necessary for tokenizing content (I started writing Woodstox in late 2003, around the time the Stax API was finalized).

Turns out I was right: I got a streaming parser working in about two weeks (and the generator in less than a week). In a month I had things working well enough that the library could be used for something. And then it was ready to be released ("release early, release often"); and the rest is history, as they say.

Another reason for writing Jackson, which I have occasionally mentioned, was what I saw as a sorry state of JSON tools -- my personal pet peeve was use of org.json's reference implementation. While it was fine as a proof-of-concept, I consider(ed) it a toy library, too simplistic, underpowered thing for "real" work. Other alternatives just seemed to short-change one aspect or another: I was especially surprised to find total lack of modularity (streaming vs higher levels) and scant support for true data-binding -- solutions tended to either assume unusual conventions or require lots of seemingly unnecessary code to be written. If I am to write code, I'd rather do it via efficient streaming interface; or if not, get a powerful and convenient data-binding. Not a half-assed XML-influenced tree model, which was en vogue (and sadly, often still is).

And the last thing regarding ancient history: the name. I actually do not remember the story behind it -- obviously it is a play on JSON. I vaguely recall toying with the idea of calling the library "Jason", but deciding that might sound too creepy (I knew a few Jasons, and didn't want confusion). Compared to Woodstox -- where I actually remember that my friend Kirk P gave me the idea (related to Snoopy's friend, a bird named Woodstock!) -- I don't really know whom to credit for the idea, or the inspiration for it.

2. With a FAST Streaming library...

Having written (and quickly published, in August 2007) the streaming-only version of Jackson, I spent some time optimizing and measuring things, as well as writing some code to see how convenient the library is to use. But my initial thinking was to wrap things up relatively soon, and "let Someone Else write the Important Pieces". And by "important pieces" I mostly meant a data-binding layer; something like what JAXB and XMLBeans are to XML Streaming components (SAX/Stax).

The main reasons for my hesitation were two-fold: I thought that

  1. writing a data-binding library will be lots of work, even if JSON lends itself much more easily to doing that; and
  2. to do binding efficiently, I would have to use code-generation; Reflection API was "known" to be unbearably slow

Turns out that I was 50% right: data-binding has consumed vast majority of time I have spent with Jackson. But I was largely wrong with respect to Reflection. But more on that in a bit.

In the short term (during summer and autumn of 2008) I did write "simple" data-binding, to bind Java Lists and Maps to/from token streams; and I also wrote a simple Tree Model, the latter of which has been rewritten since then.

3. ... but No One Built It, So I did

Jackson the library did get a relatively high level of publicity from early on. This was mostly due to my earlier work on Woodstox, and its adoption by all major second-generation Java SOAP stacks (CXF nee XFire; Axis 2). Given my reputation for producing fast parsers and generators, there was interest in using what I had written for JSON. But early adopters used things as-is; and no one (to my knowledge) tried to build the higher-level abstractions that I eagerly wanted to be written.

But that alone might not have been enough to push me to try my luck writing data-binding. What was needed was a development that made me irritated enough to dive in deep... and sure enough, something did emerge.

So what was the trigger? It was the idea of using XML APIs to process JSON (that is, using adapters to expose JSON content as if it were XML). While most developers who wrote such tools considered them a stop-gap solution to ease transition, many developers did not seem to know this.
I thought (and still think) that this is an OBVIOUSLY bad idea; and initially did not spend much time refuting the merits of the idea -- why bother, as anyone should see the problem? I assumed that any sane Java developer would obviously see that the "Format Impedance" -- the difference between JSON's Object (or Frame) structure and XML's Hierarchic model -- is a major obstacle, and would render use of JSON even MORE CUMBERSOME than using XML.

And yet I saw people suggesting use of tools like Jettison (JSON via Stax API), even integrating this into otherwise good frameworks (JAX-RS like Jersey). Madness!

Given that developers appeared intent on ruining the good thing, I figured I needed to show the Better Way; just talking about it would not be enough.
So, late in 2008, around the time I moved on from Amazon, I started working on a first-class Java/JSON data-binding solution. This can be thought of as the "real" start of Jackson as we know it today; a bit over one year after the first release.

4. Start data-binding by writing Serialization side

The first Jackson version to contain real data-binding was 0.9.5, released in December 2008. Realizing that this was going to be a big undertaking, I first focused on the simpler problem of serializing POJOs as JSON (that is, taking values of Java objects and writing equivalent JSON output).
Also, to make it likely that I would actually complete the task, I decided to simply use Reflection "at first"; performance should really matter only once the thing actually works. Besides, this way I would have some idea as to the magnitude of the overhead: having written a fair bit of manual JSON handling code, it would be easy to compare the performance of hand-written and fully automated data-binding.

I think the serializer took about a month to get working to some degree, and a week or two to weed out bugs. The biggest surprise to me was that the Reflection overhead actually was NOT all that big -- it seemed to add maybe 30-40% time, some of which might be due to overhead beside Reflection access (Reflection is just used for dynamically calling get-methods or accessing field values). This was such a non-issue for the longest time that it took multiple years for me to go back to the idea of generating accessor code (for the curious, the Afterburner Module is the extension that finally does this).

My decision to start with serialization (without considering the other direction, deserialization) was a good one for the project, I believe, but it did have one longer-term downside: much of the code between the two parts was disjoint. Partly this was due to my then-view that there are many use cases where only one side is needed -- for example, a Java service only ever writing JSON output, but not necessarily reading it (simple query parameters and URL path go a long way). But a big part was that I did not want to slow down writing of serialization by having to also consider challenges in deserialization.
And finally, I had some bad memories from JAXB, where the requirement to have both getters AND setters was occasionally a pain-in-the-buttocks for write-only use cases. I did not want to repeat the mistakes of others.

Perhaps the biggest practical result of the almost complete isolation between the serialization and deserialization sides was that sometimes annotations needed to be added in multiple places; like indicating for both setter and getter what the JSON property name should be. Over time I realized that this was not a good thing; but the problem itself was only resolved in Jackson 1.9, much later.

5. And wrap it up with Deserialization

After the serialization (and resulting 0.9.5) release, I continued work with deserialization, and perhaps surprisingly finished it slightly faster than serialization. Or perhaps it is not that surprising; even without working on deserialization concepts earlier, I had nonetheless tackled many of the issues I would need to solve, including using Reflection efficiently and conveniently, and resolving generic types (which is a hideously tricky problem in Java, as readers of my blog should know by now).

Result of this was 0.9.6 release in January 2009.

6. And then on to Writing Documentation

After managing to get the first fully functional version of data-binding available, I realized that the next blocker would be lack of documentation. So far I had blogged occasionally about Jackson usage; but for the most part I had relied on the resourcefulness of the early adopters, those hard-working hardy pioneers of development. But if Jackson was to become the King of JSON on the Java platform, I would need to do more for its users.

Looking at my blog archive I can see that some of the most important and most-read articles on the site are from January 2009. Beyond the obvious introductions to various operating modes (like "Method 2, Data Binding"), I am especially proud of "There are Three Ways to Process Json!" -- an article that I think is still relevant. And something I wish every Java JSON developer would read, even if they didn't necessarily agree with all of it. I am surprised how many developers blindly assume that one particular view -- often the Tree Model -- is the only mode in existence.

7. Trailblazing: finally getting to add Advanced Features

Up until version 1.0 (released May 2009), I don't consider my work to have been particularly new or innovative: I was using good ideas from past implementations and my experience in building better parsers, generators, tree models and data binders. I felt Jackson was ahead of the competition in both the XML and JSON spaces; but perhaps the only truly advanced thing was generic type resolution, and even there I had more to learn yet (eventually I wrote Java ClassMate, which I consider the first Java library to actually get generic type resolution right -- more so than Jackson itself).

This lack of truly new, advanced (from my point of view) features was mostly because there was so much to do: all the foundational code, implementing all the basic and intermediate things that were (or should have been) expected from a Java data-binding library. I did have ideas, but in many cases had postponed them until I felt I had time to spare on "nice-to-have" things, or features that were more speculative and might not even work, either functionally or with respect to developers finding them useful.

So at this point, I figured I would have the luxury of aiming higher; not just making a slightly Better Mousetrap, but something that is... Something Else altogether. And with the following 1.x versions, I started implementing things that I consider somewhat advanced, pushing the envelope a bit. I could talk or write for hours on various features; what follows is just a sampling. For a slightly longer take, read my earlier "7 Killer Features of Jackson".

7.1 Support for JAXB annotations

With Jackson 1.1, I also started considering interoperability. And although I thought that compatibility with XML is a Bad Idea when done at the API level, I thought that certain aspects could be useful: specifically, the ability to use (a subset of) JAXB annotations for customizing data-binding.

Since I did not think that JAXB annotations alone could suffice to cover all configuration needs, I had to figure out a way for JAXB and Jackson annotations to co-exist. The result is the concept of the "Annotation Introspector", and it is something I am actually proud of: even if supporting JAXB annotations has been lots of work, and caused various frustrations (mostly as JAXB is XML-specific, and some concepts do not translate well), I think the mechanism used for isolating annotation access from the rest of the code has worked very well. It is one area that I managed to design right the first time.

It is also worth mentioning that beyond the ability to use alternative "annotation sets", Jackson's annotation handling has always been relatively advanced: for example, whereas standard JDK annotation handling does not support overriding (that is, annotations are not "inherited" from overridden methods), Jackson supports inheritance of Class, Method and even Constructor annotations. This has proven to be a good decision, even if implementing it for 1.0 was lots of work.

7.2 Mix-in annotations

One of the challenges with Java Annotations is the fact that one has to be able to modify the classes that are annotated. Beyond requiring actual access to sources, this can also add unnecessary and unwanted dependencies from value classes to annotations; and in the case of Jackson, these dependencies are in the wrong direction from a design perspective.

But what if one could just loosely associate annotations, instead of having to forcibly add them to classes? This was the thought exercise I had; and it led to what I think was the first Java implementation of "mix-in annotations". I am happy that 4 years since their introduction (they were added in Jackson 1.2), mix-in annotations are one of the most loved Jackson features; and something that I still consider innovative.
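The idea in code form, as a minimal sketch (ThirdPartyPoint and PointMixIn are made-up names; addMixInAnnotations is the registration method in the 2.x line at the time of writing):

// a third-party class we cannot (or do not want to) modify
public class ThirdPartyPoint {
  private final int x, y;
  public ThirdPartyPoint(int x, int y) { this.x = x; this.y = y; }
  public int getX() { return x; }
  public int getY() { return y; }
}

// mix-in: Jackson applies these annotations as if they were declared on ThirdPartyPoint
abstract class PointMixIn {
  @JsonProperty("xCoord") abstract int getX();
}

ObjectMapper mapper = new ObjectMapper();
mapper.addMixInAnnotations(ThirdPartyPoint.class, PointMixIn.class);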

7.3 Polymorphic type support

One feature that I was hoping to avoid having to implement (kind of similar, in that sense, to data-binding itself) was support for one of the core Object Serialization concepts (though not necessarily a data-binding concept; data is not polymorphic, classes are): type metadata.
What I mean here is that, given a single static (declared) type, one will still be able to deserialize instances of multiple types. The challenge is that when serializing there is no problem -- the type is available from the instance being serialized -- but to deserialize properly, additional information is needed.

There are multiple problems in trying to support this with JSON: starting with the obvious problem of JSON not having a separation of data and metadata (with XML, for example, it is easy to "hide" metadata as attributes). But beyond this question, there are various alternatives for type identifiers (logical name or physical Java class?), as well as alternative inclusion mechanisms (an additional property? With what name? Or a wrapper Array or Object?).

I spent lots of time trying to figure out a system that would satisfy all the constraints I put; keep things easy to use, simple, and yet powerful and configurable enough.
It took multiple months to figure it all out; but in the end I was satisfied with my design. Polymorphic type handling was included in Jackson 1.5, less than one year after the release of 1.0. And still most Java JSON libraries have no support at all for polymorphic types, or at most support fixed use of the Java class name -- I know how much work it can be, but at least one could learn from existing implementations (which is more than I had).
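In annotation form, the design choices mentioned above (logical name vs. Java class, extra property vs. wrapper) surface as settings like these; a minimal sketch (Animal, Cat and Dog are made-up types):

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "type")
@JsonSubTypes({
  @JsonSubTypes.Type(value = Cat.class, name = "cat"),
  @JsonSubTypes.Type(value = Dog.class, name = "dog")
})
public abstract class Animal {
  public String name;
}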

7.4 No more monkey code -- Mr Bean can implement your classes

Of all the advanced features Jackson offers, this is my own personal favorite: and something I had actually hoped to tackle even before 1.0 release.

For the full description, go ahead and read "Mr Bean aka Abstract Type Materialization"; but the basic idea is, once again, simple: why is it that even if you can define your data type as a simple interface, you still need to write monkey code around it? Other languages have solutions there; and some later Java frameworks like Lombok have presented alternatives. But I am still not aware of a general-purpose Java library for doing what Mr Bean does (NOTE: you CAN actually use Mr Bean outside of Jackson too!).

Mr Bean was included in Jackson 1.6 -- which was a release FULL of good, innovative new stuff. The reason it took such a long time for me to build it was hesitation -- it was the first time I used Java bytecode generation. But after starting to write the code I learnt that it was surprisingly easy to do; and I just wished I had started earlier.
Part of the simplicity was due to the fact that literally the only thing to generate was accessors (setters and/or getters): everything else is handled by Jackson, by introspecting the resulting class, without having to even know there is anything special about the dynamically generated implementation class.
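In current terms, usage looks roughly like this (a minimal sketch; the Point interface is made up, and MrBeanModule comes from the jackson-module-mrbean artifact):

// no hand-written implementation class needed; Mr Bean materializes one
public interface Point {
  int getX();
  int getY();
}

ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new MrBeanModule());
Point p = mapper.readValue("{\"x\":1,\"y\":2}", Point.class);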

7.5 Binary JSON (Smile format)

Another important milestone with Jackson 1.6 was introduction of a (then-) new binary data format called Smile.

Smile was borne out of my frustration with all the hype surrounding Google's protobuf format: there was tons of hyperbole caused by the fact that Google was opening up the data format they were using internally. Protobuf itself is a simple and very reasonable binary data format, suitable for encoding datagrams used for RPC. I call it "best of 80s datagram technology"; not as an insult, but as a nod to maturity of the idea -- it is automating things that back in 80s (and perhaps earlier) were hand-coded whenever data communication was needed. Nothing wrong in there.

But my frustration had more to do with creeping aspects of pre-mature optimization; and the myopic view that binary formats were the only way to achieve acceptable performance for high-volume communication. I maintain that this is not true for general case.

At the same time, there are valid benefits from proper use of efficient binary encodings. And one approach that seemed attractive to me was that of using an alternative physical encoding to represent an existing logical data model. This idea is hardly new; it had been demonstrated with XML, with BNUX, Fast Infoset and other approaches (all of which predate the later sad effort known as EXI). But so far this had not been tried with JSON -- sure, there is BSON, but it is not 1-to-1 mappable to JSON (despite what its name suggests), it is just another odd (and very verbose) binary format.
So I thought that I should be able to come up with a decent binary serialization format for JSON.

Timing for this effort was rather good, as I had joined Ning earlier that year, and had an actual use case for Smile. At Ning, Smile was used for some high-volume systems, such as log aggregation (think of systems like Kafka or Splunk). Smile turns out to work particularly well when coupled with ultra-fast compression like LZF (implemented at and for Ning as well!).

And beyond Ning, I had the fortune of working with creative genius(es) behind ElasticSearch; this was a match made in heaven, as they were just looking for an efficient binary format to complement their use of JSON as external data format.

And what about the name? I think I need to credit Mr. Sunny Gleason for this; we brainstormed the idea, and it came about when we considered what "magic cookie" (the first 4 bytes used to identify the format) to use -- using a smiley seemed like a crazy enough idea to work. So Smile-encoded data literally "Starts With a Smile!" (check it out!)

7.6 Modularity via Jackson Modules

One more major area of innovation in the Jackson 1.x series was the introduction of the "Module" concept in Jackson 1.7. From a design/architectural perspective, it is the most important change made during Jackson's development.

The background to modules was my realization that I neither can nor want to be the person trying to provide Jackson support for all useful Java libraries; for datatypes like Joda, or Collection types of Guava. But neither should users be left on their own, to have to write handlers for things that do not (and often, can not) work out of the box.

But if not me or the users, who would do it? The answer of "someone else" does not sound great, until you actually think about it a bit. While I think the ideal case is that the library maintainers (of Joda, Guava, etc.) would do it, the most likely case is that "someone with an itch" -- a developer who happens to need JSON serialization of, say, Joda datetime types -- is the person who can add this support. The challenge, then, is that of co-operation: how could this work be turned into something reusable and modular... something that could essentially be released as a "mini-library" of its own?

This is where the simple interface known as Module comes in: it is just a way to package the necessary implementations of Jackson handlers (serializers, deserializers, and other components they rely on for interfacing with Jackson), and to register them with Jackson, without Jackson having any a priori knowledge of the extension in question. You can think of them as the Jackson equivalent of plug-ins.
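For simple cases there is a ready-made SimpleModule base class; a minimal sketch of a datatype module (MyType and its handlers are made-up names):

SimpleModule module = new SimpleModule("MyTypeModule");
module.addSerializer(MyType.class, new MyTypeSerializer());
module.addDeserializer(MyType.class, new MyTypeDeserializer());

ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(module);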

8. Jackson 2.x

Although there were more 1.x releases after 1.6, all introducing important and interesting new features, the focus during those releases started to move towards bigger development challenges. It was also becoming challenging to keep things backwards-compatible, as some earlier API design (and occasionally implementation) decisions proved to be sub-optimal. With this in mind, I started thinking about the possibility of making a bigger change: a major, somewhat backwards-incompatible release.

The idea of 2.0 started maturing at around time of releasing Jackson 1.8; and so version 1.9 was designed with upcoming "bigger change" in mind. It turns out that future-proofing is hard, and I don't know how much all the planning helped. But I am glad that I thought through multiple possible scenarios regarding potential ways versioning could be handled.

The most important decision -- and one I think I did get right -- was to change the Java and Maven packages Jackson 2.x uses: it should be (and is!) possible to have both Jackson 1.x and Jackson 2.x implementations on the classpath, without conflicts. I have to thank my friend Brian McCallister for this insight -- he convinced me that this is the only sane way to go. And he is right. The alternative of just using the same package name is akin to playing Russian Roulette: things MIGHT work, or might not. But you are actually playing with other people's code; and they can't really be sure whether it will work for them without trying... and often find out too late if it doesn't.

So although it is more work all around for cases where things would have worked; it is definitely much, much less work and pain for cases where you would have had problems with backwards compatibility. In fact, amount of work is quite constant; and most changes are mechanical.

Jackson 2.0 took its time to complete; and was released February 2012.

9. Jackson goes XML, CSV, YAML... and more

One of biggest changes with Jackson 2.x has been the huge increase in number of Modules. Many of these handle specific datatype libraries, which is the original use case. Some modules implement new functionality; Mr Bean, for example, which was introduced in 1.6 was re-packaged as a Module in later releases.

But one of those Crazy Ideas ("what if...") that I had somewhere during 1.x development was to consider the possibility of supporting data formats other than JSON.
It started with the obvious question of how to support the Smile format; but that was relatively trivial (although it did need some changes to the underlying system, to reduce deep coupling with physical JSON content). Adding Smile support led me to realize that the only JSON-specific handling occurs at the streaming API level: everything above this level only deals with Token Streams. So what if... we simply implemented alternative backends that can produce/consume token streams? Wouldn't this allow data-binding to be used with data formats like YAML, BSON and perhaps even XML?

Turns out it can, indeed -- and at this point, Jackson supports half a dozen data formats beyond JSON (see here); and more will be added over time.

10. What Next?

As of writing this entry I am working on Jackson 2.3; and the list of possible things to work on is as long as ever. Once upon a time (around finalizing 1.0) I was under the false impression that maybe I would be able to wrap up work in a release or two, and move on. But given how many feature-laden versions I have released since then, I no longer think that Jackson will be "complete" any time soon.

I hope to write more about Jackson's future... in the (near, I hope) future. I hope the above gave you more perspective on "where has Jackson been?"; and perhaps hints at where it is going as well.

Saturday, August 03, 2013

Jackson 2.1 was released... quite a while ago :)

Ok, so I have not been an active blogger for a while. Like, since about a year ago. I am hoping to catch up a bit, so let's start with intermediate Jackson releases that have gone out the door since I last wrote about Jackson.

1. Jackson 2.1

Version 2.1 was released almost a year ago, in October 2012. After the big bang of the 2.0 release -- what with all the crazy new features like Object Id handling (for fully cyclic object graphs) -- 2.1 was expected to be a more minor release in every way.

But, that was not to be... instead, 2.1 packed an impressive set of improvements of its own.
The focus was on general usability: improved ergonomics, a bit of performance improvement (for data-binding), and the usual array of bug fixes that required bigger changes in internals (and occasionally additional API) than what can be done in a patch release.

For more complete handling of what exactly was added, you can check out my Jackson 2.1 Overview presentation I gave at Wordnik (thanks Tony and folks!). Note that links to this and other presentations can be found from Jackson Docs github repo.
For full list of changes, check 2.1 Release Notes.

But here's a Reader's Digest version.

2. Shape-shifting

The @JsonFormat annotation was added in Jackson 2.0, but was not used by many datatypes. With 2.1, there are interesting (and back then experimental; much more stable now!) new features to let you change the "shape" (JSON structure) of some common Java datatypes:

  • Serialize Enums as JSON Objects instead of Strings: useful for serialization, but can not deserialize back (how would that work? Enums are singletons)
  • Collections (Sets, Lists) as JSON Objects (instead of arrays): useful for custom Collections that add extra properties -- can also deserialize, with proper use of @JsonCreator annotations (or custom deserializer)
  • POJOs as Arrays! Instead of having name/value pairs, you will get JSON arrays where position indicates which property is being used (make sure to use @JsonPropertyOrder annotation to define ordering!)

Of these, the last option is probably the most interesting. It can make JSON as compact as CSV; and in fact can compete with binary formats in many cases, especially if values are mostly Strings.
A simple example would be:

  @JsonFormat(shape=JsonFormat.Shape.ARRAY)
  @JsonPropertyOrder(alphabetic=true)
  public class Point {
    public int x, y;
  }

which, when serialized, could look like:

  [ 1, 2]

instead of earlier

  { "x":1, "y":2 }

and obviously works for reading as well (that is, you can read such tabular data back).

3. Chunked (partial) Binary Data reads, writes

When dealing with really large data, the granularity of JsonParser and JsonGenerator works well, except in the case of long JSON Strings; for example, ones that contain Base64-encoded data. Since these values may be very large, and since they are quite often just stored on disk (or read from disk to send) -- and there is no benefit from keeping the whole value in memory at all -- it makes sense to offer some way to allow streaming within individual values, not just between values.

To this end, JsonParser and JsonGenerator now have methods that allow one to read and write large binary data chunks without retaining more than a limited amount of data in memory (one buffer-full, like 8 or 16 kB) at any given point. Access is provided via java.io.InputStream and java.io.OutputStream, with methods:

JsonParser.readBinaryValue(OutputStream)
JsonGenerator.writeBinary(InputStream, int expectedLength)

Note that while the direction of the arguments may look odd, it actually makes sense when you try using it: you provide a handler for content read (which implements OutputStream), and a source for content to write (an InputStream).
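As an illustration, streaming a large Base64 property straight into a file might look roughly like this (a minimal sketch; the "data" property name and file names are made up):

try (JsonParser p = mapper.getFactory().createParser(new File("input.json"));
     OutputStream out = new FileOutputStream(new File("blob.bin"))) {
  while (p.nextToken() != null) {
    // when positioned at the Base64 String value of the "data" property, stream it out
    if (p.getCurrentToken() == JsonToken.VALUE_STRING && "data".equals(p.getCurrentName())) {
      p.readBinaryValue(out);
    }
  }
}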

4. Format auto-detection support for data-binding

Another innovative new feature is the ability to use the existing data format auto-detection without having to use the Streaming API directly. Earlier versions included support for JsonParser auto-detecting the type of input, for data formats that support this (some binary formats do not; I consider this a flaw in such formats; of text formats, CSV does not): at least JSON, XML, Smile and YAML support auto-detection.

You enable support through ObjectReader for example like so:

  ObjectMapper mapper = new ObjectMapper();
  XmlMapper xmlMapper = new XmlMapper(); // XML is special: must start with its own mapper
  ObjectReader reader = mapper
    .reader(User.class) // for reading instances of User
    .withFormatDetection(new JsonFactory(), xmlMapper.getFactory(), new SmileFactory());

and then use the resulting reader normally:

  User user = reader.readValue(new File("input.raw"));

and input that is in XML, JSON or Smile format will be properly decoded, and bound to the resulting class. I personally use this to support transparent usage of the Smile (binary JSON) format as a pluggable optimization over JSON.

5. Much improved XML module

Although XML module has existed since earlier 1.x versions, 2.0 provided first solid version. But it did not include support for one commonly used JAXB feature: ability to use so-called "unwrapped" Lists. 2.1 fixes this and fully supports both wrapped and unwrapped Lists.

But beyond this feature, testing was significantly extended, and a few specific bugs were fixed. As a result version 2.1 is the first version that I can fully recommend as replacement for JAXB processing in production environments.

6. Delegating serializer, deserializer

Final new feature is support for so-called delegating serializers and deserializers. The basic idea is simple: instead of having to build fully custom handlers you only need to implement converters that can convert your custom types into something that Jackson can automatically handle (supports out of the box).

Details of this are included in 2.1 presentation; most commonly you will just extend com.fasterxml.jackson.databind.deser.std.StdDelegatingDeserializer and com.fasterxml.jackson.databind.ser.std.StdDelegatingSerializer.

Thursday, May 24, 2012

Doing actual non-blocking, incremental HTTP access with async-http-client

The Async-http-client library, originally developed at Ning (by Jean-Francois, Tom, Brian and maybe others, and since then by quite a few more), has been around for a while now.
Its main selling point is the claim of better scalability compared to alternatives like Jakarta HTTP Client (this is not its only selling point: its API also seems more intuitive).

But although the library itself is capable of working well in non-blocking mode, most examples (and probably most users) use it in plain old blocking mode; or at most use a Future to simply defer handling of responses, but without handling content incrementally as it becomes available.

While this lack of documentation is a bit unfortunate in itself, the bigger problem is that most usage, as shown by the sample code, requires reading the whole response into memory.
This may not be a big deal for small responses, but in cases where the response size is in megabytes, it often becomes problematic.

1. Blocking, fully in-memory usage

The usual (and potentially problematic) usage pattern is something like:

  AsyncHttpClient asyncHttpClient = new AsyncHttpClient();
  Future<Response> f = asyncHttpClient.prepareGet("http://www.ning.com/").execute();
  Response r = f.get();
  byte[] contents = r.getResponseBodyAsBytes();

which gets the whole response as a byte array; no surprises there.

2. Use InputStream to avoid buffering the whole entity?

The first obvious work around attempt is to have a look at Response object, and notice that there is method "getResponseBodyAsStream()". This would seemingly allow one to read response, piece by piece, and process it incrementally, by (for example) writing it to a file.

Unfortunately, this method is just a facade, implemented like so:

 public InputStream getResponseBodyAsStream() {
  return new ByteArrayInputStream(getResponseBodyAsBytes());
 }

which actually is no more efficient than accessing the whole content as a byte array. :-/

(why is it implemented that way? Mostly because underlying non-blocking I/O library, like Netty or Grizzly, provides content using "push" style interface, which makes it very hard to support "pull" style abstractions like java.io.InputStream -- so it is not really AHC's fault, but rather a consequence of NIO/async style of I/O processing)

3. Go fully async

So what can we do to actually process large response payloads (or large PUT/POST request payloads, for that matter)?

To do that, it is necessary to use following callback abstractions:

  1. To handle response payloads (for HTTP GETs), we need to implement the AsyncHandler interface.
  2. To handle PUT/POST request payloads, we need to implement BodyGenerator (which is used for creating a Body instance, an abstraction for feeding content)

Let's have a look at what is needed for the first case.

(note: there are existing default implementations for some of the pieces -- but here I will show how to do it from ground up)

4. A simple download-a-file example

Let's start with the simple case of downloading large content into a file, without keeping more than a small chunk in memory at any given time. This can be done as follows:


public class SimpleFileHandler implements AsyncHandler<File>
{
 private File file;
 private final FileOutputStream out;
 private boolean failed = false;

 public SimpleFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   part.writeTo(out);
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }
}

Voila. Code is not very brief (event-based code seldom is), and it could use some more handling for error cases.
But it should at least show the general processing flow -- nothing very complicated there, beyond basic state machine style operation.

5. Booooring. Anything more complicated?

Downloading a large file is useful, but while not a contrived example, it is rather plain. So let's consider the case where we not only want to download a piece of content, but also want to uncompress it, in one fell swoop. This serves as an example of additional processing we may want to do in incremental/streaming fashion -- as an alternative to having to store an intermediate copy in a file, then uncompress that to another file.

But before showing the code, it is necessary to explain why this is a bit tricky.

First, remember that we can't really use InputStream-based processing here: all content we get is "pushed" to us (without our code ever blocking on input); whereas an InputStream would require us to pull content from it, possibly blocking the thread.

Second: most decompressors present either InputStream-based abstraction, or uncompress-the-whole-thing interface: neither works for us, since we are getting incremental chunks; so to use either, we would first have to buffer the whole content. Which is what we are trying to avoid.

As luck would have it, however, Ning Compress package (version 0.9.4, specifically) just happens to have a push-style uncompressor interface (aptly named as "com.ning.compress.Uncompressor"); and two implementations:

  1. com.ning.compress.lzf.LZFUncompressor
  2. com.ning.compress.gzip.GZIPUncompressor (uses JDK native zlib under the hood)

So why is that fortunate? Because interface they expose is push style:

 public abstract class Uncompressor
 {
  public abstract void feedCompressedData(byte[] comp, int offset, int len) throws IOException;
  public abstract void complete() throws IOException;
 }

and is thereby usable for our needs here. Especially when we use an additional class called "UncompressorOutputStream", which makes an OutputStream out of an Uncompressor and a target stream (which is needed for efficient access to the content AHC exposes via HttpResponseBodyPart).

6. Show me the code

Here goes:


public class UncompressingFileHandler implements AsyncHandler<File>,
   DataHandler // for Uncompressor
{
 private File file;
 private final OutputStream out;
 private boolean failed = false;
 private UncompressorOutputStream uncompressingStream;

 public UncompressingFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   // if compressed, pass through uncompressing stream
   if (uncompressingStream != null) {
    part.writeTo(uncompressingStream);
   } else { // otherwise write directly
    part.writeTo(out);
   }
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  // close the uncompressing stream first, so it can flush remaining data to 'out'
  if (uncompressingStream != null) {
   uncompressingStream.close();
  }
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // must verify that we are getting compressed stuff here:
  String compression = h.getHeaders().getFirstValue("Content-Encoding");
  if (compression != null) {
   if ("lzf".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new LZFUncompressor(this));
   } else if ("gzip".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new GZIPUncompressor(this));
   }
  }
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }

 // DataHandler implementation for Uncompressor; called with uncompressed content:
 public void handleData(byte[] buffer, int offset, int len) throws IOException {
  out.write(buffer, offset, len);
 }
}

Handling gets a bit more complicated here, since we have to handle both the case where content is compressed and the case where it is not (since the server is ultimately responsible for deciding whether to apply compression).

And to make the call, you also need to indicate the capability to accept compressed data. For example, we could define a helper method like:


public File download(String url) throws Exception
{
 AsyncHttpClient ahc = new AsyncHttpClient();
 Request req = ahc.prepareGet(url)
  .addHeader("Accept-Encoding", "lzf,gzip")
  .build();
 ListenableFuture<File> futurama = ahc.executeRequest(req,
   new UncompressingFileHandler(new File("download.txt")));
 try {
  // wait for 30 seconds to complete
  return futurama.get(30, TimeUnit.SECONDS);
 } catch (TimeoutException e) {
  throw new IOException("Failed to download due to timeout");
 }
}

which would use the handler defined above.

7. Easy enough?

I hope the above shows that while doing incremental, "streaming" processing is a bit more work, it is not super difficult to do.

Not even when you have a bit of pipelining to do, like uncompressing (or compressing) data on the fly.

Thursday, May 03, 2012

Jackson Data-binding: Did I mention it can do YAML as well?

Note: as useful earlier articles, consider reading "Jackson 2.0: CSV-compatible as well" and "Jackson 2.0: now with XML, too!"

1. Inspiration

Before jumping into the actual beef -- the new module -- I want to mention my inspiration for this extension: the Greatest New Thing to hit Java World Since JAX-RS called DropWizard.

For those who have not yet tried it out and are unaware of its Kung-Fu Panda like Awesomeness, please go and check it out. You won't be disappointed.

DropWizard is a sort of mini-framework that combines great Java libraries (I may be biased, as it does use Jackson), starting with the trusty JAX-RS/Jetty8 combination, building on it with Jackson for JSON, jDBI for DB/JDBC/SQL, the Java Validation API (implementation from the Hibernate project) for data validation, and logback for logging; adding a bit of Jersey-client for client-building and an optional FreeMarker plug-in for UI; all bundled up in a nice, modular and easily understandable package.
Most importantly, it "Just Works" and comes with an intuitive configuration and bootstrapping system. It also builds easily into a single deployable jar file that contains all the code you need, with just a bit of Maven setup; all of which is well documented. Oh, and the documentation is very accessible, accurate and up-to-date. All in all, a very rare combination of things -- and something that would give RoR and other "easier than Java" frameworks a good run for their money, if hipsters ever decided to check out the best that Java has to offer.

The most relevant part here is the configuration system. Configuration can use either basic JSON or full YAML. And as I mentioned earlier, I am beginning to appreciate YAML for configuring things.

1.1. The Specific inspirational nugget: YAML converter

The way DropWizard uses YAML is to parse it using the SnakeYAML library, convert the resulting document into a JSON tree, and then use Jackson for data binding. This is useful since it allows one to use the full power of Jackson configuration, including annotations and polymorphic type handling.

But this got me thinking -- given that the whole converter implementation is about a dozen lines or so (to work to the degree needed for configs), wouldn't it make sense to add "full support" for YAML into the Jackson family of plug-ins?

I thought it would.

2. And Then There Was One More Backend for Jackson

Turns out that the implementation was, indeed, quite easy. I was able to improve certain things -- for example, the module can use the lower-level API to keep performance a bit better, and the output side works too, not just the reader -- but in a way there isn't all that much to do, since all the module has to do is convert YAML events into JSON events, and maybe help with some conversions.

Some of the more advanced things include:

  • Format auto-detection works, thanks to the "---" document prefix (which the generator also produces by default; see the small sketch after this list)
  • Although YAML itself exposes all scalars as text (unless type hints are enabled, which adds more noise to the content), the module uses heuristics to make the parser implementation a bit more natural; so although data-binding can also coerce types, this should usually not be needed
  • Configuration includes settings to change output style, to allow use of more aesthetically pleasing output (for those who prefer the "wiki look", for example)
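
To illustrate the document prefix, here is a minimal sketch of writing a value as YAML (the Map content is made up for illustration); by default the generator emits the "---" marker that also makes auto-detection possible:


  ObjectMapper yamlMapper = new ObjectMapper(new YAMLFactory());
  Map<String,Object> value = new LinkedHashMap<String,Object>();
  value.put("name", "Bob");
  value.put("verified", true);
  String doc = yamlMapper.writeValueAsString(value);
  // 'doc' starts with the "---" document marker, followed by "name: Bob" and so on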

At this point, functionality has been tested with a broad if shallow set of unit tests; but because the data-binding used is 100% the same as with JSON, testing is actually sufficient to start using the module for real work.

3. Usage? So boring I tell you

Oh, and you might be interested in knowing how to use the module. This is the boring part, since... there isn't really much to it.

You just use "YAMLFactory" wherever you would normally use "JsonFactory"; and then under the hood you get "YAMLParser" and "YAMLGenerator" instances, instead of JSON equivalents. And then you either use parser/generator directly, or, more commonly, construct an "ObjectMapper" with "YAMLFactory" like so (code snippet itself is from test "SimpleParseTest.java")


  ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
  User user = mapper.readValue("firstName: Billy\n"
      + "lastName: Baggins\n"
      + "gender: MALE\n"
      + "userImage: AQIDBAY=",
      User.class);
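
For completeness, the User class assumed above could be a simple POJO along these lines (a sketch; the actual test class may differ slightly):


  public enum Gender { MALE, FEMALE }

  public class User {
    public String firstName, lastName;
    public Gender gender;
    public byte[] userImage; // bound from the base64-encoded String above
  }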


and to get the functionality itself, the Maven dependency is:

<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-yaml</artifactId>
  <version>2.0.0</version>
</dependency>

4. That's all Folks -- until you give us some Feedback!

That's it for now. I hope some of you will try out this new backend, and help us further make Jackson 2.0 the "Universal Java Data Processor".

Saturday, April 07, 2012

Java Type Erasure not a Total Loss -- use Java Classmate for resolving generic signatures

As I have written before ("Why 'java.lang.reflect.Type' Just Does Not Cut It"), Java's Type Erasure can be a royal PITA.

But things are actually not quite as bleak as one might think. Let's start with an actual, somewhat unsolvable problem; and then proceed to another important, similar, yet solvable problem.

1. Actual Unsolvable problem: Java.util Collections

Here is a piece of code that illustrates a problem that most Java developers either understand, or think they understand:

  Map<String,Integer> stringsToInts = new HashMap<String,Integer>();
  Map<byte[],Boolean> bytesToBools = new HashMap<byte[],Boolean>();
  assertSame(stringsToInts.getClass(), bytesToBools.getClass());

The problem is that although conceptually the two collections seem to act differently, at runtime they are instances of the very same class (Java does not generate new classes for parameterized types, unlike C++).

So while the compiler helps in keeping typing straight, there is little runtime help to either enforce this, or allow other code to deduce the expected types; there just isn't any difference from a type perspective.

2. All Lost? Not at all

But let's look at another example. Starting with a simple interface


public interface Callable<IN, OUT> {
public OUT call(IN argument);
}

do you think the following always holds as well?


public void compare(Callable<?,?> callable1, Callable<?,?> callable2) {
assertSame(callable1.getClass(), callable2.getClass());
}

Nope. Not necessarily; the classes may well be different. WTH?

The difference here is that since Callable is an interface (and you can not instantiate an interface), instances must be of some other type; and there is a good chance the two types differ.

But more importantly, if you use the Java ClassMate library (more on this in just a bit), we can even figure out the parameterization (unlike with the earlier example, where all you could see is that the parameters are "a subtype of java.lang.Object"); so for example we can do:


  // Assume 'callable1' was of type:
  // class MyStringToIntList implements Callable<String, List<Integer>> { ... }
  TypeResolver resolver = new TypeResolver();
  ResolvedType type = resolver.resolve(callable1.getClass());
  List<ResolvedType> params = type.typeParametersFor(Callable.class);
  // so we know it has 2 parameters; from above, 'String' and 'List<Integer>'
  assertEquals(2, params.size());
  assertSame(String.class, params.get(0).getErasedType());
  // and the second type is generic itself; in this case we can directly access it
  ResolvedType resultType = params.get(1);
  assertSame(List.class, resultType.getErasedType());
  List<ResolvedType> listParams = resultType.getTypeParameters();
  assertSame(Integer.class, listParams.get(0).getErasedType());
  // or, just to see types visually, try:
  String desc = type.getSignature(); // or 'getFullDescription()'

How is THIS possible? (fun exercise: pick 5 of your favorite Java experts, ask if the above is possible, and observe how most of them say "nope, not a chance" :-) )

3. Long live generics -- hidden deep, deep within

Basically, generic type information is actually stored in class definitions, in 3 places:

  1. When defining parent type information ("super type"); parameterization for base class and base interface(s) if any
  2. For generic field declarations
  3. For generic method declarations (return, parameter and exception types)

It is the first place where ClassMate finds its stuff. When resolving a Class, it will traverse the inheritance hierarchy, recomposing type parameterizations. This is a rather involved process, mostly due to type aliasing, the ability for interfaces to use different signatures, and so on. In fact, trying to do this manually looks feasible at first, but if you try to handle all the wildcarding yourself, you will soon realize why having a library do it for you is a nice thing...

So the important thing to learn is this: to retain run-time generic type information, you MUST pass concrete sub-types which resolve generic types via inheritance.

And this is where JDK collection types bring in the problem (wrt this particular issue): concrete types like ArrayList still take generic parameters themselves; and this is why runtime instances do not have the generic type available.

Another way to put this is that when using a subtype, say:


  List<String> list = new ArrayList<String>() { }; // anonymous subclass binds the type parameter
  // can use ClassMate now, a la:
  ResolvedType type = resolver.resolve(list.getClass());
  // type itself has no parameterization (concrete non-generic class); but it does implement List, so:
  List<ResolvedType> params = type.typeParametersFor(List.class);
  assertSame(String.class, params.get(0).getErasedType());

which once again retains a usable amount of generic type information.

4. Real world usage?

The above might seem like an academic exercise; but it is not. When designing typed APIs, many callbacks would actually benefit from proper generic typing. And of special interest are callbacks or handlers that need to do type conversions.

As an example, my favorite database access library, jDBI, makes use of this functionality (using embedded ClassMate) to figure out data-binding information without requiring an extra Class argument. That is, you could pass something like (not an actual code sample):

  MyPojo value = dbThingamabob.query(queryString, handler);

instead of what would more commonly be requested:

  MyPojo value = dbThingamabob.query(queryString, handler, MyPojo.class);

and the framework could still figure out what kind of thing 'handler' handles, assuming it is a generic interface the caller has to implement.
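
To make that concrete, here is a minimal sketch of the idea (the ResultHandler and MyPojoHandler names are made up for illustration, not jDBI's actual API):


  // hypothetical generic callback interface the caller implements:
  public interface ResultHandler<T> {
    T handle(Map<String,Object> row);
  }
  public class MyPojoHandler implements ResultHandler<MyPojo> {
    public MyPojo handle(Map<String,Object> row) { /* ... build MyPojo from row ... */ return null; }
  }

  // framework side: resolve what T the handler binds to, no Class argument needed
  TypeResolver resolver = new TypeResolver();
  ResolvedType type = resolver.resolve(handler.getClass());
  Class<?> target = type.typeParametersFor(ResultHandler.class).get(0).getErasedType();
  // 'target' is MyPojo.class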

The difference may seem minute, but it can actually help a lot by simplifying some aspects of type passing, and removing one particular mode of error.

5. More on ClassMate

The above barely scratches the surface of what ClassMate provides. Although it is already tricky to find "simple" parameterization for main-level classes, there are much trickier things; specifically, resolving types of Fields and Methods (return types, parameters). Given classes like:

  public interface Base<T> {
    public T getStuff();
  }
  public class ListBase<T> implements Base<List<T>> {
    protected List<T> value;
    protected ListBase(List<T> v) { value = v; }
    public List<T> getStuff() { return value; }
  }
  public class Actual extends ListBase<String> {
    public Actual(List<String> value) { super(value); }
  }

you might be interested in figuring out exactly what the return type of "getStuff()" is. By eyeballing, you know it should be "List<String>", but the bytecode does not tell you this directly -- the declared return type is basically just the type variable.

But with ClassMate you can resolve it:

  // start with ResolvedType; need MemberResolver
  ResolvedType classType = resolver.resolve(Actual.class);
  MemberResolver mr = new MemberResolver(resolver);
  ResolvedTypeWithMembers beanDesc = mr.resolve(classType, null, null);
  ResolvedMethod[] members = beanDesc.getMemberMethods();
  ResolvedType returnType = null;
  for (ResolvedMethod m : members) {
    if ("getStuff".equals(m.getName())) {
      returnType = m.getReturnType();
    }
  }
  // so, we should get:
  assertSame(List.class, returnType.getErasedType());
  ResolvedType elemType = returnType.getTypeParameters().get(0);
  assertSame(String.class, elemType.getErasedType());

and get the information you need.

6. Why so complicated for nested types?

One thing that is obvious from the code samples is that code using ClassMate is not as simple as one might hope. Handling of nested generic types, specifically, is a bit verbose in some cases (specifically: when the type we are resolving does not directly implement the type we are interested in).
Why is that?

The reason is that there is a wide variety of interfaces that any class can (and often does) implement. Further, parameterizations may vary at different levels, due to co-variance (ability to override methods with more refined return types). This means that it is not practical to "just resolve it all" -- and even if this was done, it is not in general obvious what the "main type" would be. For these reasons, you need to manually request parameterization for specific generic classes and interfaces as you traverse type hierarchy: there is no other way to do it.

Friday, April 06, 2012

Notes on upgrading Jackson from 1.9 to 2.0

If you have existing code that uses Jackson version 1.x, and you would like to see how to upgrade to 2.0, there isn't much documentation around yet; although Jackson 2.0 release page does outline all the major changes that were made.

So let's try to see what kind of steps are typically needed (note: this is based on Jackson 2.0 upgrade experiences by @pamonrails -- thanks Pierre!)

0. Pre-requisite: start with 1.9

At this point, I assume the code to upgrade works with Jackson 1.9, and does not use any deprecated interfaces (many methods and some classes were deprecated during the course of 1.x; all deprecated things went away with 2.0). So if your code is using an older 1.x version, the first step is usually to upgrade to 1.9, as this simplifies the later steps.

1. Update Maven / JAR dependencies

The first thing to do is to upgrade jars. Depending on your build system, you can either get jars from Jackson Download page, or update Maven dependencies. New Maven dependencies are:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-annotations</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-core</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.0.0</version>
</dependency>

The main thing to note is that instead of 2 jars ("core", "mapper"), there are now 3: the former core has been split into a separate "annotations" package and the remaining "core", which contains the streaming/incremental parser/generator components. And "databind" is a direct replacement of the "mapper" jar.

Similarly, you will need to update dependencies to supporting jars like:

  • Mr Bean: com.fasterxml.jackson.module / jackson-module-mrbean
  • Smile binary JSON format: com.fasterxml.jackson.dataformat / jackson-dataformat-smile
  • JAX-RS JSON provider: com.fasterxml.jackson.jaxrs / jackson-jaxrs-json-provider
  • JAXB annotation support ("xc"): com.fasterxml.jackson.module / jackson-module-jaxb-annotations

these, and many, many more extension modules have their own project pages under the FasterXML Git repo.

2. Import statements

Since Jackson 2.0 code lives in new Java packages (com.fasterxml.jackson instead of org.codehaus.jackson), you will need to change import statements. Although most changes are mechanical, there isn't a strict one-to-one set of mappings.

The way I have done this is to simply use an IDE like Eclipse: remove all invalid import statements, and then use Eclipse functionality to find the new packages. Typical import changes include:

  • Core types: org.codehaus.jackson.JsonFactory/JsonParser/JsonGenerator -> com.fasterxml.jackson.core.JsonFactory/JsonParser/JsonGenerator
  • Databind types: org.codehaus.jackson.map.ObjectMapper -> com.fasterxml.jackson.databind.ObjectMapper
  • Standard annotations: org.codehaus.jackson.annotate.JsonProperty -> com.fasterxml.jackson.annotation.JsonProperty

It is often convenient to just use wildcard imports for the main categories (com.fasterxml.jackson.core.*, com.fasterxml.jackson.databind.*, com.fasterxml.jackson.annotation.*).

3. SerializationConfig.Feature, DeserializationConfig.Feature

The next biggest change was the refactoring of on/off Features, formerly defined as inner Enums of the SerializationConfig and DeserializationConfig classes. For 2.0, the entries were moved to separate stand-alone enums:

  1. DeserializationFeature contains most of entries from former DeserializationConfig.Feature
  2. SerializationFeature contains most of entries from former SerializationConfig.Feature

Entries that were NOT moved along are ones that were shared by both; these were instead added to the new MapperFeature enumeration, for example (a small before/after sketch follows the list):

  • SerializationConfig.Feature.DEFAULT_VIEW_INCLUSION became MapperFeature.DEFAULT_VIEW_INCLUSION
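
In practice the change is mostly mechanical; a small sketch of what configuration code might look like before and after (the feature names here are just examples):


  // 1.9 style:
  // mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
  // mapper.configure(SerializationConfig.Feature.DEFAULT_VIEW_INCLUSION, false);

  // 2.0 equivalents:
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
  mapper.configure(MapperFeature.DEFAULT_VIEW_INCLUSION, false);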

4. Tree model method name changes (JsonNode)

Although many methods (and some classes) were renamed here and there, mostly these were one-offs. But one area where major naming changes were done was the Tree Model -- this is because the 1.x names were found to be rather unwieldy and unnecessarily verbose. So we decided that it would make sense to try to do a "big bang" name change with 2.0, to get to a clean(er) baseline.

Changes made were mostly of following types:

  • getXxxValue() changes to xxxValue(): getTextValue() -> textValue(), getFieldNames() -> fieldNames(), and so on (see the small example after this list)
  • getXxxAsYyy() changes to asYyy(): getValueAsText() -> asText()
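
For example, a typical Tree Model access pattern would change roughly like this (a sketch; 'src' is whatever JSON source you are reading):


  JsonNode root = mapper.readTree(src);
  // 1.x: String name = root.get("name").getTextValue();
  String name = root.get("name").textValue();
  // 1.x: Iterator<String> names = root.getFieldNames();
  Iterator<String> names = root.fieldNames();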

5. Miscellaneous

Some classes were removed:

  • CustomSerializerFactory, CustomDeserializerFactory: you should instead use a Module (like SimpleModule) for adding custom serializers and deserializers (a small sketch follows)
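
A minimal sketch of the Module-based alternative (MyType, MyTypeSerializer and MyTypeDeserializer are placeholders for your own classes):


  SimpleModule module = new SimpleModule("MyModule", Version.unknownVersion());
  module.addSerializer(MyType.class, new MyTypeSerializer());     // extends JsonSerializer<MyType>
  module.addDeserializer(MyType.class, new MyTypeDeserializer()); // extends JsonDeserializer<MyType>
  mapper.registerModule(module);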

6. What else?

This is definitely an incomplete list. Please let me know what I missed, when you try upgrading!

Thursday, March 29, 2012

Jackson 2.0: CSV-compatible as well

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released"; or, for XML support, see "Not just for JSON any more -- also in XML")

Now that I have talked about XML, it is good to follow up with another commonly used, if somewhat humble, data format: Comma-Separated Values ("CSV" for friends and foes).

As you may have guessed... Jackson 2.0 supports CSV as well, via the jackson-dataformat-csv project, hosted at GitHub.

For attention-span-challenged individuals, check out the Project Page: it contains a tutorial that can get you started right away.
For others, let's take a slight detour to talk through the design, so that the additional components involved make some sense.

1. In the beginning there was a prototype

After completing Jackson 1.8, I got to one of my wishlist projects: being able to process CSV using Jackson. The reason for this is simple: while simplistic and under-specified, CSV is very commonly used for exchanging tabular datasets.
In fact, it (in variant forms: "pipe-delimited", "tab-delimited" and so on) may well be the most widely used data format for things like Map/Reduce (Hadoop) jobs, analytics processing pipelines, and all kinds of scripting systems running on Unix.

2. Problem: not "self-describing"

One immediate challenge is the lack of information on the meaning of data, beyond the basic division into rows and columns. Compared to JSON, for example, one neither necessarily knows which "property" a value is for, nor the actual expected type of the value. All you might know is that row 6 has 12 values, expressed as Strings that look vaguely like numbers or booleans.

But then again, sometimes you do have a name mapping as the first row of the document: if so, it represents the column names. You still don't have datatype declarations, but at least it is a start.

Ideally, any library that supports CSV reading and writing should support the commonly used variations: from an optional header line (mentioned above) to different separators (while the name implies just commas, other characters are commonly used, such as tabs and the pipe symbol) and possibly quoting/escaping mechanisms (some variants allow backslash escaping).
And finally, it would be nice to expose both "raw" sequences and high-level data-binding to/from POJOs, similar to how Jackson works with JSON.

3. So expose basic "Schema" abstraction

To unify different ways of defining the mapping between property names and columns, Jackson now supports a general concept of a Schema. While the interface itself is little more than a tag interface (to make it possible to pass an opaque format-specific Schema instance through factories), data-format specific subtypes can and do extend functionality as appropriate.

In case of CSV, Schema (use of which is optional -- more on "raw" access later on) defines:

  1. Names of columns, in order -- this is mandatory
  2. Scalar datatypes columns have: these are coarse types, and this information is optional

Note that the reason type information is strictly optional is that when it is missing, all data is exposed as Strings; and Jackson databinding has an extensive set of standard coercions, meaning that things like numbers are conveniently converted as necessary. Specifying type information, then, can help in validating contents and possibly improving performance.

4. Constructing "CSV Schema" objects

How does one get access to these Schema objects? Two ways: build manually, or construct from a type (Class).

Let's start with the latter, using the same POJO type as in the earlier XML example:


  public enum Gender { MALE, FEMALE };

  // Note: MUST ensure a stable ordering; either alphabetic, or explicit
  // (JDK does not guarantee order of properties)
  @JsonPropertyOrder({ "name", "gender", "verified", "image" })
  public class User {
    public Gender gender;
    public String name;
    public boolean verified;
    public byte[] image;
  }

  // note: we could use std ObjectMapper; but CsvMapper has convenience methods
  CsvMapper mapper = new CsvMapper();
  CsvSchema schema = mapper.schemaFor(User.class);

or, if we wanted to do this manually, we would do (omitting types, for now):


  CsvSchema schema = CsvSchema.builder()
    .addColumn("name")
    .addColumn("gender")
    .addColumn("verified")
    .addColumn("image")
    .build();
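
And if we did want to include the (optional) coarse type information mentioned earlier, the builder can take it as well -- a small sketch:


  CsvSchema typedSchema = CsvSchema.builder()
    .addColumn("name", CsvSchema.ColumnType.STRING)
    .addColumn("gender", CsvSchema.ColumnType.STRING)
    .addColumn("verified", CsvSchema.ColumnType.BOOLEAN)
    .addColumn("image", CsvSchema.ColumnType.STRING)
    .build();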

And there is, in fact, the third source: reading it from the header line. I will leave that as an exercise for readers (check the project home page).

Usage is identical, regardless of the source. Schemas can be used for both reading and writing; for writing they are only mandatory if output of the header line is requested.

5. And databinding we go!

Let's consider the case of reading CSV data from a file called "Users.csv", entry by entry. Further, we assume there is no header row to use or skip (if there were, the first entry would be bound from it -- there is no way for the parser to auto-detect a header row, since its structure is no different from the rest of the data).

One way to do this would be:


  MappingIterator<User> it = mapper
    .reader(User.class)
    .with(schema)
    .readValues(new File("Users.csv"));
  List<User> users = new ArrayList<User>();
  while (it.hasNextValue()) {
    User user = it.nextValue();
    // do something?
    users.add(user);
  }
  // done! (underlying FileReader gets closed when we hit the end etc)

Assuming we wanted instead to write CSV, we would use something like the following. Note that here we DO want to add the explicit header line, just for fun:


  // let's force use of Unix linefeeds, and request a header line:
  ObjectWriter writer = mapper
    .writer(schema.withLineSeparator("\n").withHeader());
  writer.writeValue(new File("ModifiedUsers.csv"), users);

One feature that we took advantage of here is that the CSV generator basically ignores any and all array markers; meaning that there is no difference whether we try writing an array, a List, or just a basic sequence of objects.

6. Data-binding (POJOs) vs "Raw" access

Although full data binding is convenient, sometimes we might just want to deal with a sequence of arrays with String values. You can think of this as an alternative to "JSON Tree Model"; an untyped primitive but very flexible data structure.

All you really have to do is omit the definition of the schema (which changes the token sequence the parser exposes), and make sure not to enable handling of the header line.
For this, the code to use (for reading) looks something like:


  CsvMapper mapper = new CsvMapper();
  MappingIterator<Object[]> it = mapper
    .reader(Object[].class)
    .readValues("1,null\nfoobar\n7,true\n");
  Object[] data = it.nextValue();
  assertEquals(2, data.length);
  // since we have no schema, everything is exposed as Strings, really
  assertEquals("1", data[0]);
  assertEquals("null", data[1]);

Finally, note that use of raw entries is the only way to deal with data that has an arbitrary number of columns (unless you just want to add a maximum number of bogus columns -- it is ok to have less data than columns).

7. Sequences vs Arrays

One potential inconvenience with access is that by default CSV content is exposed as a sequence of "JSON" Objects. This works fine if you want to read entries one by one.

But you can also configure parser to expose data as an Array of Objects, to make it convenient to read all the data as a Java array or Collection (as mentioned earlier, this is NOT required when writing data, as array markers have no effect on generation).

I will not go into details, beyond pointing out that the configuration to enable the addition of the "virtual array wrapper" is:


mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);

and after this you can bind entries as if they came in as an array: both "raw" ones (Object[][]) and typed (List<User> and so on).
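
For instance, a minimal sketch of binding the whole file in one call (assuming the same 'schema' and User type as above):


  CsvMapper arrayMapper = new CsvMapper();
  arrayMapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
  User[] allUsers = arrayMapper
    .reader(User[].class)
    .with(schema)
    .readValue(new File("Users.csv"));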

8. Limitations

Compared to JSON, CSV is a more limited data format. So does this limit usage of the Jackson CSV reader?

Yes. The main limitation is that column values need to essentially be scalar values (strings, numbers, booleans). If you do need more structured types, you will need to work around this, usually by adding custom serializers and deserializers: these can then convert structured types into scalar values and back. However, if you end up doing lots of this kind of work, you may want to consider whether CSV is the right format for you.

9. Test Drive!

As with all the other JSON alternatives, CSV extension is really looking forward to more users! Let us know how things work.
