Sunday, September 19, 2010

Every day Jackson usage, part 2: Open Content

1. Open content vs Closed content

In content modeling, closed content models are ones where everything that can be included in a piece of content is fully defined, usually using a static schema of some sort. Open content differs in that it only defines a subset of possible content -- the part that is recognized -- while additional content may exist; it is also possible to have multiple complementary content models describing a single document. There are degrees in between; for example, it may be possible to "open up" a content model by allowing some amount of content to be added at specified extension points. Open models are also useful in cases where the currently used model is static (and could be described with a closed model), but where it is reasonable to expect that the model may need to evolve, to be extended with additional content items.

Object models of programming languages vary from closed to open. Java's object model is very much a closed one, for example, defined by static class definitions; while one can extend classes, instances can not be dynamically extended. On the other end of the spectrum, content models of scripting languages like Javascript are open: in Javascript, you can add new properties and even functions on the fly. Content typing for more dynamic languages is sometimes called "duck typing" ("if it walks like a duck and quacks like a duck, it is most likely a duck...") as opposed to traditional "static typing" (or even "dynamic typing", which just changes when types are checked but does not remove strong typing).

When modelling data content expressed using data formats like JSON, both options are possible. Some tools only support one kind: for example, the DOM API for XML supports an open content model; and data binding libraries like JAXB typically support only (or mostly) closed content models derived from Java class or XML Schema definitions. Combinations are also possible: although XML and DOM allow any kind of content, one can use XML Schema as a kind of straitjacket to force a closed (or mostly closed) content model on otherwise loose data.

2. Jackson: open or closed? Both!

Jackson has traditionally supported both general content model types: the low-level streaming interface is content agnostic (and could be viewed as a very open model); the Tree Model is for handling open content; and Data Binding supports handling of closed content, with limited functionality to support some aspects of open content models. Before taking a closer look at what Jackson has to offer, let's first consider why Open Content may be necessary.

3. Need for Open Content

A rule of thumb regarding code and data is that data typically has a much longer life-cycle than code. A crude analogy: whereas there are books that are hundreds of years old, there are few printing presses of comparable age in existence. The same will be true for all kinds of documents, from papyrus to microfiche and web pages. As such, the life cycle of content models -- the logical constructs that define structure and content of documents -- will be potentially very long. And due to the need for backwards compatibility, content models must evolve, to accommodate new kinds of content that co-exist with old content.

This need for evolving data models is an obvious reason why closed content models are problematic: to use a closed content model, one has to believe that the official (first) definition is basically complete (so no changes are needed); or alternatively that one has power over all systems that handle said content (to allow for changing those systems when model changes). Needless to say, few have (or should have) faith in having either of these powers.

Open content, then, is based on the assumption that over time new content will be added. This does not mean haphazard changes, but rather changes that extend the model by creating supersets: the old model is a subset of the extended version. There are actually other interesting reasons for using open content models -- for example, when multiple "facets" of an item are expressed in a single document, it may be useful to have multiple complementary partial content models; and considering them open models allows peaceful co-existence -- but for this article, it is enough to focus on the particular issue of data evolution.

So where is the catch? A significant problem with open content models is that of handling open content documents in applications: given that the model is likely to evolve over time (or that content is only partially described by the model, due to other parts of the data being irrelevant for the specific application), applications need to be mindful not to assume immutability or completeness of the model itself. And this is where we can go back to Jackson. :-)

4. Jackson Data Binding: simplest tools, ignoring unknown properties

Jackson 1.5 and prior allowed rudimentary ways to allow "unspecified content". For the most part this consisted of the ability to specify that unrecognized ("unknown") properties can either be ignored, or handled using custom handlers. See "Ignoring Unknown Properties" for more details on how this can be done.
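As a quick refresher, both a per-class annotation and a per-mapper feature exist for this in Jackson 1.x. A minimal sketch (the class and property names here are made up for illustration):

```java
import org.codehaus.jackson.annotate.JsonIgnoreProperties;
import org.codehaus.jackson.map.DeserializationConfig;
import org.codehaus.jackson.map.ObjectMapper;

public class IgnoreUnknownExample {
    // Per-class switch: unknown properties encountered for User are quietly skipped
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class User {
        public String name;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Per-mapper alternative: disable failure on unknown properties globally
        mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        // "unknown" is silently dropped instead of causing an exception:
        User user = mapper.readValue("{\"name\":\"Bob\",\"unknown\":123}", User.class);
        System.out.println(user.name);
    }
}
```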

But while it is useful to be able to prevent processing errors when content models are extended, it has one unfortunate effect: any such unrecognized content is promptly dropped, and will not be serialized if content is to be written out as JSON again. This is harmful both for cases where new content structures are added, and where "irrelevant" parts of documents are very much relevant for other systems that process content.

5. Jackson Data Binding: handling of "any content"

Jackson 1.6 addresses this issue by adding a new mechanism for handling content extensions, which allows for better dealing with Open Content: the ability to define both an "any setter" and an "any getter". Consider this example POJO (from the "@JsonAnyGetter" wiki page):

  public class ExtensibleBean
  {
    public String name; // we always have a name

    private HashMap<String, String> properties = new HashMap<String, String>();

    public ExtensibleBean() { }

    @JsonAnySetter public void add(String key, String value) { // method name does not matter
      properties.put(key, value);
    }

    @JsonAnyGetter public Map<String,String> properties() { // note: for 1.6.0 MUST use non-getter name; otherwise name doesn't matter
      return properties;
    }
  }
The idea is basically that you can have fully-defined properties (here just "name"), handled using a (partial) Closed Model. But you can also specify a per-type general handler for anything that is not recognized. A typical handler would just use a Java Map to store such values, using natural simple binding (Maps, Lists, wrappers). And not only store them (deserialized using @JsonAnySetter, which has existed since version 1.1), but also serialize them back as JSON using @JsonAnyGetter.
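Putting the pieces together, a round trip might look like the following self-contained sketch (the bean is restated inline for completeness; names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import org.codehaus.jackson.annotate.JsonAnyGetter;
import org.codehaus.jackson.annotate.JsonAnySetter;
import org.codehaus.jackson.map.ObjectMapper;

public class AnyPropertyRoundTrip {
    public static class Bean {
        public String name;
        private Map<String, String> other = new HashMap<String, String>();

        @JsonAnySetter
        public void add(String key, String value) { other.put(key, value); }

        @JsonAnyGetter
        public Map<String, String> any() { return other; } // non-getter name, see 1.6.0 note
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // "name" binds to the explicit property; "extra" goes through the any-setter
        Bean bean = mapper.readValue("{\"name\":\"Bob\",\"extra\":\"stuff\"}", Bean.class);
        // ...and the any-getter writes it back out on serialization, so nothing is lost:
        String output = mapper.writeValueAsString(bean);
        System.out.println(output);
    }
}
```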

Ok: but given that any explicit handling of these extensions requires new code -- after all, how could you write code for handling things you don't even recognize? -- how does this help?

It helps because this way extended data can be passed through to systems that may know how to handle it. So we will avoid both

  • failing on unrecognized content, or worse
  • dropping unrecognized content silently

both of which would have nasty consequences for later versions; the latter being especially nasty as it can essentially corrupt data. And as data is typically much more valuable than code, one should rather take a processing error (which indicates a problem to fix) than silent data corruption, which will only bite later and cause much more pain.

6. Jackson Data Binding: handling of content as "untyped"

There is actually another way to deal with unknown content, in cases where extension points are defined in advance, since it is possible to use loose types like:

  • java.lang.Object: all JSON constructs can be mapped to matching "natural" Java types: JSON booleans become java.lang.Boolean, JSON Strings become Java Strings, JSON Arrays become Java Lists and JSON Objects become Java Maps
  • org.codehaus.jackson.JsonNode: all JSON constructs can also be expressed as JSON trees, as per Jackson's Tree Model. This can also be considered "untyped" binding

it is then possible to just declare that a JSON property maps to a Java object without knowing anything about the structure of the type. The main downside is that you still need to know which specific properties are of such unknown (or loose) type. Using loose typing has other uses as well; it is commonly used for multi-step type conversions, where the type may only be known at a later point (possibly as a result of data made available by initial data binding; or by a later processing stage that takes initially bound data as its input).
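For instance, a wrapper with an untyped payload could look like this hedged sketch (the Wrapper type and property names are made up):

```java
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class UntypedExample {
    public static class Wrapper {
        public String type;
        public Object payload; // structure unknown in advance; bound to "natural" Java types
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        Wrapper w = mapper.readValue(
            "{\"type\":\"user\",\"payload\":{\"name\":\"Bob\",\"id\":3}}", Wrapper.class);
        // The JSON Object becomes a Map; a later stage could convert it to a real type
        Map<?,?> asMap = (Map<?,?>) w.payload;
        System.out.println(asMap.get("name"));
    }
}
```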

7. Other related work

More generally, the approach taken by many processing libraries is to allow filtering out content that is not recognized. One example is StaxMate (an XML processing library), which implements cursors that only show content that is asked for: you get what you want to get (or "see what you want to see"); anything else can be ignored.
This works reasonably well for read-only use cases.

Other data binding frameworks also support similar notions of foreign content: JAXB, for example, has the @XmlAnyElement annotation to denote content of unknown type; as well as the ability to bind sub-trees into generic "untyped" content (expressed as type-agnostic DOM elements).

UPDATE, 05-October-2010: I just learnt that the 1.6.0 release contained a small but annoying bug: if the @JsonAnyGetter method had a valid getter signature (getXxx() with no arguments), properties would be duplicated. This is fixed for 1.6.1, but if using 1.6.0, make sure the method name is not of form 'getXxx()' to avoid problems.

Saturday, September 18, 2010

Jackson 1.6, improved Tree Model access; finding values of all instances of specified property

One class of improvements included in Jackson 1.6 are improvements to Tree Model accessors, added to JsonNode.

So, for example, if you have an array of User objects:

  { "name" : "Billy", "id" : 123 },
  { "name" : "Bob", "id" : 456, "extraInfo" : { "flags" : 3 } } // and so on   

and you would just like to get a list of names of all users, formerly you needed to write a piece of glue code to get the names, build a List and so on. While simple to do, it is a boring and ultimately error-prone thing to have to do.

So new functionality that was added can simplify such tasks:

  JsonNode root = objectMapper.readTree(json);
  List<JsonNode> nameNodes = root.findValues("name");
  // or, given we know they are Strings, we can get actual name values directly:
  List<String> nameStrings = root.findValuesAsText("name");

Or, if you actually wanted to find just the Users that have property "extraInfo" defined, you would do:

  List<JsonNode> nodes = root.findParents("extraInfo");

Granted, these find methods are not superbly powerful -- hopefully we can work on actual JSON path expressions in the future -- but they help a bit for common cases.
And even without formal expression language, perhaps we could add filter-based alternatives for more concise recursive-descent lookups (findValues(new NodeFilter() { .... } ) ?)

Friday, September 17, 2010

XML Schema: case of a "simple" element with text and one attribute

It has been a while since I worked with XML Schemas. But after immersing myself in delicious XML Schema material (to revamp Woodstox XML Schema support), I realized that it has not been nearly enough time.

That XML Schema is a very complex way of doing not much more than what DTDs allow (basically, the main tangible useful thing you get is playing nice with namespaces -- necessary, but not much to write home about) is not news, but rather a recognized fact. But just how complicated AND verbose it is isn't obvious until you have to write schemas. So here is the very first thing I had to do again: write a schema for a single element with text content, and a single text attribute. How much XML would that be? One or two lines?

No such luck. Unless I am mistaken, here's how to do it (adapted from Eric van der Vlist's XML Schema book, highly recommended):

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
 <xs:element name='price'>
  <xs:complexType>
   <xs:simpleContent>
    <xs:extension base='xs:int'>
     <xs:attribute name='currency' type='xs:string' />
    </xs:extension>
   </xs:simpleContent>
  </xs:complexType>
 </xs:element>
</xs:schema>

God have mercy on us... this makes Java look like Forth in comparison. What were they thinking?

Sunday, September 12, 2010

Lazy Coder Cuisine: Making Ciabatta -- lots of work, tasty rewards

Here is something different within the culinary domain. I love good bread, and at times it has been challenging to find good stuff on this continent. But while I have not located all kinds of good breads that exist in the northern parts of the old continent, I have found new good things over here. One of those is ciabatta, which is actually a fairly recent invention, despite coming from the established culinary superpower of Italy.

Anyway: while it is possible to find good ciabatta here in the Seattle area, it can be a bit pricey. And besides economy, freshness is still a challenge. So I figured that perhaps it might not be beyond my capabilities (or rather, team Saloranta's capabilities -- my wife is a superior chef these days) to bake a decent loaf of ciabatta.

Ok: we settled for about the first Google hit, and specifically on this recipe. Reading through it, it became clear that this is not one of Lazy Coder's recipes; the process is surprisingly lengthy for bread making (granted, variance between kinds of breads is huge), if nothing extraordinary. Still, it's actually possible to get it all done within a day.

Long story short: following the recipe closely, and using a $20 pizza stone from Bed Bath & Beyond (pre-heating for a bit over the suggested 45 minutes) produced 2 rather tasty instances. Even the 2 steps that felt tricky (kneading by hand -- I suspect the time mentioned is a big overestimate, 2 minutes is probably fine -- maybe machines have less torque or something; and getting the thing onto the pizza stone from parchment paper, should have added flour on the parchment) failed to ruin the results. So two thumbs up for the recipe and the idea, as long as you have the time and energy to do it.

One thing I would suggest for anyone who tries this is to just double the dose and make 4 ciabattas: the effort is not doubled, so if you go through the trouble, you might as well get enough to eat for a while, or to offer to good friends.

And lastly: one nice thing about Ciabatta is that it is delicious with very little else: for example, with melted butter. Yum.

ps. On a somewhat related note: another late pleasant finding was that US "genoa salami" is close enough to old world "metwurst" (almost as close as "gypsy salami", which is the best US match so far). Couple this with good cheese (sharp cheddar, emmentaler [aka "swiss cheese"], gouda), and you get pretty close to a perfect "default topping" for all kinds of breads -- not just dark sourdough rye bread (a Finnish specialty), but all kinds of white wheat-based breads too.

Tuesday, September 07, 2010

Jackson 1.6 released

After almost 6 months of development, Jackson 1.6 was finally released last night (download responsibly!). Despite preliminary plans of creating a somewhat smaller incremental version after the big bang of 1.5, over time things changed and we actually have another biggie-size increment on our hands. But whereas 1.5 was focused on implementing a complete solution for just one big hairy problem (handling of polymorphic types), 1.6 is a full frontal assault against remaining hard-to-handle use cases. It both expands the set of use cases that Jackson can handle, and improves support for existing use cases, making usage even more convenient and performant.

For the full list of features, check out FasterXML Jackson 1.6 features page and 1.6.0 release notes. But here is an overview of most notable changes.

1. Structural changes: 2 new optional jars

At surface level, one obvious thing is that there are now 2 more optional jars you can include. They contain new functionality known as "Mr Bean" and "Project Smile"; more on these in a moment. There is also an addition of "jackson-all" jar which simply contains everything from all the other jars; to be used when you just don't want n+1 separate jars around and would rather have a single fat jar for all Jackson stuff.

Otherwise packaging remains the same; and backwards compatibility works as expected for a "minor" release -- that is, code written for earlier 1.x versions should work as is. Considering the scope of changes, upgrading from versions 1.4 and 1.5 specifically should be very safe.

2. Shiny New Things: Big 4 of 1.6

There are multiple ways to group changes and improvements. Let's start with what I view as 4 major new features:

  • ObjectMapper.updateValue(): ability to merge changes, deltas
  • Automatic Parent/Child reference handling: better support for Object/Relational Mapper (ORM) values (Hibernate, iBatis)
  • Interface/Abstract class Materialization ("Mr Bean"): give Jackson your interfaces, forget about boiler plate classes
  • JSON-compatible high-performance binary data format ("Project Smile"): even more performance without sacrificing convenience of schema-free data model

2.1 ObjectMapper.updateValue()

I actually wrote 'New feature: ability to "update" beans, not just recreate' a while ago, since this was the first new thing implemented after 1.5. The idea is that you can now optionally provide an existing object ("root value") when deserializing, and ObjectMapper can just update its properties, instead of instantiating a new object. This is useful when merging properties, for example by using default values and overrides, possibly with multiple levels of priorities, or when loading settings from multiple sources.

Usage is as simple as:

  Properties properties = new Properties();
  ObjectMapper mapper = new ObjectMapper();
  // can call multiple times if you want to merge multiple sets of values:
  properties = mapper.updatingReader(properties).readValue(jsonFromSomewhere);
The method ObjectMapper.updatingReader() creates an ObjectReader, which can be further configured (this also reduces the number of methods that need to be added to ObjectMapper itself). The object to update can be of any type supported by Jackson's regular ObjectMapper.readValue().

2.2 Parent/Child reference handling

One thing that has been problematic for serialization is linkage between parent and child objects for trees and ORM-mapped classes (or for simple doubly-linked lists). The problem is that without special handling this cyclic dependency causes a serialization failure. Prior to 1.6, the way to handle this problem was to suppress serialization of one of the links (usually the "back link" from child object to parent), and lose the back link; or to write custom serializers and deserializers. Jackson 1.6 offers a better way: use of 2 new annotations, @JsonManagedReference and @JsonBackReference. Consider an example with two classes:

public class Root {
  @JsonManagedReference
  public Leaf[] leaves; // works for simple POJOs, arrays, Lists, Maps etc
}

public class Leaf {
  @JsonBackReference
  public Root root;

  public String id;
}
serialization of a Root object with 2 Leaf objects would produce something like:

  "leaves" : [
    { "id" : "leaf1Id" },
    { "id" : "leaf2Id" }

which is similar to just using @JsonIgnore on the 'public Root root;' field. But the real trick is with deserialization, which will automatically set the 'root' field to point to the deserialized parent instance, as if that link had been serialized.

So behavior is:

  • @JsonManagedReference marks the forward reference that points to a corresponding @JsonBackReference; it is used when deserializing, and does not change serialization
  • @JsonBackReference will suppress serialization, and allow re-constructing the reference on deserialization

About the only additional feature is that in case there are multiple link references, it is possible to explicitly define an id to use for matching managed/back reference pairs. Note, too, that it is possible to use self-references; this would be needed for nodes of doubly-linked lists, for example.
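A minimal sketch of such a named pair, using a hypothetical self-referencing TreeNode type (the class and the reference id "tree" are made up for illustration):

```java
import org.codehaus.jackson.annotate.JsonBackReference;
import org.codehaus.jackson.annotate.JsonManagedReference;

// Self-referencing node type; the explicit id "tree" ties this managed/back
// pair together, so it would still match if other reference pairs were added.
public class TreeNode {
    public String id;

    @JsonManagedReference("tree")
    public TreeNode[] children;

    @JsonBackReference("tree") // suppressed on serialization, restored on deserialization
    public TreeNode parent;
}
```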

This feature is most useful for ORM beans, handling one-to-one and one-to-many references, but is also useful for some cyclic data structures.

2.3 Mr Bean: Let Jackson do Monkey Coding

One of the more boring and mundane tasks with Java has traditionally been the requirement to fully write out basic value-holding Beans: adding fields as well as getters and setters. Although Jackson has made it possible to reduce the need for such boilerplate code (for example, eliminating the need for setters by annotating private fields, or by using @JsonCreator annotated constructors -- see the previous article about using Jackson with Immutable Objects!), more could still be done.

And Jackson 1.6 does just that. Consider the following piece of code:

public interface Bean { // or could be an abstract class
  public String getName();
  public int getAge();
}

ObjectMapper mapper = new ObjectMapper();
// org.codehaus.jackson.mrbean.AbstractTypeMaterializer (extends AbstractTypeResolver)
mapper.getDeserializationConfig().setAbstractTypeResolver(new AbstractTypeMaterializer());
Bean value = mapper.readValue("{\"name\" : \"Billy\", \"age\" : 28 }", Bean.class);

With earlier versions of Jackson (and with any other Java data binder), what you would most likely get is an exception indicating the problem of not being able to create an instance of an interface. But thanks to that AbstractTypeMaterializer thing, you can now let Jackson materialize bean classes and relax.
Just remember to include the new "jackson-mrbean-1.6.0.jar" (or Maven dependency) and you are good to go. Pretty neat eh?

Bean materialization works for simple interfaces and abstract classes: methods recognized as setters and getters are implemented; other methods either cause a failure, or can optionally be implemented as error-throwing placeholders. Appropriate setters are created to match getters (which is needed for cases like the one above), but if you want to modify values yourself, you can also add explicit setter signatures. You can also use all the usual Jackson annotations for configuration: since the type materializer is only concerned with creating classes, and does NOT handle actual serialization or deserialization, the standard Jackson ObjectMapper will use them as before. The ability to use abstract classes could be especially useful in cases where you want to control specific aspects or properties, but leave simple properties to Mr Bean.
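To illustrate the abstract-class case, here is a hedged sketch, assuming materialization of a partially-implemented abstract class works as described above (the Named type and its properties are made up):

```java
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.mrbean.AbstractTypeMaterializer;

public class MrBeanAbstractExample {
    // Abstract class: handle one property explicitly, let Mr Bean fill in the rest
    public static abstract class Named {
        public abstract int getAge(); // materialized: backing field, getter and matching setter

        protected String name;
        public String getName() { return (name == null) ? "unknown" : name; }
        public void setName(String n) { name = n; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        mapper.getDeserializationConfig().setAbstractTypeResolver(new AbstractTypeMaterializer());
        Named value = mapper.readValue("{\"name\":\"Billy\",\"age\":28}", Named.class);
        System.out.println(value.getName() + " / " + value.getAge());
    }
}
```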

I hope to write some more about Mr Bean in another article. And I would especially appreciate feedback from users -- this has been the number one missing feature (as per my own prioritization) for almost 2 years now, and I expect it to be a big hit, comparable to the effect of mix-in annotations.

Finally, big Thank You to Sunny G who contributed the initial version of Mr Bean, so it could be included in 1.6.

2.4 Even More Extreme Performance: Project Smile, JSON-compatible binary format

And last but not least, another longer-term project that I have wanted to do for a while is defining an efficient and 100% JSON compatible binary format, similar to how various Binary XML formats have tackled high-performance XML use cases. Although there have been prior attempts at doing this (like BSON), none have been both fully JSON compatible and performant (BSON, for example, is neither a superset nor a subset of JSON). Yet others insist on having a rigid schema (Thrift, Protocol Buffers).

Project Smile tackled this challenge, and produced what we hope to be a very compelling binary data format, as well as full support for using that format exactly as one uses JSON. Sort of like just having a different representation of JSON. For those interested in low-level details, feel free to check out the Smile Data format specification (and specifically, if anyone is interested in implementing Smile support on other platforms, PLEASE check it out!).

To use Smile, all you need to do is instantiate a SmileFactory (from jackson-smile-1.6.0.jar) -- which extends the standard org.codehaus.jackson.JsonFactory -- and use it as is (to create SmileGenerators and SmileParsers, respectively extending JsonGenerator and JsonParser), or via ObjectMapper (construct ObjectMapper with a SmileFactory). All the usual functionality should work as is, including streaming parsing and generation, full data binding and Tree Model access.
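A minimal sketch of the ObjectMapper route (assuming the 1.6 package layout, org.codehaus.jackson.smile.SmileFactory):

```java
import java.util.Collections;
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.smile.SmileFactory;

public class SmileExample {
    public static void main(String[] args) throws Exception {
        // Construct ObjectMapper with a SmileFactory; everything else works as with JSON
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
        // Payload is binary Smile, but logically identical to the JSON equivalent:
        byte[] encoded = smileMapper.writeValueAsBytes(Collections.singletonMap("name", "Bob"));
        Map<?,?> decoded = smileMapper.readValue(encoded, Map.class);
        System.out.println(decoded.get("name"));
    }
}
```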

Obvious and measurable benefits include:

  • More compact data -- especially so for larger and more repetitive data, such as rows from database or entries for Map/Reduce tasks.
  • Even faster parsing, and possibly faster generation (one of the design criteria was that generation speed is not sacrificed for parsing speed or data size -- and hence Smile is one of the fastest binary data formats to write)

I hope to update the Thrift-protobuf performance benchmark with Smile-based test results in the near future: based on measurements I have done so far (using a locally modified version), Smile is typically 25-50% faster and produces 25-50% more compact data than Jackson with textual JSON. This makes it generally faster than Thrift or Avro on Java (which are often no faster than textual JSON with Jackson), and comparable in speed to Protocol Buffers -- and all this without sacrificing any of Jackson's flexibility or expressive power.

I am specifically hoping to show how Smile would be a good alternative to Avro for large-scale data processing; using optionally enabled property name and String value back references, data size can be compact enough to render schemas unnecessary; and turbo-charged Jackson parsing and generation keep data flowing at wire speed.

I will definitely write some more about Smile in future so stay tuned.

3. Other significant areas of improvement

Beyond "big four", 1.6 includes numerous improvements and fixes (release notes include 39 resolved Jira issues, mostly improvements and new features).

3.1 Enum value handling, customization

Handling of enum values was somewhat lacking prior to version 1.6. With 1.6 it is finally possible to simply define that Enum.toString() is to be used as the serialization value, by enabling SerializationConfig.Feature.WRITE_ENUMS_USING_TO_STRING (and the matching DeserializationConfig.Feature.READ_ENUMS_USING_TO_STRING). Alternatively, serialization can be defined with the existing @JsonValue annotation, which is now supported for Enum types; the obvious case being to annotate 'toString()' with @JsonValue.

It is also possible to define "creator" methods (aka factories) using the @JsonCreator annotation (constructors can not be used with Enums).
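Combining the two, a sketch might look like this (the Currency enum and its values are made up for illustration):

```java
import org.codehaus.jackson.annotate.JsonCreator;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.map.SerializationConfig;

public class EnumExample {
    public enum Currency {
        USD("US Dollar"), EUR("Euro");

        private final String desc;
        Currency(String desc) { this.desc = desc; }

        @Override public String toString() { return desc; }

        // Creator (factory) method used for deserialization instead of Enum.valueOf():
        @JsonCreator
        public static Currency forValue(String value) {
            for (Currency c : values()) {
                if (c.desc.equals(value)) { return c; }
            }
            return Currency.valueOf(value); // fall back to constant name
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Serialize using toString() instead of Enum.name():
        mapper.configure(SerializationConfig.Feature.WRITE_ENUMS_USING_TO_STRING, true);
        String json = mapper.writeValueAsString(Currency.USD);
        System.out.println(json);
    }
}
```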

3.2 More convenient Tree Model

There are numerous additions to Tree Model API (org.codehaus.jackson.JsonNode), such as:

  • coercion of numeric types (JsonNode.getValueAsInt() and other variants) that can also convert JSON String values
  • JsonNode.has(String fieldName) for checking existence of a property
  • set of findXxx() methods: JsonNode.findParent(), findParents(); findPath(), findValue(), findValueAsText() (check out Jackson Javadocs for details)

which should simplify common Tree traversal tasks a lot. I probably should write a bit more about these methods in future.

3.3 Serialization performance improvements

Although performance has always been a strong point of Jackson, there was room for improvement on the serialization side. With some tweaks, serialization speed was increased by an average of 20% (as per tests). No configuration changes are needed beyond upgrading to 1.6.

3.4 Allow registration of sub-types for Polymorphic Handling, without annotations

It is now possible to register subtypes for deserialization, instead of having to use the @JsonSubTypes annotation -- this was the number one request for improving polymorphic type handling. Registration is done using the new ObjectMapper.registerSubtypes() method(s).
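A minimal sketch of runtime registration (the Animal/Dog types are made up; this assumes the default logical type name for a subtype without @JsonTypeName is its simple class name):

```java
import org.codehaus.jackson.annotate.JsonTypeInfo;
import org.codehaus.jackson.map.ObjectMapper;

public class SubtypeExample {
    @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "type")
    public static abstract class Animal { }

    public static class Dog extends Animal {
        public String name;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Instead of @JsonSubTypes on Animal, register the subtype at runtime:
        mapper.registerSubtypes(Dog.class);
        Animal a = mapper.readValue("{\"type\":\"Dog\",\"name\":\"Rex\"}", Animal.class);
        System.out.println(a.getClass().getSimpleName());
    }
}
```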

3.5 Better support for Open Content using @JsonAnySetter and @JsonAnyGetter

Although the @JsonAnySetter annotation has been around since 1.0, to allow binding unknown properties during deserialization, there wasn't anything similar for serializing a miscellaneous set of properties. But now you can use @JsonAnyGetter to annotate a method that returns a Map of values; which generally works nicely for things collected using @JsonAnySetter.

3.6 Yet More Powerful Generics Support

Although Jackson has always supported generic types reasonably well, some advanced use cases (with type variable aliasing) could lead to sub-optimal handling in 1.5. These have been improved with 1.6.

4. What Next?

Probably 1.7. :-)

Seriously though, there is no need for major backwards-incompatible change (which would mean 2.0). But some obvious bigger areas for improvements are:

  • Better support for plug-in modules for third party datatypes -- this has been planned for a while, and really needs to be done to help further improve Jackson's support for all kinds of commonly used Java datatypes. This also includes support for contextual serializers/deserializers, ability to support per-property pluggable (datatype-specific) annotations
  • Support for fully cyclic data types, object identity. This is a rather hard nut to crack, but something that is needed for complete Java Object serialization support.
  • JSONPath support, ideally at JsonParser level; possibly as filter for materializing trees. This would be ideal for many large-scale data processing operations
  • Rewrite of annotation processing part to better support concept of logical property accessed using various accessors (setter, getter and/or direct field access)
  • Advanced code generation for generating optimal (as-fast-as-hand-written) serializers, deserializers.
  • Support for serializing as XML? ("JAXB mini") -- although we promise not to support weird automatic mappings like Badgerfish, it is not out of the question that we might be able to support a clean solid subset of JAXB-style code-first serialization between POJOs and XML.
  • Improved support for value types of non-Java, runs-on-JVM languages: Scala, Clojure, Groovy.

Which of these get tackled depends on contributions, feedback from users, and general fun-factor of working on adding things. So let your voice be heard, be it via Jackson user group, mailing lists or Jira voting.

About me

  • I am known as Cowtowncoder