Sunday, September 19, 2010

Every day Jackson usage, part 2: Open Content

1. Open content vs Closed content

In content modeling, closed content models are ones where everything that can be included in a piece of content is fully defined, usually using a static schema of some sort. Open content differs in that it defines only a subset of possible content, the part that is recognized, while additional content may exist; it is also possible to have multiple complementary content models describing a single document. There are degrees in between: for example, it may be possible to "open up" a content model by allowing some amount of content to be added at specified extension points. Open models are also useful in cases where the currently used model is static (and could be described with a closed model), but where it is reasonable to expect that the model may need to evolve, to be extended with additional content items.

Object models of programming languages vary from closed to open. Java's object model is very much a closed one, for example, defined by static class definitions; while one can extend classes, instances can not be dynamically extended. At the other end of the spectrum, object models of scripting languages like Javascript are open: in Javascript, you can add new properties and even functions on the fly. Typing in such dynamic languages is sometimes called "duck typing" ("if it walks like a duck and quacks like a duck, it is most likely a duck...") as opposed to traditional "static typing" (or even "dynamic typing", which changes when types are checked but does not remove strong typing).

When modeling data content expressed using data formats like JSON, both options are possible. Some tools support only one kind: for example, the DOM API for XML supports an open content model; and data binding libraries like JAXB typically support only (or mostly) closed content models derived from Java class or XML Schema definitions. Combinations are also possible: although XML and DOM allow any kind of content, one can use XML Schema as a kind of straitjacket to force a closed (or mostly closed) content model on otherwise loose data.

2. Jackson: open or closed? Both!

Jackson has traditionally supported both general content model types: the low-level streaming interface is content agnostic (and could be viewed as a very open model); the Tree Model is well suited for handling open content; and Data Binding supports handling of closed content, with limited functionality for some aspects of open content models. Before taking a closer look at what Jackson has to offer, let's first consider why Open Content may be necessary.

3. Need for Open Content

A rule of thumb regarding code and data is that data typically has a much longer life-cycle than code. A crude analogy: whereas there are books that are hundreds of years old, few printing presses of comparable age remain in existence. The same will be true for all kinds of documents, from papyrus to microfiche and web pages. As such, the life cycle of content models, the logical constructs that define the structure and content of documents, will be potentially very long. And due to the need for backwards compatibility, content models must evolve, to accommodate new kinds of content that co-exist with old content.

This need for evolving data models is an obvious reason why closed content models are problematic: to use a closed content model, one has to believe that the official (first) definition is essentially complete (so no changes will be needed); or alternatively that one has power over all systems that handle said content (so those systems can be changed whenever the model changes). Needless to say, few have (or should have) faith in having either of these powers.

Open content, then, is based on the assumption that new content will be added over time. This does not mean haphazard changes, but rather changes that extend the model by creating supersets: the old model is a subset of the extended version. There are actually other interesting reasons for using open content models as well: for example, when multiple "facets" of an item are expressed in a single document, it may be useful to have multiple complementary partial content models, and treating them as open models allows peaceful co-existence. But for this article, it is enough to focus on the particular issue of data evolution.

So where is the catch? A significant problem with open content models is the handling of open content documents by applications: given that the model is likely to evolve over time (or that content is only partially described by the model, because other parts of the data are irrelevant for a specific application), applications need to be mindful not to assume immutability or completeness of the model itself. And this is where we can get back to Jackson. :-)

4. Jackson Data Binding: simplest tools, ignoring unknown properties

Jackson 1.5 and earlier offered only rudimentary support for "unspecified content". For the most part it consisted of the ability to specify that unrecognized ("unknown") properties are either ignored, or handled using custom handlers. See "Ignoring Unknown Properties" for more details on how this can be done.
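As a sketch of the ignoring option (class and property names here are invented for illustration; imports assume Jackson 1.x's org.codehaus.jackson packages), unknown properties can be ignored either per class or globally:

```java
import java.io.IOException;

import org.codehaus.jackson.annotate.JsonIgnoreProperties;
import org.codehaus.jackson.map.DeserializationConfig;
import org.codehaus.jackson.map.ObjectMapper;

// Per-class: silently skip any JSON properties this class does not define
@JsonIgnoreProperties(ignoreUnknown = true)
class NameBean {
    public String name;
}

public class IgnoreUnknownExample {
    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        // Global alternative: do not fail on properties with no matching field/setter
        mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        NameBean bean = mapper.readValue("{\"name\":\"Bob\",\"extra\":123}", NameBean.class);
        System.out.println(bean.name); // "extra" was dropped without an exception
    }
}
```

Either mechanism prevents processing errors, but note the caveat discussed next: the skipped content is gone for good.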

But while it is useful to be able to prevent processing errors when content models are extended, this has one unfortunate effect: any such unrecognized content is promptly dropped, and will not be serialized if the content is written out as JSON again. This is harmful both in cases where new content structures have been added, and in cases where parts of a document "irrelevant" to this application are very much relevant to other systems that process the content.

5. Jackson Data Binding: handling of "any content"

Jackson 1.6 addresses the issue by adding a new mechanism for handling content extensions, which allows better handling of Open Content: the ability to define both an "any setter" and an "any getter". Consider this example POJO (from the "@JsonAnyGetter" wiki page):

  public class ExtensibleBean
  {
      public String name; // we always have name

      private Map<String, String> properties = new HashMap<String, String>();

      public ExtensibleBean() { }

      @JsonAnySetter public void add(String key, String value) { // name does not matter
          properties.put(key, value);
      }

      @JsonAnyGetter public Map<String, String> properties() { // note: for 1.6.0 MUST use non-getter name; otherwise doesn't matter
          return properties;
      }
  }

The idea is that you can have fully-defined properties (here just "name"), handled using a (partial) Closed Model; but you can also specify a per-type general handler for anything that is not recognized. A typical handler would just use a Java Map to store such values, using simple "native" binding (to Maps, Lists, wrappers). And not only store them (deserialized using @JsonAnySetter, which has existed since version 1.1), but also serialize them back as JSON using @JsonAnyGetter.
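A hypothetical round trip (the JSON content and imports are illustrative assumptions, with a compact copy of the bean above for self-containment) shows how an unrecognized property survives deserialization and re-serialization:

```java
import java.util.HashMap;
import java.util.Map;

import org.codehaus.jackson.annotate.JsonAnyGetter;
import org.codehaus.jackson.annotate.JsonAnySetter;
import org.codehaus.jackson.map.ObjectMapper;

public class RoundTrip {
    public static class ExtensibleBean {
        public String name;
        private Map<String, String> properties = new HashMap<String, String>();

        @JsonAnySetter public void add(String key, String value) {
            properties.put(key, value);
        }
        @JsonAnyGetter public Map<String, String> properties() {
            return properties;
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // "nickname" is not a declared property; the any-setter catches it
        ExtensibleBean bean = mapper.readValue(
                "{\"name\":\"Bob\",\"nickname\":\"Bobby\"}", ExtensibleBean.class);
        // the any-getter writes it back out, so nothing is silently lost
        String json = mapper.writeValueAsString(bean);
        System.out.println(json); // contains both "name" and "nickname"
    }
}
```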

Ok: but given that any explicit handling of these extensions would require new code (after all, how could you write code for handling things you don't even recognize?), how does this help?

It helps because this way extended data can be passed through to systems that may know how to handle it. So we will avoid both

  • failing on unrecognized content, or worse
  • dropping unrecognized content silently

both of which would have nasty consequences; the latter is especially bad, as it can essentially corrupt data. And as data is typically much more valuable than code, one should rather take a processing error (which indicates a problem to fix) than data corruption, which will only bite later and cause much more pain.

6. Jackson Data Binding: handling of content as "untyped"

There is actually another way to deal with unknown content, in cases where extension points are defined in advance: one can use loose types like:

  • java.lang.Object: all JSON constructs can be mapped to a matching "native" Java type: JSON booleans become java.lang.Boolean, JSON Strings become Java Strings, JSON Arrays become Java Lists, and JSON Objects become Java Maps
  • org.codehaus.jackson.JsonNode: all JSON constructs can also be expressed as JSON trees, as per Jackson's Tree Model; this can also be considered "untyped" binding

With such declarations it is possible to map a JSON property to a Java object without knowing anything about the structure of its type. The main downside is that you still need to know which specific properties are of such unknown (or loose) type. Loose typing has other uses as well; it is commonly used for multi-step type conversions, where the type may only be known at a later point (possibly as a result of data made available by the initial data binding; or by a later processing stage that takes initially bound data as its input).
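As a sketch (the Envelope and Point classes are invented for illustration), a property declared as java.lang.Object accepts any JSON structure, which can then be converted to a concrete type in a second step once the type is known:

```java
import java.util.Map;

import org.codehaus.jackson.map.ObjectMapper;

public class LooseTyping {
    // "payload" is a pre-defined extension point with no fixed structure
    public static class Envelope {
        public String type;
        public Object payload; // bound as Map/List/String/Number/Boolean as needed
    }

    public static class Point {
        public int x, y;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        Envelope env = mapper.readValue(
                "{\"type\":\"point\",\"payload\":{\"x\":1,\"y\":2}}", Envelope.class);
        // JSON Object was bound as a Map; now that "type" tells us what it is,
        // convert to the concrete type
        if ("point".equals(env.type)) {
            Point p = mapper.convertValue(env.payload, Point.class);
            System.out.println(p.x + "," + p.y);
        }
    }
}
```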

7. Other related work

More generally, an approach taken by many processing libraries is to allow filtering out content that is not recognized. One example is StaxMate (an XML processing library), which implements cursors that only show content that is asked for: you get what you want to get (or "see what you want to see"), and anything else is ignored. This works reasonably well for read-only use cases.

Other data binding frameworks also support similar notions of foreign content: JAXB, for example, has the @XmlAnyElement annotation to denote content of unknown type, as well as the ability to bind sub-trees into generic "untyped" content (expressed as type-agnostic DOM elements).


UPDATE, 05-October-2010: I just learned that the 1.6.0 release contained a small but annoying bug: if an @JsonAnyGetter method had a valid getter signature (getXxx() with no arguments), properties would be duplicated in output. This is fixed for 1.6.1; but if you are using 1.6.0, make sure the method name is not of form 'getXxx()' to avoid problems.
