Tuesday, April 10, 2012

What me like YAML? (Confessions of a JSON advocate)

Ok. I have to admit that I learned something new and gained a bit more respect for the YAML data format recently, when working on the proof-of-concept for YAML-on-Jackson (jackson-dataformat-yaml; more on this in yet another Jackson 2.0 article, soon).
And since it would be intellectually dishonest not to mention that my formerly negative view on YAML has brightened up a notch, here's my write-up on this bit of enlightenment.

1. Bad First Impressions Stick

My first look at YAML via its definition basically made my stomach turn. It just looked so much like a bad American ice cream: "Too Much of Everything" -- hey, if it isn't enough to have chocolate, banana and walnut, let's throw in a bit of caramel, root beer essence and a touch of balsamic vinegar; along with a bit of organic arugula to spice things up! That isn't the official motto, I thought, but it might as well be. If there is an O'Reilly book on YAML, it surely must have a platypus as the cover animal.

That was my thinking up until a few weeks ago.

2. Tale of the Two Goals

I have read most of the YAML specification (which is not badly written at all) multiple times, as well as shorter descriptions. My overall conclusion has always been that there are multiple high-level design decisions I disagree with, and that these can mostly be summarized as: it tries to do too many things, and tries to solve multiple conflicting use cases.

But recently, when working on adding YAML support as a Jackson module (based on the nice SnakeYAML library -- a solid piece of code, very unlike most parsers/generators I have seen), I realized that fundamentally there are just two conflicting goals:

  1. Define a Wiki-style markup for data (assuming it is easier to not only write prose in, but also data)
  2. Create a straight-forward Object serialization data format

(it is worth noting that these goals are orthogonal, functionality-wise; but they conflict at the level of syntax and visual appearance, and complicate handling significantly, mostly because there is always "more than one way to do it" (the Perl motto!))

I still think that one could solve the problem better by defining two formats, not one: the first a Wiki dialect; the second a clean data format.
But this led me to think about something: what if those weird Wiki-style aspects were removed from YAML? Would I still dislike the format?

And I came to the conclusion that no, I would not dislike it. In fact, I might like it. A lot.

Why? Let's see which things I like in YAML; things that JSON does not have, but really, really should have in an ideal world.

3. Things that YAML has and JSON should have

Here's the quick rundown:

  1. Comments: oh lord, what kind of textual data format does NOT have comments? JSON is the only one I know of; and even it had them before the spec was finalized. I can only imagine that a brain fart of colossal proportions caused them to be removed from the spec...
  2. (optional) Document start and end markers ("---" header, "..." footer). This is such a nice thing to have, both for format auto-detection purposes and for framing of data feeds. It's a bit of a no-brainer; but suspiciously, JSON has nothing of the sort (XML does have the XML declaration which _almost_ works well, but not quite; but I digress)
  3. Type tags for type metadata: in YAML, one can add optional type tags to further indicate the type of an Object (or of any value, actually). This is such an essential thing to have; with JSON one must use in-band constructs that can conflict with data. XML at least has attributes ("xsi:type").
  4. Aliases/anchors for Object Identity (aka "id / idref"): although data is data, not objects with identity, having a means to optionally pass identity information is very, very useful. Here too XML has some support (having attributes for metadata is convenient); JSON has nada.

The common theme with the above is that all the extra information is optional; but if used, it is included discreetly and can be used as appropriate by encoders and decoders, with or without language- or platform-specific resolution mechanisms.
And I think YAML actually handles these things pretty well: it is neither over- nor under-engineered with respect to these features. This is a surprisingly delicate balance, and very well chosen. I have seen over-complicated data formats (at Amazon, for example) that didn't know where to stop; and we can see how JSON stopped short of even the most rudimentary things (... comments). Interestingly, XML almost sort-of has these features; but they come about through extra constructs (xsi:type via XML Schema), or as side effects of otherwise quirky features (element/attribute separation).

Having had to implement equivalent functionality on top of simplistic JSON constructs ("add yet another meta-property, in-line with actual data; allow a way to configure it to reduce conflicts"), I envy having these constructs as first-level concepts: convenient little additions that allow proper separation of data and metadata (type, object id, comments).
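To make these concrete, here is a small YAML fragment (a hypothetical document of my own, purely for illustration) that uses all four constructs: a comment, document markers, a type tag, and an anchor/alias pair:

```yaml
# comments are just part of the syntax       (1)
---                                        # (2) document start marker
invoice: !!map                             # (3) explicit type tag (usually implied)
  customer: &cust1                         # (4) anchor: defines identity
    name: Bob
  shipTo: *cust1                           # (4) alias: refers back to the anchor
...                                        # (2) document end marker
```

Note how none of the metadata intrudes on the data itself: strip the comment, markers, tag and anchor/alias, and a plain nested mapping remains.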

4. Uses for YAML

Still, having solved or worked around all of the above problems -- Jackson 1.5 added full support for polymorphic types ("type tags"); 2.0 finally added Object Identity ("alias/anchor"); and use of linefeeds for framing can substitute for document boundaries -- I do not have a compelling case for using YAML for data transfer. It's almost a pity: I have come to realize that YAML could have been a great data format (it is also old enough to have challenged the popularity of JSON; both seem to have been conceived at about the same time). As is, it is almost one.

Somewhat ironically, then, maybe the Wiki features are acceptable for the other main use case: that of configuration files. This is the use case I have for YAML, and the main reason for writing the compatibility module (inspired by libs/frameworks like DropWizard which use YAML as the main config file format).

Monday, April 09, 2012

Data format auto-detection with Jackson (JSON, XML, Smile, YAML)

There is one fairly advanced feature of Jackson that has been around a while (since version 1.8), but that has not really been publicized a lot: data format auto-detection. Let's see how it works, and what it could be used for.

1. Format detection?

By format detection I mean the ability to figure out the most likely data format of a piece of content. Auto-detection means that a piece of code can try to deduce this automatically, given a set of data formats to recognize and an accessor for the content.

Jackson 1.8 added such a capability to Jackson, by adding one new method in the JsonFactory abstract class:

  public MatchStrength hasFormat(InputAccessor acc)

as well as a couple of supporting classes; and most importantly, a helper class, DataFormatDetector, that coordinates calls to produce a somewhat convenient mini-API for format auto-detection.

2. Show Me Some Code!

Let's start with a simple demonstration, with known content that should be either JSON or XML:

  JsonFactory jsonF = new JsonFactory();
  // from com.fasterxml.jackson.dataformat.xml (jackson-dataformat-xml):
  XmlFactory xmlF = new XmlFactory();
  // note: ordering is important; first one that gives full match is chosen:
  DataFormatDetector det = new DataFormatDetector(new JsonFactory[] { jsonF, xmlF });
  // let's accept about any match; but only if no "solid match" is found:
  det = det.withMinimalMatch(MatchStrength.WEAK_MATCH).withOptimalMatch(MatchStrength.SOLID_MATCH);
  // then see what we get:
  DataFormatMatcher match = det.findFormat("{ \"name\" : \"Bob\" }".getBytes("UTF-8"));
  assertEquals(jsonF.getFormatName(), match.getMatchedFormatName());
  match = det.findFormat("<?xml version='1.0'?><root/>".getBytes("UTF-8"));
  assertEquals(xmlF.getFormatName(), match.getMatchedFormatName());
  // or, with content that matches neither:
  match = det.findFormat("neither really...".getBytes("UTF-8"));

which is useful if we want to display information; but perhaps even more usefully, we can conveniently process the data.
So let's assume we have a file "data", with content in either XML or JSON:

  // note: findFormat() can take either byte[] or InputStream
  match = det.findFormat(new FileInputStream("data"));
  JsonParser p = match.createParserWithMatch();
  // or, if we wanted the factory itself: JsonFactory matchedFactory = match.getMatch();
  ObjectMapper mapper = new ObjectMapper();
  User user = mapper.readValue(p, User.class);

Basically you can let DataFormatMatcher construct a parser for the matched format (note: some data formats require a specific kind of ObjectMapper to be used).

3. Works on... ?

Basically, any format for which there is a JsonFactory that properly implements the "hasFormat()" method can be auto-detected.

Currently (Jackson 2.0.0) this includes the following data formats:

  1. JSON -- can detect standards-compliant data (root-level JSON Object or Array); and, to some degree, other variants (scalar values at root level)
  2. Smile -- reliably detected, especially when the standard header is written (enabled by default)
  3. XML -- reliably detected either from the XML declaration, or from the first tag, PI or comment
  4. YAML -- the experimental Jackson YAML module can detect the document start marker ("---") for reliable detection; otherwise inconclusive

One existing data format for which auto-detection does not yet work is CSV: this is mostly due to an inherent lack of any kind of header. However, some heuristic support will likely be added soon.

4. Most useful for?

This feature was originally implemented to allow automatic detection and parsing of content that would be in either JSON, or the binary JSON (Smile) representation. For this use case, things work reliably and efficiently.

But fortunately the system was designed to be pluggable, so it should actually work for a variety of other cases. Ideally this should nicely complement the "universal data adapter" goal of the Jackson project: you could usually simply feed in a data file, and as long as it is in one of the supported formats, things would Just Work.

5. Caveats

Some things to note:

  1. The order of factories used for constructing a DataFormatDetector matters: the first one that provides an optimal match is taken; and if no optimal match is found, the first of otherwise equal acceptable matches is chosen
  2. Some data formats require a specific ObjectMapper implementation (sub-class) to be used: for those formats, automatic parser creation needs to be coupled with choosing the right mapper (this may be improved in future)

Saturday, April 07, 2012

Java Type Erasure not a Total Loss -- use Java Classmate for resolving generic signatures

As I have written before ("Why 'java.lang.reflect.Type' Just Does Not Cut It"), Java's Type Erasure can be a royal PITA.

But things are actually not quite as bleak as one might think. Let's start with an actual, somewhat unsolvable problem; and then proceed to another important, similar, yet solvable problem.

1. Actual Unsolvable Problem: java.util Collections

Here is a piece of code that illustrates a problem that most Java developers either understand, or think they understand:

  Map<String,Integer> stringsToInts = new HashMap<String,Integer>();
  Map<byte[],Boolean> bytesToBools = new HashMap<byte[],Boolean>();
  assertSame(stringsToInts.getClass(), bytesToBools.getClass());

The problem is that although conceptually the two collections seem to act differently, at runtime they are instances of the very same class (Java does not generate new classes for genericized types, unlike C++).

So while the compiler helps keep typing straight, there is little runtime help to either enforce this, or to allow other code to deduce the expected type; there just isn't any difference from a type perspective.

2. All Lost? Not at all

But let's look at another example, starting with a simple interface:

public interface Callable<IN, OUT> {
    public OUT call(IN argument);
}

do you think the following is also true?

public void compare(Callable<?,?> callable1, Callable<?,?> callable2) {
    assertSame(callable1.getClass(), callable2.getClass());
}

Nope. Not necessarily; the classes may well be different. WTH?

The difference here is that since Callable is an interface (and you can not instantiate an interface), instances must be of some other type; and there is a good chance those types differ.
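This is easy to verify with plain JDK code; here is a small sketch (class and variable names are mine, not from any library):

```java
public class DistinctClassDemo {
    interface Callable<IN, OUT> {
        OUT call(IN argument);
    }

    public static void main(String[] args) {
        // Two implementations with the very same parameterization...
        Callable<String, Integer> c1 = new Callable<String, Integer>() {
            public Integer call(String s) { return s.length(); }
        };
        Callable<String, Integer> c2 = new Callable<String, Integer>() {
            public Integer call(String s) { return s.length(); }
        };
        // ...are still instances of two distinct (anonymous) classes:
        System.out.println(c1.getClass() == c2.getClass()); // prints "false"
    }
}
```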

But more importantly, if you use the Java ClassMate library (more on this in just a bit), you can even figure out the parameterization (unlike with the earlier example, where all you could see is that the parameters are "a subtype of java.lang.Object"). So for example we can do:

  // Assume 'callable1' was of type:
  // class MyStringToIntList implements Callable<String, List<Integer>> { ... }
  TypeResolver resolver = new TypeResolver();
  ResolvedType type = resolver.resolve(callable1.getClass());
  List<ResolvedType> params = type.typeParametersFor(Callable.class);
  // so we know it has 2 parameters; from above, 'String' and 'List<Integer>'
  assertEquals(2, params.size());
  assertSame(String.class, params.get(0).getErasedType());
  // and the second type is generic itself; in this case we can access it directly
  ResolvedType resultType = params.get(1);
  assertSame(List.class, resultType.getErasedType());
  List<ResolvedType> listParams = resultType.getTypeParameters();
  assertSame(Integer.class, listParams.get(0).getErasedType());
  // or, just to see the types visually, try:
  String desc = type.getSignature(); // or 'getFullDescription'

How is THIS possible? (fun exercise: pick 5 of your favorite Java experts; ask them if the above is possible; observe how most of them say "nope, not a chance" :-) )

3. Long live generics -- hidden deep, deep within

Basically, generic type information is actually stored in class definitions, in 3 places:

  1. When defining parent type information ("super type"); parameterization for base class and base interface(s) if any
  2. For generic field declarations
  3. For generic method declarations (return, parameter and exception types)

It is in the first place that ClassMate finds its stuff. When resolving a Class, it will traverse the inheritance hierarchy, recomposing type parameterizations. This is a rather involved process, mostly due to type aliasing, the ability of interfaces to use different signatures, and so on. In fact, trying to do this manually looks feasible at first; but once you work through all the wildcarding, you will soon realize why having a library do it for you is a nice thing...
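The second and third places, incidentally, are visible even via plain JDK reflection; this little demo (my own, not part of ClassMate) shows generic field and method signatures surviving in the class file:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;

public class DeclaredGenericsDemo {
    public List<String> names;                            // generic field declaration
    public Map<String, Integer> counts() { return null; } // generic return type

    public static void main(String[] args) throws Exception {
        Field f = DeclaredGenericsDemo.class.getField("names");
        // the full generic signature is recorded in the class file:
        System.out.println(f.getGenericType());       // java.util.List<java.lang.String>
        Method m = DeclaredGenericsDemo.class.getMethod("counts");
        System.out.println(m.getGenericReturnType()); // java.util.Map<java.lang.String, java.lang.Integer>
    }
}
```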

So the important thing to learn is this: to retain run-time generic type information, you MUST pass concrete sub-types which resolve generic types via inheritance.
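The same point can be seen with plain JDK reflection: an anonymous subclass records its super type's parameterization in the class file, while a plain instance does not (demo names are mine):

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

public class SuperTypeDemo {
    public static void main(String[] args) {
        // plain ArrayList<String>: the parameterization is erased, nothing to see
        ArrayList<String> plain = new ArrayList<String>();
        System.out.println(plain.getClass().getGenericSuperclass()); // java.util.AbstractList<E>

        // anonymous subclass: 'String' is baked into the super type declaration
        ArrayList<String> sub = new ArrayList<String>() { };
        Type sup = sub.getClass().getGenericSuperclass();
        System.out.println(sup); // java.util.ArrayList<java.lang.String>
        ParameterizedType pt = (ParameterizedType) sup;
        System.out.println(pt.getActualTypeArguments()[0]); // class java.lang.String
    }
}
```

This is exactly the information ClassMate traverses and recomposes across the whole hierarchy.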

And this is where JDK collection types bring in the problem (wrt this particular issue): concrete types like ArrayList still take generic parameters; and this is why runtime instances do not have generic type information available.

Another way to put this is that when using a subtype, say:

  List<String> list = new ArrayList<String>() { };
  // can use ClassMate now, a la:
  ResolvedType type = resolver.resolve(list.getClass());
  // the type itself has no parameterization (concrete non-generic class);
  // but it does implement List, so:
  List<ResolvedType> params = type.typeParametersFor(List.class);
  assertSame(String.class, params.get(0).getErasedType());

which once again retains a usable amount of generic type information.

4. Real world usage?

The above might seem like an academic exercise; but it is not. When designing typed APIs, many callbacks would actually benefit from proper generic typing. And of special interest are callbacks or handlers that need to do type conversions.

As an example, my favorite database access library, jDBI, makes use of this functionality (using an embedded ClassMate) to figure out data-binding information without requiring an extra Class argument. That is, you could pass something like (not an actual code sample):

  MyPojo value = dbThingamabob.query(queryString, handler);

instead of what would more commonly be required:

  MyPojo value = dbThingamabob.query(queryString, handler, MyPojo.class);

and the framework could still figure out what kind of thing 'handler' would handle, assuming it is a generic interface the caller has to implement.

The difference may seem minute, but this can actually help a lot by simplifying some aspects of type passing, and by removing one particular mode of error.

5. More on ClassMate

The above barely scratches the surface of what ClassMate provides. Although it is already tricky to find the "simple" parameterization for main-level classes, there are much trickier things: specifically, resolving the types of Fields and Methods (return types, parameters). Given classes like:

  public interface Base<T> {
    public T getStuff();
  }
  public class ListBase<T> implements Base<List<T>> {
    protected List<T> value;
    protected ListBase(List<T> v) { value = v; }
    public List<T> getStuff() { return value; }
  }
  public class Actual extends ListBase<String> {
    public Actual(List<String> value) { super(value); }
  }

you might be interested in figuring out exactly what the return type of "getStuff()" is. By eyeballing, you know it should be "List<String>", but the bytecode does not tell you this directly -- in fact, all it basically says is that the declared return type involves the type variable T.

But with ClassMate you can resolve it:

  // start with a ResolvedType; we also need a MemberResolver
  ResolvedType classType = resolver.resolve(Actual.class);
  MemberResolver mr = new MemberResolver(resolver);
  ResolvedTypeWithMembers beanDesc = mr.resolve(classType, null, null);
  ResolvedMethod[] members = beanDesc.getMemberMethods();
  ResolvedType returnType = null;
  for (ResolvedMethod m : members) {
    if ("getStuff".equals(m.getName())) {
      returnType = m.getReturnType();
    }
  }
  // so, we should get:
  assertSame(List.class, returnType.getErasedType());
  ResolvedType elemType = returnType.getTypeParameters().get(0);
  assertSame(String.class, elemType.getErasedType());

and get the information you need.

6. Why so complicated for nested types?

One thing that is obvious from the code samples is that code using ClassMate is not as simple as one might hope. Handling of nested generic types, specifically, is a bit verbose in some cases (specifically: when the type we are resolving does not directly implement the type we are interested in).
Why is that?

The reason is that there is a wide variety of interfaces that any class can (and often does) implement. Further, parameterizations may vary at different levels, due to co-variance (the ability to override methods with more refined return types). This means that it is not practical to "just resolve it all" -- and even if this were done, it would not in general be obvious what the "main type" was. For these reasons, you need to explicitly request the parameterization for specific generic classes and interfaces as you traverse the type hierarchy: there is no other way to do it.

Friday, April 06, 2012

Take your JSON processing to Mach 3 with Jackson 2.0, Afterburner

(this is part of the on-going "Jackson 2.0" series, starting with "Jackson 2.0 released")

1. Performance overhead of databinding

When using the automatic data-binding Jackson offers, there is some amount of overhead compared to manually written equivalent code that uses the Jackson streaming/incremental parser and generator. But how much overhead is there? The answer depends on multiple factors, including exactly how good your hand-written code is (there are a few non-obvious ways to optimize things, compared to data-binding, where there is little performance-related configurability).

But looking at benchmarks such as jvm-serializers, one can estimate that it may take anywhere between 35% and 50% more time to serialize and deserialize POJOs, compared to a highly tuned hand-written alternative. This is usually not enough to matter a lot, considering that JSON processing overhead is typically only a small portion of all the processing done.

2. Where does the overhead come from?

There are multiple things that automatic data-binding has to do that hand-written alternatives do not. But at a high level, there are really two main areas:

  1. Configurability needed to produce/consume alternative representations: code that has to support multiple ways of doing things can not be as aggressively optimized by the JVM, and may need to keep more state around
  2. Data access on POJOs is done dynamically using Reflection, instead of directly accessing field values or calling setters/getters

While there isn't much that can be done about the former, in a general sense (especially since configurability and convenience are major reasons for the popularity of data-binding), the latter overhead is something that could theoretically be eliminated.

How? By generating bytecode that does direct access to fields and calls to getters/setters (as well as for constructing new instances).

3. Project Afterburner

And this is where Project Afterburner comes in. What it does is really as simple as generating bytecode, dynamically, to mostly eliminate Reflection overhead. The implementation uses the well-known lightweight bytecode library called ASM.

Byte code is generated to:

  1. Replace "Class.newInstance()" calls with equivalent call to zero-argument constructor (currently same is not done for multi-argument Creator methods)
  2. Replace Reflection-based field access (Field.set() / Field.get()) with equivalent field dereferencing
  3. Replace Reflection-based method calls (Method.invoke(...)) with equivalent direct calls
  4. For small subset of simple types (int, long, String, boolean), further streamline handling of serializers/deserializers to avoid auto-boxing

It is worth noting that there are certain limitations to access: for example, unlike with Reflection, it is not possible to bypass visibility checks; this means that access to private fields and methods must still be done using Reflection.
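To illustrate what point (3) above replaces, here is a tiny JDK-only comparison (my own sketch, not Afterburner's actual generated code): the Reflection-based call is what vanilla data-binding does, and the direct call is what the generated bytecode effectively amounts to:

```java
import java.lang.reflect.Method;

public class AccessDemo {
    public static class Value {
        private String name = "Bob";
        public String getName() { return name; }
    }

    public static void main(String[] args) throws Exception {
        Value v = new Value();
        // Reflection-based access, as used by vanilla data-binding:
        Method getter = Value.class.getMethod("getName");
        String viaReflection = (String) getter.invoke(v);
        // Direct call, as emitted by generated bytecode (or hand-written code):
        String direct = v.getName();
        System.out.println(viaReflection + " " + direct); // prints "Bob Bob"
    }
}
```

Both produce the same value; the difference is purely in invocation cost (argument boxing, access checks, dynamic dispatch through Method).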

4. Engage the Afterburner!

Using Afterburner is about as easy as it can be: you just create and register a module, and then use databinding as usual:

ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new AfterburnerModule());
String json = mapper.writeValueAsString(value);
Value value = mapper.readValue(json, Value.class);

absolutely nothing special there (note: for Maven dependency, downloads, go see the project page).

5. How much faster?

Earlier I mentioned that Reflection is just one of the overhead areas. In addition to general complexity stemming from configurability, there are cases where generic data-binding has to use simple loops where manually written code can use linear constructs. Given this, how much overhead remains after enabling Afterburner?

As per jvm-serializers, more than 50% of the speed difference between data-binding and the manual variant is eliminated. That is, data binding with Afterburner is closer to the manual variant than to "vanilla" data binding. There is still something like 20-25% additional time spent, compared to the most optimized cases; but the results are definitely closer to optimal.

Given that all you really have to do is add the module, register it, and see what happens, it just might make sense to take Afterburner for a test ride.

6. Disclaimer

While Afterburner has been used by a few Jackson users, it is still not very widely used -- after all, while it has been available since 1.8, in some form, it has not been advertised to users. This article can be considered an announcement of sorts.

Because of this, there may be rough edges; and if you are unlucky you might find one of two possible problems:

  • Get no performance improvement (which is likely due to Afterburner not covering some specific code path(s)), or
  • Get a bytecode verification problem when a serializer/deserializer is being loaded

the latter case obviously being nastier. But on the plus side, such a problem should be obvious right away (and NOT only after running for an hour); nor should there be a way for it to cause data loss or corruption, since JVMs are rather good at verifying bytecode when loading it.

Notes on upgrading Jackson from 1.9 to 2.0

If you have existing code that uses Jackson 1.x and would like to upgrade to 2.0, there isn't much documentation around yet; although the Jackson 2.0 release page does outline all the major changes that were made.

So let's try to see what kind of steps are typically needed (note: this is based on Jackson 2.0 upgrade experiences by @pamonrails -- thanks Pierre!)

0. Pre-requisite: start with 1.9

At this point, I assume the code to upgrade works with Jackson 1.9, and does not use any deprecated interfaces (many methods and some classes were deprecated during the course of 1.x; all deprecated things went away with 2.0). So if your code uses an older 1.x version, the first step is usually to upgrade to 1.9, as this simplifies later steps.

1. Update Maven / JAR dependencies

The first thing to do is to upgrade the jars. Depending on your build system, you can either get jars from the Jackson Download page, or update Maven dependencies. The new Maven dependency is:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.0.0</version>
</dependency>

The main thing to note is that instead of 2 jars ("core", "mapper"), there are now 3: the former "core" has been split into a separate "annotations" package and the remaining "core", which contains the streaming/incremental parser/generator components. And "databind" is a direct replacement for the "mapper" jar.

Similarly, you will need to update dependencies to supporting jars like:

  • Mr Bean: com.fasterxml.jackson.module / jackson-module-mrbean
  • Smile binary JSON format: com.fasterxml.jackson.dataformat / jackson-dataformat-smile
  • JAX-RS JSON provider: com.fasterxml.jackson.jaxrs / jackson-jaxrs-json-provider
  • JAXB annotation support ("xc"): com.fasterxml.jackson.module / jackson-module-jaxb-annotations

these, and many, many more extension modules, have their own project pages under the FasterXML Git repo.

2. Import statements

Since Jackson 2.0 code lives in new Java packages (com.fasterxml.jackson instead of org.codehaus.jackson), you will need to change import statements. Although most changes are mechanical, there isn't a strict set of mappings.

The way I have done this is to simply use an IDE like Eclipse: remove all invalid import statements, then use Eclipse functionality to find the new packages. Typical import changes include:

  • Core types: org.codehaus.jackson.JsonFactory/JsonParser/JsonGenerator -> com.fasterxml.jackson.core.JsonFactory/JsonParser/JsonGenerator
  • Databind types: org.codehaus.jackson.map.ObjectMapper -> com.fasterxml.jackson.databind.ObjectMapper
  • Standard annotations: org.codehaus.jackson.annotate.JsonProperty -> com.fasterxml.jackson.annotation.JsonProperty

It is often convenient to just use wildcard imports for the main categories (com.fasterxml.jackson.core.*, com.fasterxml.jackson.databind.*, com.fasterxml.jackson.annotation.*)

3. SerializationConfig.Feature, DeserializationConfig.Feature

The next biggest change was the refactoring of on/off Features, formerly defined as inner Enums of the SerializationConfig and DeserializationConfig classes. For 2.0, these were moved to separate stand-alone enums:

  1. DeserializationFeature contains most of entries from former DeserializationConfig.Feature
  2. SerializationFeature contains most of entries from former SerializationConfig.Feature

Entries that were NOT moved along are the ones that were shared by both; these were instead added to the new MapperFeature enumeration, for example:

  • SerializationConfig.Feature.DEFAULT_VIEW_INCLUSION became MapperFeature.DEFAULT_VIEW_INCLUSION
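In practice the change is mostly mechanical; here is a sketch of 2.0-style configuration (using feature constants I believe exist in 2.0, such as SerializationFeature.INDENT_OUTPUT; check your own feature names against the new enums):

```java
ObjectMapper mapper = new ObjectMapper();
// 1.x: mapper.configure(SerializationConfig.Feature.INDENT_OUTPUT, true);
mapper.enable(SerializationFeature.INDENT_OUTPUT);
// 1.x: mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
// shared settings moved to MapperFeature:
mapper.enable(MapperFeature.DEFAULT_VIEW_INCLUSION);
```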

4. Tree model method name changes (JsonNode)

Although many methods (and some classes) were renamed here and there, mostly these were one-offs. But one area where major naming changes were made was the Tree Model -- this because the 1.x names were found to be rather unwieldy and unnecessarily verbose. So we decided that it would make sense to do a "big bang" name change with 2.0, to get to a clean(er) baseline.

Changes made were mostly of following types:

  • getXxxValue() changed to xxxValue(): getTextValue() -> textValue(), getFieldNames() -> fieldNames(), and so on
  • getXxxAsYyy() changed to asYyy(): getValueAsText() -> asText()

5. Miscellaneous

Some classes were removed:

  • CustomSerializerFactory, CustomDeserializerFactory: should instead use Module (like SimpleModule) for adding custom serializers, deserializers

6. What else?

This is definitely an incomplete list. Please let me know what I missed, when you try upgrading!


About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.