Saturday, December 10, 2011

Sorting large data sets in Java using Java-merge-sort

When sorting data sets in Java, life is easy if the amount of data to process is not huge: the JDK has basic sorting covered well. But if your data is too big to fit in memory, you are on your own.

This often means that developers use the basic Unix 'sort' command-line tool. But while it is a good package for basic textual sorts -- and, when combined with other Unix pipeline tools, for a whole range of column-based alternatives -- it is limited in two sometimes crucial aspects:

  1. Defining custom sorting (collation) order is difficult
  2. Interacting with external tools (including 'sort') from within JVM is inherently difficult

But there is one less well-known alternative available: a relatively new Java Open Source library available from Github: java-merge-sort.

1. What is java-merge-sort

The java-merge-sort library implements a basic external merge sort, the sorting algorithm typically used for disk-backed sorting. Input and output are not limited to files; any java.io.InputStream / java.io.OutputStream implementation will work just fine.

Sorting library is designed to work as an ad-hoc tool (in fact, Jar itself can be used as 'sort' tool) as well as a component of bigger data processing systems.

Notable features include:

  • Fully customizable input and output handlers, used for reading external data into objects to be sorted and writing them back out (handlers defined by providing factories that create instances)
  • Optional custom comparators (if items read do not implement Comparable)
  • Configurable merge factor (number of inputs merged in each pass); max memory usage (which limits length of pre-sort segments -- more memory used, fewer rounds needed)
  • Configurable temporary file handling (defaults to using JDK default temp files, deletions)
  • Ability to cancel sorting jobs asynchronously
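To make the external merge sort idea concrete, here is a minimal sketch using only the JDK. It is not the library's implementation: the class name ExternalSortSketch and its methods are made up for illustration, memory is limited by line count rather than bytes, and all segments are merged in a single pass instead of honoring a merge factor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of java-merge-sort.
final class ExternalSortSketch {
    /** One open segment file plus the line currently at its head. */
    private static final class Segment implements Comparable<Segment> {
        final BufferedReader reader;
        String head;

        Segment(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Segment other) {
            return head.compareTo(other.head);
        }
    }

    /** Sorts lines while holding at most maxLinesInMemory lines in memory during pre-sort. */
    static void sort(BufferedReader in, Writer out, int maxLinesInMemory) throws IOException {
        // Phase 1 (pre-sort): sort bounded chunks in memory, spill each to a temp file.
        List<Path> segments = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() >= maxLinesInMemory) {
                segments.add(spill(buffer));
            }
        }
        if (!buffer.isEmpty()) {
            segments.add(spill(buffer));
        }
        // Phase 2 (merge): repeatedly emit the smallest head line among all segments.
        PriorityQueue<Segment> heap = new PriorityQueue<>();
        for (Path file : segments) {
            heap.add(new Segment(file)); // segments are never empty, so head is non-null
        }
        while (!heap.isEmpty()) {
            Segment smallest = heap.poll();
            out.write(smallest.head);
            out.write('\n');
            if (smallest.advance()) {
                heap.add(smallest);
            } else {
                smallest.reader.close();
            }
        }
        out.flush();
        for (Path file : segments) {
            Files.deleteIfExists(file);
        }
    }

    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("sort-segment-", ".txt");
        Files.write(file, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return file;
    }
}
```

A real implementation would also clean up temp files on failure and merge in several rounds when there are more segments than the merge factor allows.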

2. Using as command-line tool

A simple way to use the library is as a stand-alone command tool; while there are no specific benefits over standard 'sort' command (assuming one is available), it can be used to test functionality. Usage is as simple as:

  java -jar java-merge-sort-[VERSION].jar [input-file]

where 'input-file' is optional (if it is missing, will read from standard input); and sorted output will be displayed to standard output.
Commonly one will then redirect output to a file:

  java -jar java-merge-sort-[VERSION].jar unsorted.txt > sorted.txt

Under the hood, this will run code from class com.fasterxml.sort.std.TextFileSorter, which is both a concrete sorter implementation and defines a main() method to act as a command-line tool.
Sort will be done line-by-line, using basic lexicographic (~= alphabetic) sort which works for common encodings like ASCII, Latin-1 and UTF-8.
Command will limit memory usage to 50% of maximum heap.

3. Simple programmatic usage: textual file sort

More commonly java-merge-sort is used as a component of bigger processing system. So let's have a look at basic usage as 'sort' replacement, i.e. sorting text files.

Code to sort an input file into output file is:

  public void sort(File in, File out) throws IOException
  {
    TextFileSorter sorter = new TextFileSorter(new SortConfig().withMaxMemoryUsage(20 * 1000 * 1000)); // use up to 20 megs
    sorter.sort(new FileInputStream(in), new FileOutputStream(out));
    // note: sort() will close InputStream, OutputStream after sorting
  }

which uses default configuration except for maximum memory usage (default is 40 megs: which often works just fine)

4. Advanced usage: sort JSON files

Above example showed one benefit -- easy integration from Java code -- but the real power comes from the fact that we can change input and output handlers to deal with all kinds of data, to support advanced sorting behavior. To demonstrate this, let's consider case where input is a file that contains JSON entries: each line contains a JSON Object like:

{ "firstName" : "Joe", "lastName" : "Plumber", "age":58 }

and which we want to sort primarily by age, from lowest to highest, and then by name, alphabetically: first by last name, then by first name.
We can bind this to a Java class like:


  public class Person implements Comparable<Person>
  {
    public int age;
    public String firstName, lastName;

    public int compareTo(Person other) {
     int diff = age - other.age;
     if (diff == 0) {
      diff = lastName.compareTo(other.lastName);
      if (diff == 0) {
       diff = firstName.compareTo(other.firstName);
      }
     }
     return diff;
    }
  }

using Jackson JSON processor, and then sort entries using java-merge-sort.
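The same ordering can also be written as a standalone comparator. The sketch below assumes Java 8 or later (which postdates this post) and uses a simplified, hypothetical Person class nested inside an illustration class; it is an alternative formulation, not something java-merge-sort requires.

```java
import java.util.Comparator;

// Hypothetical illustration class; Person here is a simplified stand-in.
final class PersonOrderSketch {
    static final class Person {
        final int age;
        final String firstName;
        final String lastName;

        Person(int age, String firstName, String lastName) {
            this.age = age;
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Age ascending, then last name, then first name (same order as compareTo above).
    static final Comparator<Person> BY_AGE_THEN_NAME =
        Comparator.comparingInt((Person p) -> p.age)
            .thenComparing(p -> p.lastName)
            .thenComparing(p -> p.firstName);
}
```

A comparator like this is useful when the sorted type cannot implement Comparable itself, which is the case the "optional custom comparators" feature is meant for.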

Code to do this is a bit more complicated; let's start with the Sorter implementation:


import java.io.*;

import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.type.JavaType;

import com.fasterxml.sort.*;
import com.fasterxml.sort.std.StdComparator;

public class JsonFileSorter extends Sorter<Person>
{
  public JsonFileSorter() throws IOException {
    this(new SortConfig(), new ObjectMapper());
  }

  public JsonFileSorter(SortConfig config, ObjectMapper mapper) throws IOException {
    this(mapper.constructType(Person.class), config, mapper);
  }

  public JsonFileSorter(JavaType entryType, SortConfig config, ObjectMapper mapper) throws IOException {
    super(config, new ReaderFactory(mapper.reader(entryType)),
      new WriterFactory<Person>(mapper),
      new StdComparator<Person>());
  }
}

and supporting reading-related classes are:
public class ReaderFactory extends DataReaderFactory<Person>
{
  private final ObjectReader _reader;
  public ReaderFactory(ObjectReader r) {
    _reader = r;
  }

  @Override
  public DataReader<Person> constructReader(InputStream in) throws IOException {
    MappingIterator<Person> it = _reader.readValues(in);
    return new Reader<Person>(it);
  }
}

public class Reader<E> extends DataReader<E>
{
  protected final MappingIterator<E> _iterator;
 
  public Reader(MappingIterator<E> it) { _iterator = it; }

  @Override
  public E readNext() throws IOException {
    if (_iterator.hasNext()) {
      return _iterator.nextValue();
    }
    return null;
  }

  // not a good estimation, has to do for now (should count String lengths, estimate)
  @Override
  public int estimateSizeInBytes(E item) { return 100; }

  @Override
  public void close() throws IOException { } // auto-closes when we reach end
}


and writing-related classes:
static class WriterFactory<W> extends DataWriterFactory<W>
{
  protected final ObjectMapper _mapper;

  public WriterFactory(ObjectMapper m) {
    _mapper = m;
  }

  @Override
  public DataWriter<W> constructWriter(OutputStream out) throws IOException {
    return new Writer<W>(_mapper, out);
  }
}

static class Writer<E> extends DataWriter<E>
{
  protected final ObjectMapper _mapper;
  protected final JsonGenerator _generator;

  public Writer(ObjectMapper mapper, OutputStream out) throws IOException {
    _mapper = mapper;
    _generator = _mapper.getJsonFactory().createJsonGenerator(out);
  }

  @Override
  public void writeEntry(E item) throws IOException {
    _mapper.writeValue(_generator, item);
    // not 100% necessary, but for readability, add linefeeds
    _generator.writeRaw('\n');
  }

  @Override
  public void close() throws IOException {
    _generator.close();
  }
}

So with all of the above, we could sort a file using:

  JsonFileSorter sorter = new JsonFileSorter();
  sorter.sort(inputFile, outputFile);

Which is pretty much identical to earlier code to sort a File; just with different reader+writer configuration.

5. Even more advanced: compress intermediate files?

There are many ways to customize processing; and one interesting idea is to actually compress intermediate files (results of pre-sort, inputs to later merge rounds); preferably using ultra-fast Java compressor like Ning LZF.

Code to do this would not be long -- it's just matter of changing DataReaderFactory and DataWriterFactory to read/write files -- but I will leave this up as an exercise to reader. :-)
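As a rough idea of the stream wrapping involved, here is a JDK-only sketch. It swaps in GZIP (java.util.zip) for LZF purely because it ships with the JDK, and it does not show how to hook the wrappers into java-merge-sort; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; GZIP stands in for a faster codec such as LZF.
final class CompressedSegmentSketch {
    /** Wraps the raw stream an intermediate segment would be written to. */
    static OutputStream compressing(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw);
    }

    /** Wraps the raw stream an intermediate segment would be read back from. */
    static InputStream decompressing(InputStream raw) throws IOException {
        return new GZIPInputStream(raw);
    }

    /** Writes text through the compressing wrapper and reads it back through the decompressing one. */
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(compressing(stored), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        InputStream in = decompressing(new ByteArrayInputStream(stored.toByteArray()));
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }
}
```

The trade-off is CPU time spent compressing versus disk I/O saved, which is why a very fast codec is the better fit for intermediate files.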

6. More speed: configurations

There are two main configuration switches that can be used to improve speed:

  1. Amount of memory used for pre-sorting: the more memory is used, the fewer sorted segments are needed -- in fact, it may be possible to do the whole sort in memory. Default memory to use is 40 megabytes (to accommodate the default JDK max heap size of 64 megs)
  2. Number of inputs merged per round: default is 16 inputs, which should be enough; but you can increase this to reduce number of merge rounds needed (or reduce if you want to minimize number of open files, in case you encounter problems)
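How the two settings interact can be estimated with some simple arithmetic. The sketch below is idealized: it assumes segments are exactly as large as the memory limit and ignores per-object overhead, so actual numbers from the library will differ. Class and method names are hypothetical.

```java
// Hypothetical illustration class; idealized arithmetic only.
final class MergeRoundsSketch {
    /** Number of pre-sorted segments if each one holds memoryBytes worth of data. */
    static long segmentsNeeded(long dataBytes, long memoryBytes) {
        if (memoryBytes <= 0) {
            throw new IllegalArgumentException("memory limit must be positive");
        }
        return (dataBytes + memoryBytes - 1) / memoryBytes; // ceiling division
    }

    /** Number of merge rounds needed to reduce the given number of segments to one. */
    static int mergeRounds(long segments, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("merge factor must be at least 2");
        }
        int rounds = 0;
        while (segments > 1) {
            segments = (segments + mergeFactor - 1) / mergeFactor; // each round merges up to mergeFactor inputs
            rounds++;
        }
        return rounds;
    }
}
```

Under these assumptions, 1 GB of data with the default 40 MB limit gives 25 segments, which takes 2 rounds at a merge factor of 16; raising the limit to 100 MB gives 10 segments and a single round.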

7. Future ideas

Looking at JSON sorting code, I realize that it would be easy to create a generic sorter that uses Jackson. And not only would this support sorting JSON files, but also files that use any other format Jackson supports, such as Smile (out of the box, with 'SmileFactory'), XML, CSV and BSON!

Saturday, October 22, 2011

On prioritizing my Open Source projects, retrospect #3

(note: continuing story, see the previous installment)

1. What was the plan again?

Ok, it has been almost 8 months since the previous prioritization overview (plan was to check after 4, but time flies when you are having fun!)
High-level priority list back then had these entries:

  1. Aalto 1.0 (complete async API, impl)
  2. ClassMate 1.0
  3. Java CacheMate, ideally 1.0
  4. Tr13 1.0
  5. Externalized Mr Bean (depending on interest)
  6. Jackson 1.8
  7. Jackson-xml-databinding 1.0
  8. Work on Smile format

2. And how have we done?

This time hit rate was even bit lower (than previous one at 50%), although there was some progress. In fact, had I checked things after 4 months, only one entry would have been completed (Jackson 1.8).

Item by item, we have:

  1. Aalto: modest progress (did write a blog entry on how to use async parsing at least); still need async SAX implementation, no 1.0 (although 0.9.7 was released right after blog entry)
  2. ClassMate: minor fixes, but no 1.0 yet
  3. CacheMate: significant progress (secondary indexes); I now have 1.0 design (for "raw" in-memory), but not yet implemented -- so kind of half-done
  4. Tr13: no progress
  5. Externalized mr Bean: no demand, no progress
  6. Jackson: 1.8 released (and even more, see below)
  7. Deferred: Externalized Mr Bean -- no work done (only some preliminary scoping)
  8. Jackson-xml-databinding: bug fixes, but no 1.0
  9. Smile format: actual progress -- Pierre from Ning implemented libsmile (C), contributed Smile-detection for unix/linux 'file' command

So it's mostly modest progress and misses this time; plan was not really aligned with what was needed. Only 3 entries had significant progress.

What went wrong? Partially it's just that huge popularity of Jackson swept away many of the plans; and conversely, lack of interest in many of the entries held them back.
But additionally, many other things got implemented. So let's look at that aspect next.

3. What was done instead?

Here are things I can remember, in loose work order:

  • LZF compression ("Ning LZF") -- much progress, quite close to 1.0
  • Jackson modules, such as Afterburner and improvements to already existing ones (scala, hibernate) -- although not yet for CSV or Joda modules (which exist in skeletal form)
  • JVM-compressor-benchmark for comparing space/time efficiency of various compressors on JVM, core done (can always add codecs)
  • Low-gc-membuffers, an experimental FIFO for byte[], with native memory buffers
  • Java merge sort (file-backed configurable efficient merge sort) -- mostly done, although not declared 1.0
  • Lzf4Hadoop, Hadoop integration for LZF compression -- basically done
  • New mode for JVM-serializers benchmark, data streams, for more balanced evaluations; implemented most common codecs
  • Jackson 1.9

Quite a list eh? One completely new "branch" of development was related to LZF compression codec. And continued huge demand for all things Jackson also meant that the majority of my time was spent on Jackson and its extensions.

4. Updated list

Given recent developments, popular demand, and on-going plans, here is my current thinking of main priorities:

  • Jackson CSV module: I want to add proper Jackson support for CSV, since it is still a very common (and pretty functional!) input data format, and de facto default export format for lots of data sources. And best of all, this can be done without any work on Jackson core
  • CacheMate: I really want to implement secondary caches, and have a reasonable design (in many ways similar to persistence used by Cassandra/BigTable/HBase) on how to go about it
  • Jackson 2.0: move to github, refactor, redesign, remove deprecated things -- major renovation, to lay foundation for longer term 2.x development
  • ClassMate: getting to official 1.0 would be good, as well as writing blog entry or two on actual usage
  • Jackson XML data binding: fix bugs, declare 1.0, easier to market that way. And of course document
  • Ning-compress (LZF) 1.0: already functional, and feature-wise as good as 1.0, but there are couple of optimization tricks (by mr Dain S who ported Snappy to Java) that I'd still like to investigate, before declaring things 1.0

Other interesting things that might get included are:

  • Aalto 1.0: it would be good to sort of declare it done by implementing Async SAX, announcing the first non-beta release
  • Externalized mr Bean (BeanMate?) still looks like a potentially useful thing that others would want to use (this above and beyond basic refactoring that Jackson 2.0 would dictate, i.e. splitting of the jar as first-level new module)
  • Standardization work for Smile?
  • Maybe even design a splittable variant of LZF (Splitty? Splitz?) -- with improved usage of length indicators (VInts), designed so implementation can be even faster than LZF (on par with Snappy java), yet allow splittability which would be very valuable for Map/Reduce tasks

I of course expect the above list to have at most a 50% success rate, and for other good stuff to be worked on instead. This is especially true given likely changes at my daytime job, possibly including a change of role in day-to-day work; such changes will likely boost the priority of some open source efforts and reduce that of others.

Tuesday, October 11, 2011

Jackson 1.9 new feature overview

Jackson 1.9 was just released. As usual, it can be downloaded from the Download page, and detailed release information can be found on the 1.9 release page.

Let's have a look into contents of this release.

1. Overview

One of focus areas on this release was once again to tackle oldest significant issues and improvement ideas; and two of major new features are long-standing issues (ability to inline/unwrap JSON values; unify annotation handling for getters/setters/fields). Another big goal was to improve ergonomics: to simplify configuration, shorten commonly used usage patterns and so on. And finally there was also intent to try to "2.0 proof" things, by trying to figure out things that need to be deprecated to allow removal of obsolete methods as well as indicate cases where improved functionality is available.

2. Major features

(note: classification of features into major, medium and minor categories is not exact science, and different users might consider different things more important than others -- here we simply use categorization that the release page uses)

Major features included in 1.9 are:

  • Allow inlining/unwrapping of child objects using @JsonUnwrapped
  • Rewrite property introspection part of framework to combine getter/setter/field annotations
  • Allow injection of values during deserialization
  • Support for 'external type id' by adding @JsonTypeInfo.As.EXTERNAL_PROPERTY
  • Allow registering instantiators (ValueInstantiator) for types

2.1 @JsonUnwrapped

Ability to map JSON like

  {
    "name" : "home",
    "latitude" : 127,
    "longitude" : 345
  }

to classes defined as:

  class Place {
    public String name;

    @JsonUnwrapped
    public Location location;
  }

  class Location {
    public int latitude, longitude;
  }

has been on many users' wish lists for a while now; and with the addition of @JsonUnwrapped (used as shown above) this simple structural transformation can now be achieved without custom handling.
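To show what "unwrapping" means structurally, here is a small sketch that performs the same transformation on plain Maps. It does not use Jackson at all; the class and method names are hypothetical, and it only illustrates the shape change between the nested Java structure and the flat JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; plain Maps stand in for JSON Objects.
final class UnwrapSketch {
    /** Returns a copy of parent in which the nested object under 'property' is inlined. */
    static Map<String, Object> unwrap(Map<String, Object> parent, String property) {
        Map<String, Object> flat = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : parent.entrySet()) {
            if (entry.getKey().equals(property) && entry.getValue() instanceof Map) {
                // Copy the child's entries directly into the parent level.
                for (Map.Entry<?, ?> child : ((Map<?, ?>) entry.getValue()).entrySet()) {
                    flat.put(String.valueOf(child.getKey()), child.getValue());
                }
            } else {
                flat.put(entry.getKey(), entry.getValue());
            }
        }
        return flat;
    }
}
```

Given a map with "name" and a nested "location" map holding "latitude" and "longitude", unwrapping "location" yields a single-level map with the three keys, matching the JSON shown earlier.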

2.2 "Unified" properties, merging ("sharing") of annotations of getters/setters/fields

Another long-standing issue has been the isolation between annotations used by getters, setters and fields. Basically, an annotation added to a getter was only ever used for serialization, and would never have any effect on deserialization; similarly, a setter annotation never affected serialization. While this is not a problem for many annotation use cases, it would make the following use case work quite differently from what users intuitively expect:

  class Point {
    private int w;

    @JsonProperty("width")
    public int getW() { return w; }
    public void setW(int w) { this.w = w; } // must be separately renamed
  }

which would actually lead to there being two separate properties: "width" that is written out during serialization; and "w" that is expected to be received when deserializing. Many users would intuitively expect annotation to be "shared" between two parts of logically related accessors. Same issue also affects annotations like @JsonIgnore and @JsonTypeInfo, requiring use of seemingly redundant annotations.

Jackson 1.9 solves this by adding new internal representation of logical property, and merging resulting annotations using expected priorities (meaning that annotations on a getter have precedence over setter when serializing, and vice versa).

There are also other more subtle changes, related to these changes. For example, class like:

  class ValueBean {
    private int value;

    @JsonProperty
    public int getValue() { return value; }
  }

can now be deserialized successfully, even without field "value" being visible or annotated: since it is joined with the getter ("getValue()"), and the getter is explicitly annotated, the field is included as the accessor to use for assigning a value to the property.

The last important benefit of this feature is that handling of Jackson and JAXB annotations is now much more similar, which should make JAXB annotations work better as a result (code was simplified significantly) -- this is because JAXB had always considered annotations to be shared in this way.

2.3 Value Injection for Deserialization

Value injection here means ability to insert ("inject") values into POJOs outside of general data binding: that is, values that do not come from JSON input. Instead, values to inject are specified during configuration of ObjectMapper or ObjectReader used for data binding.

Why is this needed? Some Java types require additional context information to be able to construct POJO instances, for example. And in other cases, you may want to pre-populate values of some fields; and while there are other mechanisms (for example, you can pass an existing POJO instance to the "updateValue()" method), they are quite limited.

Only two things are needed for value injection:

  1. Means to indicate properties for which values are to be injected, and
  2. Definition of values to inject

Default mechanism is to handle first part by using new annotation, @JacksonInject, so that we could have:

  public class InjectableBean
  {
    @JacksonInject("seq") private int sequenceNumber;
    public String name;
  }

and the second part is handled by allowing configuration of an ObjectMapper or ObjectReader instance with InjectableValues, an object that can find values to inject given a value id. Value ids can be specified as either Strings or Classes; if a Class is used, Class.getName() is used to get the actual String id to use. For the above POJO, we could handle deserialization as follows:

  ObjectMapper mapper = new ObjectMapper();
  Integer sequenceNumber = SequenceGenerator.next(); // or whatever
  InjectableValues inject = new InjectableValues.Std()
    .addValue("seq", sequenceNumber);
  final String json = "{\"name\":\"Lucifer\"}";
  InjectableBean value = mapper.reader(InjectableBean.class).withInjectableValues(inject).readValue(json);

For more on this feature, check out FasterXML Wiki's entry on Value Injection.
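The id-based lookup described above can be pictured with a small JDK-only sketch. This is not Jackson's InjectableValues implementation; the class InjectableLookupSketch and its methods are hypothetical and only model the rule that a Class id is translated to its name before lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; models id-to-value lookup only.
final class InjectableLookupSketch {
    private final Map<String, Object> values = new HashMap<>();

    InjectableLookupSketch addValue(String id, Object value) {
        values.put(id, value);
        return this;
    }

    /** Registering by Class uses the class name as the actual String id. */
    InjectableLookupSketch addValue(Class<?> type, Object value) {
        return addValue(type.getName(), value);
    }

    /** Finds the value for an id given either as a String or as a Class. */
    Object find(Object valueId) {
        String key = (valueId instanceof Class<?>)
            ? ((Class<?>) valueId).getName()
            : String.valueOf(valueId);
        if (!values.containsKey(key)) {
            throw new IllegalArgumentException("No injectable value with id '" + key + "'");
        }
        return values.get(key);
    }
}
```

With this model, a value registered under String.class can be found either by passing String.class or the String "java.lang.String".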

2.4 External Type Id

Jackson has had support for full polymorphic type handling since 1.5, allowing configuration of both type identifier in use (usually either a class name, or logical type name) and type inclusion mechanism (as property, as wrapper array, as single-element wrapper object).
This covers wide range of usage scenarios, but there is one inclusion mechanism that is sometimes used but could not be supported by Jackson: that of using "external type identifier". This style of type inclusion is used by some data formats, most notably geoJSON.

By external type identifier we mean case such as this:

 {
  "type" : "rectangle",
  "shape" :  {
   "width": 20.0,
   "height" : 40.0
  }
 }

where type is included as a property ("type") that is outside of JSON Object being typed.

With 1.9 we can support such use case by using @JsonTypeInfo with a new inclusion value:

  public class ShapeContainer
  {
    @JsonTypeInfo(use=Id.NAME, include=As.EXTERNAL_PROPERTY, property="type")
    public Shape shape;    
  }
 
  static class Shape { }

  @JsonTypeName("rectangle") // or rely on class name, Rectangle
  static class Rectangle extends Shape {
    public double width, height;
  }

One thing to note here is that this inclusion mechanism should only be used with properties; annotating classes with @JsonTypeInfo that indicates external type identifiers can cause conflicts.
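The mechanics of an external type id can be sketched without Jackson: the type name is read from a sibling property of the container, and only then is the value itself bound. In the sketch below, plain Maps stand in for JSON Objects, and the class name, the Circle type and the builder table are hypothetical additions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration class; not how Jackson implements EXTERNAL_PROPERTY.
final class ExternalTypeIdSketch {
    static class Shape { }

    static final class Rectangle extends Shape {
        final double width, height;
        Rectangle(double width, double height) { this.width = width; this.height = height; }
    }

    static final class Circle extends Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
    }

    // Maps logical type names to functions that build the matching subtype.
    private static final Map<String, Function<Map<String, Object>, Shape>> BUILDERS = new HashMap<>();
    static {
        BUILDERS.put("rectangle", m -> new Rectangle(number(m, "width"), number(m, "height")));
        BUILDERS.put("circle", m -> new Circle(number(m, "radius")));
    }

    private static double number(Map<String, Object> m, String key) {
        return ((Number) m.get(key)).doubleValue();
    }

    /** Reads the type id from the container's "type" property, then binds "shape" accordingly. */
    @SuppressWarnings("unchecked")
    static Shape readShape(Map<String, Object> container) {
        String typeId = (String) container.get("type");
        Function<Map<String, Object>, Shape> builder = BUILDERS.get(typeId);
        if (builder == null) {
            throw new IllegalArgumentException("Unknown type id: " + typeId);
        }
        return builder.apply((Map<String, Object>) container.get("shape"));
    }
}
```

Because the id lives next to the value rather than inside it, the lookup only makes sense in the context of a containing object, which is one way to see why the mechanism is meant for properties rather than classes.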

2.5 Value instantiators

And last but not least, 1.9 also allows much more control over mechanism used to create actual POJO value instances. While Jackson 1.2 added support for @JsonCreator annotation, there has not been a way to add custom creator objects.

With 1.9, we get following pieces:

  • ValueInstantiator (abstract class), extended by objects used to create value instances
  • ValueInstantiators (interface), provider for per-type ValueInstantiator instances (as well as ValueInstantiators.Base abstract class for actual implementations)
  • Module.SetupContext method addValueInstantiators(); as well as SimpleModule method addValueInstantiator(), for adding provider(s), so modules can easily provide instantiators for types they support
  • @JsonValueInstantiator annotation that can be used as an alternative to specify instantiator used for annotated type.

Above pieces are basically enough to support all three modes of construction @JsonCreator allows (so basically @JsonCreator could be implemented as module, if we wanted!):

  1. "Default" construction that takes no arguments and uses no-argument constructor or factory method
  2. "Delegate-based" construction, in which JSON value is first bound to an intermediate type (such as java.util.Map or Jackson JsonNode), and this instance is passed to single-argument creator method
  3. "Property-based" construction, in which one or more named values (JSON properties) are bound to specified types that match creator arguments, and these are passed to creator method.

Mapping of above construction methods to ValueInstantiator methods is fairly straight-forward:

  1. Simple no-arguments construction (ValueInstantiator.createUsingDefault()): used if the other construction mechanisms are not available: consumes no JSON properties.
  2. Delegate-based construction (ValueInstantiator.createUsingDelegate(Object)): similar to annotating a single-argument constructor or factory method with @JsonCreator, but NOT specifying argument name with @JsonProperty. If specified (i.e. value instantiator indicates it supports this), JSON value for property is first bound into intermediate (delegate) type, and then this value is passed to delegate creator method. Jackson mapper will handle all the details of initial binding, passing delegate object as the argument.
  3. Property-based construction (ValueInstantiator.createFromObjectWith(Object[] args)): similar to using @JsonCreator with arguments that all have @JsonProperty annotation to specify JSON property name to bind.

It is worth noting that order in which availability of different modes is checked is reverse of above: first a check is made to see if property-based method is available; if not, then delegate-based, and finally default construction.
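The check order just described boils down to a simple cascade. The sketch below is a plain-Java model of that order, not Jackson code: the enum, the class name and the three boolean flags are hypothetical stand-ins for what an instantiator would report about itself.

```java
// Hypothetical illustration class; models only the order of the checks.
final class CreationModeSketch {
    enum Mode { PROPERTY_BASED, DELEGATE_BASED, DEFAULT }

    /** Picks a creation mode, checking property-based first, then delegate-based, then default. */
    static Mode choose(boolean canUseProperties, boolean canUseDelegate, boolean canUseDefault) {
        if (canUseProperties) {
            return Mode.PROPERTY_BASED;
        }
        if (canUseDelegate) {
            return Mode.DELEGATE_BASED;
        }
        if (canUseDefault) {
            return Mode.DEFAULT;
        }
        throw new IllegalStateException("No creation mode available");
    }
}
```

So an instantiator that supports both delegate-based and default construction would be used in delegate-based mode.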

Since this is possibly the most complicated new feature, I will need to defer a full example to another blog post. But let's consider a very simple ValueInstantiator implementation that just supports the default (no-argument) instantiation:

  class SimpleInstantiator extends ValueInstantiator
  {
    @Override public String getValueTypeDesc() { // only needed for error messages
      return MyType.class.getName();
    }

    @Override // yes, this creation method is available
    public boolean canCreateUsingDefault() { return true; }

    @Override
    public MyType createUsingDefault() {
      return new MyType(true);
    }
  }

and similarly you can add support for delegate- or property-based methods.

3. Other notable features

Aside from above-mentioned major features, there are many other useful improvements:

  • "mini-core" jar (jackson-mini-1.9.0.jar)
  • DeserializationConfig.Feature.UNWRAP_ROOT_VALUE
  • @JsonView for JAX-RS methods to return a specific JsonView
  • Terse(r) Visibility: ObjectMapper.setVisibility(), VisibilityChecker.with(Visibility)
  • Add standard naming-strategy implementation(s)
  • Add JsonTypeInfo.defaultImpl property to indicate type to use if class id/name missing
  • Add SimpleFilterProvider.setFailOnUnknownId() to disable throwing exception on missing filter id

"Mini core": as name suggests, there is now a new jar (jackson-mini-1.9.0.jar) that is about 40% smaller than the default one -- about 136kB or so. Size reduction is achieved by leaving out text files (LICENSE), as well as annotations, but otherwise functionality is equivalent to standard core package, i.e. supports streaming API (JsonParser/JsonGenerator, JsonFactory).

DeserializationConfig.Feature.UNWRAP_ROOT_VALUE is counterpart to SerializationConfig.Feature.WRAP_ROOT_VALUE; and there is also now a new annotation -- @JsonRootName -- that can be used to use custom wrapper name instead of the simple class name. This is useful with interoperability, as some frameworks insist on adding such wrappers.

One of few improvements to JAX-RS provider is that now you can add @JsonView annotation to JAX-RS resource methods, and if one is found, it will be set as the active Serialization View during serialization of the result value.

One nice ergonomic improvement is the ability to use much more compact configuration methods for changing default introspection visibility levels.
For example, you can use:

  objectMapper.setVisibility(JsonMethod.FIELD, JsonAutoDetect.Visibility.ANY);

to make all fields auto-detectable, regardless of their visibility. Or, to prevent all auto-detection, you could use:

  objectMapper.setVisibilityChecker(objectMapper.getVisibilityChecker()
    .with(JsonAutoDetect.Visibility.NONE));

An improvement to naming strategy support is inclusion of one "standard" naming strategy -- CAMEL_CASE_TO_LOWER_CASE_WITH_UNDERSCORES -- which converts between standard Java Bean names (that setters and getters use), and C-style names (like used by Twitter). You can enable this converter by:

  mapper.setPropertyNamingStrategy(PropertyNamingStrategy.CAMEL_CASE_TO_LOWER_CASE_WITH_UNDERSCORES);

and from there on, can consume JSON like:

 { "first_name" : "Joe" }

to bind to class like:

public class Name { public String firstName; }

without having to use @JsonProperty to fix name mismatch.
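The name conversion itself is easy to picture. Below is a naive sketch of a camel-case to lower-case-with-underscores translation; it is not Jackson's implementation, the class and method names are hypothetical, and the real strategy may treat edge cases (such as runs of capital letters) differently.

```java
// Hypothetical illustration class; a naive version of the conversion.
final class SnakeCaseSketch {
    /** Converts a name like "firstName" to "first_name". */
    static String toLowerCaseWithUnderscores(String camelCase) {
        StringBuilder sb = new StringBuilder(camelCase.length() + 4);
        for (int i = 0; i < camelCase.length(); i++) {
            char c = camelCase.charAt(i);
            if (Character.isUpperCase(c)) {
                if (i > 0) {
                    sb.append('_'); // word boundary before each capital, except at the start
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Names without capitals, such as "age", pass through unchanged.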

As to sub-typing, you can now use the new @JsonTypeInfo property defaultImpl to indicate the default sub-type to use in case the type name is missing or can not be resolved. Use it like:

  @JsonTypeInfo(use=Id.NAME, include=As.PROPERTY, defaultImpl=GenericImpl.class)
  public abstract class BaseType { }

And finally, one improvement to Json Filter functionality is ability to specify that it is ok to use a filter id that does not refer to an actual filter (i.e. can not be resolved by the currently configured filter provider) -- use 'SimpleFilterProvider.setFailOnUnknownId(false)' to make this the default behavior. Missing filter is then assumed to mean "no filtering", that is, serialization is handled as if no filter was specified.

Wednesday, September 28, 2011

Advanced filtering with Jackson, Json Filters

I wrote a bit earlier on "filtering properties with Jackson". While it was comprehensive in that all main methods of filtering were covered, there wasn't much depth. Specifically, only very basic usage of Json Filters (@JsonFilter annotation, SimpleFilterProvider as provider) was considered. This approach does allow more dynamic filtering than, say, @JsonView, but it is still somewhat limited. So let's consider more advanced customizability.

1. Refresher on Json Filters

Ok, so the basic idea with Json Filters is that:

  1. Classes can have an associated Filter Id, which defines logical filter to use.
  2. A provider is needed to get the actual filter instance to use, given id: this will be configured by assigning a FilterProvider (such as 'SimpleFilterProvider') to ObjectMapper or ObjectWriter.
  3. Jackson will dynamically (and efficiently) resolve the filter a given class uses, allowing per-call reconfiguration of filtering.

From this it is clear that there are 2 main things you can configure: mechanism that is used to find Filter id of a given class, and mechanism used for mapping this id to actual filter used (implementation of which can be as complicated as you want).

So let's have a look at both parts.

2. Configuring mapping from id to filter instance

Of mechanisms, latter one may be easier to understand and use: one just has to implement 'FilterProvider', which has but one method to implement:

  public abstract class FilterProvider {
    public abstract BeanPropertyFilter findFilter(Object filterId);
  }

given this, 'SimpleFilterProvider' is little more than a Map<String,BeanPropertyFilter>, except for adding couple of convenience factory methods that build 'SimpleBeanPropertyFilter' instances given property names, so you typically just instantiate one with calls like:

  SimpleBeanPropertyFilter filter = SimpleBeanPropertyFilter.filterOutAllExcept("a");

which would filter out all properties except for the one named "a". This filter is then configured with ObjectMapper like so:

  FilterProvider fp = new SimpleFilterProvider().addFilter("onlyAFilter", filter);
  objectMapper.writer(fp).writeValueAsString(pojo);

which would, then, apply to any Java type configured to use filter with id "onlyAFilter".
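Conceptually, the provider is just a lookup from filter id to a per-property include/exclude decision. The JDK-only sketch below models that idea with Maps and Predicates; it is not Jackson's FilterProvider API, and all names in it (FilterLookupSketch, keepOnly, apply) are hypothetical. It also models the option of treating an unknown id as "no filtering".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration class; Maps stand in for serialized POJO properties.
final class FilterLookupSketch {
    private final Map<String, Predicate<String>> filters = new HashMap<>();
    private boolean failOnUnknownId = true;

    FilterLookupSketch addFilter(String id, Predicate<String> includeProperty) {
        filters.put(id, includeProperty);
        return this;
    }

    FilterLookupSketch setFailOnUnknownId(boolean state) {
        failOnUnknownId = state;
        return this;
    }

    /** Builds a filter that keeps only the named properties. */
    static Predicate<String> keepOnly(String... names) {
        Set<String> kept = new HashSet<>(Arrays.asList(names));
        return kept::contains;
    }

    /** Resolves the filter by id and returns only the properties it lets through. */
    Map<String, Object> apply(String filterId, Map<String, Object> properties) {
        Predicate<String> filter = filters.get(filterId);
        if (filter == null) {
            if (failOnUnknownId) {
                throw new IllegalArgumentException("No filter configured with id '" + filterId + "'");
            }
            filter = name -> true; // unknown id treated as "no filtering"
        }
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (filter.test(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

Because the id-to-filter mapping is resolved at call time, swapping in a differently configured lookup object changes filtering without touching the classes being serialized.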

3. Configuring discovery of filter id

From above example we know we need to indicate classes that are to use our "onlyAFilter". The default mechanism is to use:

  @JsonFilter("onlyAFilter")
  public class FilteredPOJO {
    //...
  }

But this is just the default. How so? The way Jackson figures out its annotation-based configuration is actually indirect, and fully customizable: all interaction is through configured 'AnnotationIntrospector' object, which amongst other things defines this method:

  public Object findFilterId(AnnotatedClass ac);

which is called when serializer needs to determine id of the filter to apply (if any) for given class. Since the default implementation (org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector) has everything else working fine, what we can do is to sub-class it and override this method.
For example:

  public class MyFilteringIntrospector extends JacksonAnnotationIntrospector
  {
    @Override
    public Object findFilterId(AnnotatedClass ac) {
      // First, let's consider @JsonFilter by calling superclass
      Object id = super.findFilterId(ac);
      // but if not found, use our own heuristic; say, just use class name as filter id, if there's "Filter" in name:
      if (id == null) {
        String name = ac.getName();
        if (name.indexOf("Filter") >= 0) {
          id = name;
        }
      }
      return id;
    }
  }

The above functionality is just to show what is possible, not that it makes sense. Alternatively you could of course define your own annotations to check; or keep a List of known class names; or check the class definition or the interfaces the type implements. The main point is just that you are not limited to using @JsonFilter annotation, but can use pretty much any logic you want, within limits of your coding skills.

The only caveat is that the resolution from Class to matching id is only guaranteed to be called once per ObjectMapper; so any variation in filtering of specific class needs to happen at either mapping of id to filter, or within filter itself.

4. Don't be afraid of sub-classing (Jackson)AnnotationIntrospector

Actually, the key takeaway might as well be the fact that AnnotationIntrospector is designed to be customizable. It was initially created to allow easy reuse of JAXB annotations (via JAXBAnnotationIntrospector; combining things with AnnotationIntrospector.Pair); but it is also a very powerful general-purpose customization mechanism. At this point it is still quite an underused one, though.

5. Addendum

Some additional notes based on feedback I received:

  • Custom BeanPropertyFilter implementations are obviously powerful too: not only can they completely change what (if anything) gets written for property, they can base this on all configuration accessible via SerializerProvider which is passed to serializeAsField(): for example, it can check to see what serialization view is available by calling 'provider.getSerializationView()'.

Friday, August 12, 2011

Traversing JSON trees with Jackson

1. Three models to rule the...

One of three canonical JSON processing models, tree model, may look a bit like a red-headed stepchild. The amount of effort so far spent on both developing and documenting Jackson data-binding functionality is an order of magnitude higher than all the work for tree model functionality. And considering how much more effort using stream-based processing takes, surprisingly many developers choose it over tree handling.

2. Why I never really liked tree model that much

I confess to having a slight aversion to using JSON trees as well; but I have a reasonable excuse: I grew to hate tree-based models with XML. Having survived bad experiences of XML DOM processing (which is both cumbersome and inefficient at the same time) tends to inoculate one against further infections. I know this is a bit of an unjustified bias, considering that most problems with DOM had nothing to do with the basic idea of an in-memory tree model (and not all were even due to it being XML...)

3. ... even though I perhaps should have

But Jackson actually does provide reasonable support for JSON trees with its JsonNode based model, and many brave developers have put it to good use. And due to Jackson's extensive support for efficient conversions between models (that is, ability to both combine approaches and to convert data as needed), you don't have to pick and choose just one model but can combine strengths of each model. Tree model's expressive power is actually very useful when doing pre- or post-processing of data binding; or when building quick prototype systems.

4. Basics

A "JSON tree" is defined by one simple thing: org.codehaus.jackson.JsonNode object that acts as the root of the logical tree. The root node is usually of type 'ObjectNode' (and represents JSON Object), but most operations (all read-operations, specifically) are exposed through basic JsonNode interface.

There are three basic options for creating a JSON tree instance, all accessible via ObjectMapper:

  1. Parse from a JSON source: JsonNode root = mapper.readTree(json);
  2. Convert from a POJO: JsonNode root = mapper.valueToTree(pojo); // special case of 'ObjectMapper.convertValue()'
  3. Construct from scratch: ObjectNode root = mapper.createObjectNode();

The choice largely depends on use case, that is, what you have to work with: whether you are generating a new tree from scratch, or modifying an existing JSON structure.

After you have the root node you can traverse it, modify its structure, and convert it to other representations (serialize as JSON, convert to a POJO).

5. Back & Forth

Aside from the ability to convert a POJO to a tree, you can easily do the reverse using "ObjectMapper.treeToValue()". Or, if you happen to need a JsonParser, use "ObjectMapper.treeAsTokens()". And to create actual textual JSON, the regular "ObjectMapper.writeValue()" works as expected.

In fact, from ObjectMapper's perspective, JsonNode is just another Java type and is handled using serializers, deserializers which can be overridden if you want to customize handling. You can even replace JsonNodeFactory that ObjectMapper uses, if you want to provide custom JsonNode implementation classes!

6. More convenient traversal

One of things that has quietly improved over time has been traversal. Earliest Jackson versions just supported basic traversal like so:

  JsonNode root = mapper.readTree("{\"address\":{\"zip\":98040, \"city\":\"Mercer Island\"}}");
  JsonNode address = root.get("address");
  if (address != null && address.has("zip")) {
    int zip = address.get("zip").getIntValue();
  }

but it soon became apparent that null checks are a worthless hassle, so an alternative accessor, "path()", was quickly added. It allows for traversing over virtual paths, without worrying whether a node exists: if one does not exist, it will just be evaluated as "missing node" when trying to access actual leaf value:

  JsonNode root = ...;
  int zip = root.path("address").path("zip").getValueAsInt(); // if no such path, returns 0
  // could also do:
  JsonNode zipNode = root.path("address").path("zip");
  if (zipNode.isMissingNode()) { // true if no such path exists
  }

This is fine and dandy for read-only use cases, but it does not help when trying to add things -- while you can traverse path that does not really exist, you can not add anything to it. To address this shortcoming, Jackson 1.8 comes equipped with "with()" method, which will actually create the path if it does not exist. So you can finally write something like this:

  ObjectNode root = mapper.createObjectNode();
  // note: JsonNode.with() returns 'JsonNode'; but ObjectNode.with() 'ObjectNode' -- go co-variance!
  root.with("address").put("zip", 98040);

which actually makes Jackson Tree usage almost as convenient as I would like it to be. It is especially useful when materializing full trees from scratch: you can implicitly build the tree structure just by traversing it!
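
To make the difference between the two traversal styles concrete without depending on any particular Jackson version, here is a minimal sketch of the same idea on plain nested java.util.Maps. The class 'MapPaths' and its methods are hypothetical names made up for this illustration; this is not how Jackson implements path() or with(), and unlike path() the read-only variant returns null rather than a "missing node":

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration on plain Maps; this is NOT Jackson's implementation.
final class MapPaths {
    private MapPaths() { }

    // Read-only traversal, comparable in spirit to path(): never creates anything,
    // and returns null (instead of a "missing node") if any step does not exist.
    static Object path(Map<String, Object> root, String... names) {
        Object current = root;
        for (String name : names) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(name);
        }
        return current;
    }

    // Creating traversal, comparable in spirit to with(): returns the child Map,
    // adding an empty one first if it is not there yet.
    @SuppressWarnings("unchecked")
    static Map<String, Object> with(Map<String, Object> node, String name) {
        Object child = node.get(name);
        if (child == null) {
            child = new LinkedHashMap<String, Object>();
            node.put(name, child);
        } else if (!(child instanceof Map)) {
            throw new IllegalStateException("'" + name + "' exists but is not an object");
        }
        return (Map<String, Object>) child;
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<String, Object>();
        // Reading a path that does not exist leaves the tree untouched
        System.out.println(path(root, "address", "zip"));
        // Writing through with() builds the intermediate object on the way
        with(root, "address").put("zip", 98040);
        System.out.println(root);
    }
}
```

The point of the sketch is only the asymmetry: the read-only call never changes the structure, whereas the creating call fills in whatever is missing along the way.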

7. More?

Jackson tree model is still somewhat spartan, especially compared to features galore of data binding. Going forward it would be nice to add support for things like:

  • Simple path language (JsonPath, JsonQuery?) support, to be able to evaluate expressions to locate nodes.
  • Filtering during construction, to create trimmed/pruned trees, sub-trees
  • More advanced find methods? There already exist a few "findXxx()" methods in JsonNode, but more would make sense, esp. with configurable matchers or filters
  • Method names are bit too verbose (mostly due to historical reasons -- I didn't realize early enough that long method names can hurt when chained calls are used)

But as usual, much of Jackson development work is feedback-driven -- features that get used are also more likely to get improved further. So if you do find Tree Model useful, let the development team know that!

Thursday, August 11, 2011

One of coolest, least well-known Jackson features: Mr Bean, aka "abstract type materialization"

1. Quest for simplest JSON processing, eliminating monkey code: "struct classes"

I have found myself using "Java structs" quite often, when accessing JSON services from Java. By this I mean simple public-field-only classes like:


public class RequestDTO {
 public long requestId;
 public String callerId;
}

While many Java newbies think there is something wrong in using public fields, there is actually very little harm in using such classes for simple data transfer, if no actual business logic is needed for classes themselves.

2. But sometimes "real" classes would be nice

Then again, sometimes it would be nice to use more full-featured Bean(-like) POJOs. Perhaps we want to add some input validation for setters; or add convenience accessors, or even just occasional 'toString()' implementation.

For above example, we might want to get something like:


public class RequestImpl
{
  private long requestId;
  private String callerId;

  public RequestImpl() { }

  public long getRequestId() { return requestId; }
  public String getCallerId() { return callerId; }

  public void setRequestId(long l) { requestId = l; }
  public void setCallerId(String s) { callerId = s; }

  @Override
  public String toString() {
    return String.format("[request: id %d, caller %s]", requestId, callerId);
  }
}

But ideally we would usually just define something like


public interface Request {
  public long getRequestId();
  public String getCallerId();

  public void setRequestId(long l);
  public void setCallerId(String s);
}

and somehow get an implementation; alas, that usually means writing boiler-plate implementation for that interface (and if we are masochists, sometimes even intermediate abstract classes...)

So what's the problem here? I don't particularly like writing monkey code to declare basic setters, getters, and fields; especially when there is nothing interesting going on there, just mechanical typing. And while one can use IDEs to generate sources, this only helps with bootstrapping: you still get more source code to maintain, which translates to more places where bugs may hide when definitions are edited. Similarly various annotation-based post-processors seem alien to me if they just produce more source code to compile.

3. So why not just like... get implementations "materialize"?

But while I don't like the idea of getting yet more source code generated to be compiled, maintained, I do like the idea of getting actual implementation classes dynamically.

And this is where entry #6 of "7 Jackson killer features" comes in: enter Mr Bean! When enabled, it can actually materialize concrete implementations as needed.

4. Mr Bean: basics

(from FasterXML Mr Bean Wiki page)

Basic usage is simple: you need jackson mrbean jar (included in Jackson distribution), and need to enable functionality with:


  ObjectMapper mapper = new ObjectMapper();
  mapper.registerModule(new MrBeanModule());

and then just watch interfaces appear: for example, with above example:


  Request request = mapper.readValue(jsonInput, Request.class); // where Request is an interface

What happens here is that the Mr Bean extension hooks into ObjectMapper, and whenever an abstract type is encountered and there is no concrete class available (no abstract type mapping; no annotation to indicate concrete type; no @JsonTypeInfo to provide subtype information), it is asked to "materialize" a concrete type.

Materialization simply means generating bytecode using ASM, based on getters and/or setters; adding necessary internal fields, loading the class and returning it to the caller. After this, core Jackson mapper can introspect all the information it needs, and what you get is an instance of this implementation. Implementations are cached for later use, and performance-wise they behave much as manually implemented ones would.
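
To get a feel for what "an implementation appearing at runtime" means, here is a minimal sketch that uses a different and much simpler technique than Mr Bean does: a JDK dynamic proxy (java.lang.reflect.Proxy) instead of ASM-generated bytecode. The class 'ProxyMaterializer' and its 'materialize()' method are hypothetical names made up for this illustration, and the sketch only handles plain getXxx()/setXxx() pairs declared on an interface:

```java
import java.lang.reflect.Array;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Jackson or Mr Bean.
final class ProxyMaterializer {
    private ProxyMaterializer() { }

    // Same shape as the Request interface shown earlier
    interface Request {
        long getRequestId();
        String getCallerId();
        void setRequestId(long l);
        void setCallerId(String s);
    }

    @SuppressWarnings("unchecked")
    static <T> T materialize(final Class<T> type) {
        // One backing "field" per property, keyed by the name after get/set
        final Map<String, Object> values = new HashMap<String, Object>();
        InvocationHandler handler = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) {
                String name = method.getName();
                int argCount = (args == null) ? 0 : args.length;
                if (name.startsWith("set") && argCount == 1) {
                    values.put(name.substring(3), args[0]);
                    return null;
                }
                if (name.startsWith("get") && argCount == 0) {
                    Object value = values.get(name.substring(3));
                    Class<?> returnType = method.getReturnType();
                    if (value == null && returnType.isPrimitive()) {
                        // Unset primitive property: use the default (0, false, ...)
                        // boxed as the wrapper type the proxy expects
                        value = Array.get(Array.newInstance(returnType, 1), 0);
                    }
                    return value;
                }
                // Methods of java.lang.Object are routed through the handler too
                if (name.equals("toString") && argCount == 0) {
                    return type.getSimpleName() + values;
                }
                if (name.equals("hashCode") && argCount == 0) {
                    return System.identityHashCode(proxy);
                }
                if (name.equals("equals") && argCount == 1) {
                    return proxy == args[0];
                }
                throw new UnsupportedOperationException(name);
            }
        };
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] { type }, handler);
    }

    public static void main(String[] args) {
        Request request = materialize(Request.class);
        request.setRequestId(42L);
        request.setCallerId("abc");
        System.out.println(request.getRequestId() + " / " + request.getCallerId());
    }
}
```

A proxy like this pays for reflection on every call and can not fill in abstract classes, which is presumably part of why generating real classes is the more attractive route for a data binder.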

5. Mr Bean: but wait! There's more!

Ok: so we can get monkey code materialized: getters and setters are implemented, and internal fields added to store values. But this is just the beginning.

First: if you do not need to use setters yourself you can freely omit them from interface definition.
Mr Bean is smart enough to figure out that setters are typically needed to set values (or public fields) if there are getters materialized.
So you can simplify your interfaces/abstract classes to look something like:


  public interface RequestWithoutSetters {
    public long getRequestId();
    public String getCallerId();
  }

and things will still work just fine; you can't access setters (which actually may be a good thing), but Jackson data binder can populate values just fine (internally either setters get generated; or public fields added to implementation, this is an implementation detail).

Aside from the simplistic get/set Bean, it is more common to want a partial implementation: an abstract class where you provide some methods and/or fields, but can leave implementation of trivial properties to Mr Bean. This it can do just fine: Mr Bean can materialize abstract classes, just "filling in the blanks".

So you can ask for a class like:


  public abstract class RequestBase {
    public abstract long getRequestId();
    public abstract String getCallerId();

    @Override public String toString() {
      return String.format("[request: id %d, caller %s]", getRequestId(), getCallerId());
    }
  }

and things work, well, as expected. Note, too, that you can implement setters and getters, not just "other" methods.

And finally: you can use annotations normally as well, adding them to your interface/abstract class definition. Thanks to Jackson's powerful and versatile annotation handling (including annotation "inheritance" for methods), you can do something like:


  // JSON we get has weird names; need to annotate
  public abstract class RequestBase {
    @JsonProperty("REQID")
    public abstract long getRequestId();
    @JsonProperty("CALLERID")
    public abstract String getCallerId();
  }

and get things configured as per annotations.

6. Known issues?

Mr Bean seems to work to degree I need it to work. But there are some potential concerns you may need to be aware of:

  • Jackson has multiple ways of dealing with abstract types: do you want a bean materialized or not? As mentioned above, Mr Bean does not try to materialize abstract types that seem to expect a different kind of handling; for example, if an interface has @JsonTypeInfo annotation, the assumption is that polymorphic handling can figure out the actual type. But it is possible that in some corner cases (esp. when using "default typing") there might be conflicts. So polymorphic types may not mix well with Mr Bean materialization
  • Generic signatures may not be added as expected. Although you can declare generic types for abstract methods just fine, and Jackson mapper should find the declarations, there are some issues due to complexities in getting generic declarations to work with ASM. You may need to use additional annotations (@JsonDeserialize(contentAs=...)) in some cases.

Above is just a list of potential concerns -- as far as I know, they haven't been found to be much of a problem in actual use so far.

7. What Next?

Usage, usage, usage! It would be great to get more Jackson users to use this potentially hugely work-saving feature. And if you find the feature useful, make sure to let your friends know! (if you hate it, just let me know :-) ).

Tuesday, July 26, 2011

Jackson tips: using @JsonAnyGetter/@JsonAnySetter to create "dyna beans"

One relatively common "special" POJO is so-called dynamic bean ("dyna-bean"), which is sort of combination of regular bean and basic Java Map; with zero or more properties with known name, and extensible set of 'other' key/value pairs.

Here's what such a POJO might look like:


public class DynaBean
{
    // Two mandatory properties
    protected final int id;
    protected final String name;

    // and then "other" stuff:
    protected Map<String,Object> other = new HashMap<String,Object>();

    public DynaBean(int id, String name)
    {
        this.id = id;
        this.name = name;
    }

    public int getId() { return id; }
    public String getName() { return name; }

    public Object get(String name) {
        return other.get(name);
    }

    public void set(String name, Object value) {
        other.put(name, value);
    }
}

Since Jackson can serialize Beans as well as Maps, what is the problem? As presented, the bean would not serialize and deserialize as expected, although it could be modified to just return a Map of "other" properties, and deserialize them back. This would work, but would result in an additional level of wrapping, so that secondary properties would be within a separate JSON Object.

But Jackson can actually be made to work with such POJOs: here is one way to do it:


public class DynaBean
{
    // Two mandatory properties
    protected final int id;
    protected final String name;

    // and then "other" stuff:
    protected Map<String,Object> other = new HashMap<String,Object>();

    // Could alternatively add setters, but since these are mandatory, pass them via constructor
    @JsonCreator
    public DynaBean(@JsonProperty("id") int id, @JsonProperty("name") String name)
    {
        this.id = id;
        this.name = name;
    }

    public int getId() { return id; }
    public String getName() { return name; }

    public Object get(String name) {
        return other.get(name);
    }

    // "any getter" needed for serialization    
    @JsonAnyGetter
    public Map<String,Object> any() {
        return other;
    }

    @JsonAnySetter
    public void set(String name, Object value) {
        other.put(name, value);
    }
}

And there we have it: serializes and deserializes nicely.

Share and enjoy...

Monday, July 04, 2011

Jackson Annotations: @JsonCreator demystified

One of more powerful features of Jackson is its ability to use arbitrary constructors for creating POJO instances, by indicating constructor to use with @JsonCreator annotation. A simple explanation of this feature can be found from FasterXML wiki; but it only scratches surface of all the power this annotation exposes. This article will expand on what can be done with this annotation.

1. What are Creators in Jackson?

"Creator" refers to two kinds of things: constructors, and static factory methods: both can be used to construct new instances. Term "creator" is used for convenience, to avoid repeating "constructor or static factory method".

In some places term "creator method" may be used, although this is not preferred (since constructors are not regular methods). In case of static factory methods, method's return type must be same as the declaring class's type, or its subtype.

2. Two kinds of creators: property-based, delegate-based

There are two kinds of creators that one can denote using @JsonCreator: property- and delegate-based creators:

  • Property-based creators take one or more arguments; all of which MUST be annotated with @JsonProperty, to specify JSON name used for property. They can only be used to bind data from JSON Objects; and each parameter represents one property of the JSON Object; type of property being used for binding data to be passed as that parameter when calling creator.
  • Delegate-based creators take just one argument, which is NOT annotated with @JsonProperty. Type of that property is used by Jackson to bind the whole JSON value (JSON Object, array or scalar value), to be passed as value of that one argument.

In addition, sometimes distinction is made between two kinds of delegate-based creators: those that take scalar value (int/Integer/long/Long, string or boolean/Boolean) and other delegates. There is only one important distinction from user perspective: there is no creator overloading, so only one creator of each type is allowed. More on this later on.

3. Property-based creators

Property-based creators are typically used to pass one or more obligatory parameters into constructor (either directly or via factory method). If a property is not found from JSON, null is passed instead (or, in case of primitives, so-called default value; 0 for ints and so on).

A typical use case could look like this:


  public class NonDefaultBean {
    private final String name;
    private final int age;

    private String type;
  
    @JsonCreator
    public NonDefaultBean(@JsonProperty("name") String name, @JsonProperty("age") int age)
    {
      this.name = name;
      this.age = age;
    }

    public void setType(String type) {
      this.type = type;
    }
  }

where two properties are passed via constructor creator; and then a third one through regular setter. Note that we could have as well used a static factory method as creator.

One thing to note about these creators is that it is possible to create a sub-type of class: this allows implementing polymorphic handling manually if necessary, although with limitation that creator itself must be in base class. But all other properties can be passed to sub-classes, as Jackson is smart enough to check type and properties of the actual instance, not just declared nominal type.

4. Delegate-based creators

Whereas you can use multiple arguments with properties-based creators (but must explicitly name them), delegate-based creator takes just one argument. Jackson will then use type of that argument for data-binding, and bind JSON to that type before calling creator. Creator is then free to construct instance any way it wants to. A typical usage looks like:


  public class ObjectDelegateBean
  {
    private final String name;
    private final int age;
    private final String type;

    @JsonCreator
    public ObjectDelegateBean(Map<String,Object> props)
    {
      name = (String) props.get("name");
      age = (Integer) props.get("age");
      type = (String) props.get("type");
    }
  }  

In this case we just manually extract properties from a Map (where JSON data has been bound to "natural" types; Maps, Lists, String, Numbers and Booleans). Other common delegate types used are JsonNode (to use JSON tree as intermediate form), TokenBuffer (stream of exact JsonTokens, which can be used to create a JsonParser) and basic java.lang.Object (which would map to natural types mentioned earlier).

But as I mentioned earlier, it is also possible to use variants to handle scalar types: specifically ints (or Integers), long (or Longs) and Strings (in future support will also be added for double/Double).

For example:


  public class DateBean
  {
    private Date date;

    private DateBean(Date date) {
      this.date = date;
    }

    @JsonCreator
    public static DateBean factory(long timestamp) {
      Date d = new Date(timestamp);
      return new DateBean(d);
    }
  }

which is a bit of a contrived example, as Jackson can easily bind java.util.Date directly. However, some JSON formats support overloading, or try to minimize size; and JSON Strings are sometimes used to bundle semi-structured data.

One point to note about "scalar delegates" is that Jackson allows overloading of creators, as long as there is only one creator type per JSON type: so you can have only one properties-based creator (or delegate-based creator that takes something other than a match of JSON scalar type), but there can be an additional string-delegate creator as well as an int-delegate creator. The limitation is conceptually simple: there must only be one applicable delegate to choose from, knowing type of JSON value being bound -- and this is why delegates that take String, Integer/int or Long/long are special cases of delegate-based creators.

5. More to know?

In near future (Jackson 1.9) there will be significant improvements in this area: new things called "value instantiators" will be allowed to handle feature set similar to that of @JsonCreator; but that need not be bound to existing Java constructors or factory methods.

Friday, July 01, 2011

Oh yes, Jackson 1.8 was released a while ago... :-)

Whoa. Looks like I forgot to blog about something rather big: the release of Jackson 1.8.0. This is mostly because I did that shortly before going for a nice long vacation, and by the time I came back, I had forgotten most of it. Although I was reminded of the release in form of a nice long list of new bug reports. :-)

At any rate, I did write something about the release; so as a background, here are contemporary accounts of the event:

Actually, both of above enumerate all the new features, so I can't add much more at detailed level.

But one thing that may not be obvious is that 1.8 was a huge step forward to allow full power of configurability for Jackson Modules; concept which was added in 1.7, but that really became useful enough for major extensions in 1.8. As a result all the exciting modules (esp. ones at FasterXML GitHub) -- including "Jackson XML module" for XML-based serialization, deserialization; Hibernate Jackson module, and even Jackson Scala module -- require 1.8 as their baseline. And at least XML and Scala modules were driving some of improvements made to module interface.

Another main goal was to focus on paying down sort of "feature debt" -- something similar to technical debt, but relating to the fact that sometimes oldest feature-requests tend to be forgotten, and development focuses more on latest ideas. In 1.8, then, we saw completion of many long-standing (and sometimes, long-ignored) feature requests.

Anyway -- it has been a while since 1.8.0 was released; and at this point I am focused on starting work on 1.9.0. While only one new feature has been completed, there is lots of work behind the scenes; much of which is aimed at helping modules (or working on specific modules). But more on these ventures in future blog posts; stay tuned!

Tuesday, May 31, 2011

On proper performance testing of Java JSON processing

Lately I have spent little time writing or worrying about the performance of JSON processing on Java platform. As has been repeatedly pointed out, JSON processing is typically NOT amongst biggest bottlenecks, compared to aspects like database access or HTTP request handling overhead.

But lately there seems to have been bit of renaissance on writing simple Java JSON parsers, and typically these new projects also provide performance tests that seek to prove the superior performance of the offering. Alas, while writing performance tests is not exactly rocket surgery, there are many pitfalls that can trip new performance engineers and testers, rendering many of initial reports misleading at best. And while the results are usually corrected over time, based on feedback, first impressions tend to stick ("but wasn't XYZ the fastest things ever?").

So I figured that since I often end up pointing out issues with these tests, I might as well summarize a list of typical problems that plague new performance benchmarks. Maybe this helps in making the whole process more efficient. At the very least I can just send a URL to point to this entry, as a sort of starting point or FAQ.

With that, here is a collection of common problems with Java JSON processing test suites. Many of these problems are also applicable for related test suites, from those testing other data formats (XML, binary data formats) to large-scale data processing (map/reduce and other "big data").

1. Missing JVM warmup

For developers on non-managed languages (C/C++ for example), testing is relatively simple: after binary code is loaded in (and perhaps test data from disk), the system is ready to be measured, and there is little variance between test runs, or need to repeat tests for a large number of repetitions.

This is very different for Java, since JVM is very much an adaptive system: while program code is loaded as somewhat abstract bytecode, it will be converted to native code (to have anything close to optimal speed), and this occurs based on measurements that JVM itself does to figure out parts that need to be optimized, on-the-fly. In addition, garbage collection will have impact on performance. To complicate things further the standard class library (JDK classes) is a big thing so that its initialization takes much more time than that of native libraries like libc for C. Trying to test Java libraries same way as C libraries is a recipe for disaster.

What this means is that unless care is taken, measurements that do not account for initial startup overhead may well be just measuring efficiency of JVM at initializing itself, and to some degree complexity of the library being tested (since one-time overhead of more complex, or just bigger, libraries is higher than those of simplest libraries). But what is not being tested is eventual steady-state performance of the library. And since this steady-state is what actually matters most for server-side processes, results will be irrelevant.

So unless you really want to just measure the startup time of an application or library, make sure to run test code for a non-trivial amount of time (at minimum, multiple seconds) first before starting actual measurements. You should also run tests long enough to get stable measurements. Ideally this steady state would be statistically validated, which is one reason why performance test frameworks are very useful for writing performance tests; typically their authors have solved many of the obvious issues.

Another slightly subtler issue is that the order in which code is loaded (and, over time, dynamically optimized) matters as well: often the test that is run first is best optimized by JVM. This means that optimally different tests (tests for different libraries) should be run on separate JVMs, all "warmed up" running the specific test case. This is something that most performance benchmarking frameworks can also help with (I am most familiar with Japex, which already runs separate tests on separate JVMs).
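
As a rough illustration of the warmup advice, here is a minimal hand-rolled sketch: run the task untimed for a while, and only then start the clock. The class 'WarmupBenchmark' and the method 'averageNanos()' are hypothetical names made up for this example; it does no statistical validation and no JVM isolation, both of which a real benchmarking framework would take care of:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical mini-harness; a benchmarking framework does this more rigorously.
final class WarmupBenchmark {
    private WarmupBenchmark() { }

    // Runs the task untimed for at least warmupMillis, then returns the
    // average nanoseconds per call over measuredRuns timed calls.
    static long averageNanos(Runnable task, long warmupMillis, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        long warmupNanos = TimeUnit.MILLISECONDS.toNanos(warmupMillis);
        long warmupStart = System.nanoTime();
        do {
            task.run(); // results of these calls are deliberately not timed
        } while (System.nanoTime() - warmupStart < warmupNanos);

        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / measuredRuns;
    }

    public static void main(String[] args) {
        final StringBuilder sb = new StringBuilder();
        Runnable task = new Runnable() {
            @Override
            public void run() {
                sb.setLength(0);
                for (int i = 0; i < 1000; i++) {
                    sb.append(i);
                }
            }
        };
        // Multiple seconds of warmup, as suggested above, before measuring
        long nanos = averageNanos(task, 3000, 10000);
        System.out.println("avg ns/call after warmup: " + nanos + " (" + sb.length() + " chars built)");
    }
}
```

To compare two libraries with something like this, each would be run in its own JVM invocation, for the ordering reason explained above.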

2. Trivial payloads

Another common mistake is that of using trivial data: something that is so small that it:

  • Is unlikely to be used as data for real production systems, and
  • Is so light-weight to process that test mostly checks how much per-invocation overhead library has

In case of JSON, for example, some tests use the tiniest data snippets (single String; array with one integer element). Unless your actual use case revolves around such tiny data you probably should not be testing such cases; or at least use a wide set of likely input data, to emphasize more common cases.
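
One simple way to avoid testing only the degenerate case is to run the same test at several payload sizes. The sketch below generates synthetic JSON arrays of a given number of records; the class 'PayloadSizes', the method 'records()' and the record layout are all made up for this illustration, and documents captured from a real system are still a better choice when they are available:

```java
// Hypothetical generator of synthetic test documents of varying size.
final class PayloadSizes {
    private PayloadSizes() { }

    // Builds a JSON array with 'count' small objects, e.g. for count = 2:
    // [{"id":0,"name":"user0"},{"id":1,"name":"user1"}]
    static String records(int count) {
        StringBuilder sb = new StringBuilder();
        sb.append('[');
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"id\":").append(i)
              .append(",\"name\":\"user").append(i).append("\"}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Cover a range of sizes instead of a single trivial document
        int[] counts = { 1, 100, 10000 };
        for (int count : counts) {
            System.out.println(count + " records -> " + records(count).length() + " chars");
        }
    }
}
```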

3. Incorrect input modeling

When testing processing of data formats, data usually comes from outside JVM -- it may be read from a storage device, or received from (or sent to) external services over network. If so, data will arrive as a byte stream of some kind, so the most natural representation is usually java.io.InputStream. Or, if data is length-prefixed, it may be processed by reading it all in a byte array (or ByteBuffer), and offered to library using such abstraction.

But (too) many performance tests assume that input comes as a java.lang.String. This is peculiar, given that Strings are really things that only live within JVM, and must always be constructed from a byte stream or buffer, decoded from an external encoding such as UTF-8 or ISO-8859-1. About the only cases where input actually arrives as a String are unit tests; or sometimes when another processing component has handled reading and decoding of content.

Now: given that reading and character decoding are generally mandatory steps, how is it actually achieved? Many libraries punt this issue by just declaring that what they accept is just a String (or, sometimes, Reader). This is functionally acceptable as JDK provides simple ways to handle decoding (for example by using java.io.InputStreamReader, or constructing String instances by specifying encoding). But this also happens to be one area where more advanced parsers can optimize processing significantly, by making good use of specific properties of encoding (for example, fact that JSON MUST be encoded using one of only 3 allowed Unicode-based encodings).

So the specific problem of using "too refined input" is two-fold:

  1. It underestimates real overhead of (JSON) processing, by omitting a mandatory decoding step, and
  2. It eliminates performance comparison of one important part of the process, essentially punishing libraries that can do the decoding step more efficiently than others.

Effects of decoding overhead are non-trivial: for JSON, it is common that UTF-8 decoding can take nearly as much time as actual tokenization (~= "parsing") of decoded input; which is also why quite a bit of effort has been spent to make parsers more efficient at decoding than general-purpose UTF-8 decoders (such as one that JDK comes equipped with).
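One way to see how much a String-based test leaves out is to time the same work with and without the decoding step. The following is only a rough sketch under stated assumptions: countStructural() is a hypothetical and deliberately naive stand-in for tokenization (it even counts characters inside String values), and a simple nanoTime loop is not a rigorous benchmark. Actual numbers will vary by JVM and data, so none are shown.

```java
import java.nio.charset.StandardCharsets;

class DecodeOverheadSketch {
    // Hypothetical, naive stand-in for tokenization: counts characters
    // that are structural in JSON.
    static int countStructural(String s) {
        int count = 0;
        for (int i = 0, len = s.length(); i < len; ++i) {
            switch (s.charAt(i)) {
            case '{': case '}': case '[': case ']': case ':': case ',':
                ++count;
            }
        }
        return count;
    }

    // "Too refined" input: decoding was already done outside the timed code
    static int fromString(String predecoded) {
        return countStructural(predecoded);
    }

    // Realistic input: the mandatory UTF-8 decoding is part of the work
    static int fromBytes(byte[] encoded) {
        return countStructural(new String(encoded, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 10000; ++i) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i).append(",\"name\":\"item-").append(i).append("\"}");
        }
        String text = sb.append(']').toString();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        int reps = 200;
        long sink = 0; // accumulate results so the JIT cannot discard the work
        for (int i = 0; i < reps; ++i) { // warm-up
            sink += fromString(text) + fromBytes(bytes);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; ++i) sink += fromString(text);
        long t1 = System.nanoTime();
        for (int i = 0; i < reps; ++i) sink += fromBytes(bytes);
        long t2 = System.nanoTime();

        System.out.println("String input: " + (t1 - t0) / reps + " ns/doc");
        System.out.println("byte[] input: " + (t2 - t1) / reps + " ns/doc");
        System.out.println("(checksum " + sink + ")");
    }
}
```

The difference between the two timings is the cost that a String-only test silently drops from every library's result.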

4. What am I testing again?

Another common mistake is defining vaguely (or not at all) what exactly is being measured. This actually starts even earlier, with incorrect terminology for the library: most JSON "parsers" are much more than parsers. In fact, I cannot think of a single Java JSON library that is just a parser (or perhaps more accurately, tokenizer): most new JSON processing libraries implement or embed a low-level parser but also bundle a higher-level abstraction (either tree model or data binding -- see "three ways to process JSON" for longer discussion on available processing modes). This is not limited to JSON, by the way; with some other data formats like XML things are even worse: many things called parsers (such as DOM or JDOM) do not even include a parser themselves! Instead, they use an actual low-level XML parser (SAX or Stax) and then just implement a tree model on top of it.

But why does it matter? The basic issue is that sometimes the comparison is apples to oranges: for example, comparing a simple streaming parser to a data-binding processor (or one that provides a tree model) is not a fair comparison, given that the functionality provided is very different from the user's perspective.

Going back to the "JSON parser" misnomer: some of the tests choose to test performance for a specific processing model -- often Tree Model, probably because the original "org.json parser" only offers this abstraction -- yet claim the result as proof that "parser XXX is the Fastest Java JSON parser!". This is incorrect since it bundles together both low-level parsing (which is most efficient to do with a minimal incremental streaming parser) and building (and possibly manipulation) of a Tree model on top. And to give some idea of relative performance: building of a tree model can take more time than parsing (tokenizing) JSON content -- this is similar to XML processing, where building of a DOM tree typically does take more time (often 2x) than low-level parsing, although JSON tree models are usually much simpler than XML tree models.
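The layering can be illustrated with a toy example. This is not JSON and not any real library: the grammar is an assumption made for brevity (well-formed nested arrays of non-negative integers only), and all names are hypothetical. The point is just that the same tokens can be consumed either by a streaming pass that retains nothing, or by a tree builder that does the same scanning plus allocation on top.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeVsTreeSketch {
    // Streaming pass: visits every token but keeps nothing in memory
    static int countTokens(String text) {
        int tokens = 0;
        int i = 0;
        int len = text.length();
        while (i < len) {
            char c = text.charAt(i);
            if (c == '[' || c == ']') {
                ++tokens;
                ++i;
            } else if (c >= '0' && c <= '9') {
                ++tokens; // a run of digits is one number token
                while (i < len && text.charAt(i) >= '0' && text.charAt(i) <= '9') ++i;
            } else {
                ++i; // separator or whitespace
            }
        }
        return tokens;
    }

    private final String in;
    private int pos;

    private TokenizeVsTreeSketch(String in) { this.in = in; }

    // Tree model: same scanning, plus building Lists and Longs on top.
    // Assumes well-formed input starting with '['.
    static List<Object> buildTree(String text) {
        return new TokenizeVsTreeSketch(text).readArray();
    }

    private List<Object> readArray() {
        List<Object> result = new ArrayList<>();
        ++pos; // skip '['
        while (in.charAt(pos) != ']') {
            char c = in.charAt(pos);
            if (c == '[') {
                result.add(readArray());
            } else if (c >= '0' && c <= '9') {
                long value = 0;
                while (in.charAt(pos) >= '0' && in.charAt(pos) <= '9') {
                    value = value * 10 + (in.charAt(pos) - '0');
                    ++pos;
                }
                result.add(value);
            } else {
                ++pos; // separator or whitespace
            }
        }
        ++pos; // skip ']'
        return result;
    }

    public static void main(String[] args) {
        String doc = "[1,[2,3],[]]";
        System.out.println(countTokens(doc)); // prints 9
        System.out.println(buildTree(doc));   // prints [1, [2, 3], []]
    }
}
```

A test that times buildTree() and reports the result as "parser speed" is measuring both layers at once; a fair comparison either times the same layer for every library or says clearly which layers are included.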

The important thing here is that a test should clearly explain what is being measured and, in cases where differing approaches are compared, what the trade-offs are.


About me

  • I am known as Cowtowncoder