Monday, April 09, 2012

Data format auto-detection with Jackson (JSON, XML, Smile, YAML)

There is one fairly advanced feature of Jackson that has been around for a while (since version 1.8) but has not been publicized much: data format auto-detection. Let's see how it works, and what it can be used for.

1. Format detection?

By format detection I mean the ability to figure out the most likely data format of a piece of content. Auto-detection means that a piece of code can deduce this automatically, given a set of data formats to recognize and an accessor to the content.

Jackson 1.8 added this capability by adding one new method to the JsonFactory class:

  public MatchStrength hasFormat(InputAccessor acc)

as well as a couple of supporting classes and, most importantly, a helper class:

  DataFormatDetector

that coordinates the calls to provide a convenient mini-API for format auto-detection.

2. Show Me Some Code!

Let's start with a simple demonstration, with known content that should be either JSON or XML:

  // JsonFactory is in com.fasterxml.jackson.core; DataFormatDetector, DataFormatMatcher
  // and MatchStrength are in com.fasterxml.jackson.core.format
  JsonFactory jsonF = new JsonFactory();
  XmlFactory xmlF = new XmlFactory(); // from com.fasterxml.jackson.dataformat.xml (jackson-dataformat-xml)
  // note: ordering is important; the first one that gives an optimal match is chosen:
  DataFormatDetector det = new DataFormatDetector(new JsonFactory[] { jsonF, xmlF });
  // let's accept about any match; but only if no "solid match" found
  det = det.withMinimalMatch(MatchStrength.WEAK_MATCH)
      .withOptimalMatch(MatchStrength.SOLID_MATCH);
  // then see what we get:
  DataFormatMatcher match = det.findFormat("{ \"name\" : \"Bob\" }".getBytes("UTF-8"));
  assertEquals(jsonF.getFormatName(), match.getMatchedFormatName());
  match = det.findFormat("<?xml version='1.0'?><root/>".getBytes("UTF-8"));
  assertEquals(xmlF.getFormatName(), match.getMatchedFormatName());
  // or:
  match = det.findFormat("neither really...".getBytes("UTF-8"));

which is useful if we want to display information; but, perhaps even more usefully, we can go on to process the data conveniently.
So let's assume we have a file "data", in either XML or JSON format:

  // note: can pass either byte[] or InputStream
  match = det.findFormat(new FileInputStream("data"));
  JsonParser p = match.createParserWithMatch();
  // or, if we wanted to get the factory:
  // JsonFactory matchedFactory = match.getMatch();
  ObjectMapper mapper = new ObjectMapper();
  User user = mapper.readValue(p, User.class);

Basically, you can let DataFormatMatcher construct a parser for the matched format (note: some data formats require a specific kind of ObjectMapper to be used).
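Detection has to read the first bytes of the content, yet the parser that is created afterwards still needs to see the document from its very start. One stdlib-only way to get that behaviour is mark/reset on a BufferedInputStream. The sketch below only illustrates the idea and is not Jackson's code: the class LookAhead and its methods are hypothetical names, and it assumes Java 11 or later (for readNBytes).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: peek at leading bytes without consuming them. */
final class LookAhead {
    private final BufferedInputStream in;

    LookAhead(InputStream raw) {
        this.in = new BufferedInputStream(raw);
    }

    /** Returns up to max leading bytes, then rewinds the stream. */
    byte[] peek(int max) {
        try {
            in.mark(max);                      // remember the current position
            byte[] head = in.readNBytes(max);  // shorter if the input ends early
            in.reset();                        // rewind so nothing is lost
            return head;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads everything from the current position (what a parser would see). */
    byte[] readAll() {
        try {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "{ \"name\" : \"Bob\" }".getBytes(StandardCharsets.UTF_8);
        LookAhead la = new LookAhead(new ByteArrayInputStream(doc));
        byte[] head = la.peek(4);
        System.out.println("first byte: " + (char) head[0]);
        System.out.println("all bytes still readable: " + (la.readAll().length == doc.length));
    }
}
```

A detector would inspect the bytes returned by peek(), and whichever parser is chosen would then read from the same stream as if nothing had been consumed.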

3. Works on... ?

Basically, any format for which there is a JsonFactory that properly implements the method "hasFormat()" can be auto-detected.

Currently (Jackson 2.0.0) this includes the following data formats:

  1. JSON -- can detect standards-compliant data (main-level JSON Object or Array); and to some degree other variants (scalar values at root-level)
  2. Smile -- reliably detected, especially when the standard header is written (enabled by default)
  3. XML -- reliably detected either from XML declaration, or from first tag, PI or comment
  4. YAML -- the experimental Jackson YAML module can use the document start marker ("---") for reliable detection; otherwise the result is inconclusive
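The cues in the list above can be sketched as simple checks on the leading bytes. This is a rough stdlib-only illustration, not the logic of the actual Jackson factories: the class FormatSniffer, its methods and the local Strength enum (named after Jackson's MatchStrength constants) are stand-ins, and the strength assigned to each cue is my own assumption. The Smile check looks for the standard 3-byte header, ":)" followed by a linefeed.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical first-bytes checks for the four formats listed above. */
final class FormatSniffer {
    enum Strength { NO_MATCH, INCONCLUSIVE, WEAK_MATCH, SOLID_MATCH, FULL_MATCH }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Index of the first byte that is not a space, tab, CR or LF
    private static int skipWhitespace(byte[] b) {
        int i = 0;
        while (i < b.length && (b[i] == ' ' || b[i] == '\t' || b[i] == '\r' || b[i] == '\n')) {
            i++;
        }
        return i;
    }

    private static boolean startsWith(byte[] b, int from, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (b.length - from < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (b[from + i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    static Strength json(byte[] b) {
        int i = skipWhitespace(b);
        if (i == b.length) {
            return Strength.INCONCLUSIVE;       // nothing to look at
        }
        byte c = b[i];
        if (c == '{' || c == '[') {
            return Strength.SOLID_MATCH;        // root-level Object or Array
        }
        if (c == '"' || c == '-' || (c >= '0' && c <= '9')) {
            return Strength.WEAK_MATCH;         // could be a root-level scalar
        }
        return Strength.NO_MATCH;
    }

    static Strength smile(byte[] b) {
        // Standard Smile header: ':' ')' '\n'
        return startsWith(b, 0, ":)\n") ? Strength.FULL_MATCH : Strength.NO_MATCH;
    }

    static Strength xml(byte[] b) {
        int i = skipWhitespace(b);
        if (startsWith(b, i, "<?xml")) {
            return Strength.FULL_MATCH;         // XML declaration
        }
        if (startsWith(b, i, "<")) {
            return Strength.SOLID_MATCH;        // first tag, PI or comment
        }
        return Strength.NO_MATCH;
    }

    static Strength yaml(byte[] b) {
        // Only the document start marker is a usable cue in this sketch
        return startsWith(b, 0, "---") ? Strength.SOLID_MATCH : Strength.INCONCLUSIVE;
    }
}
```

For example, FormatSniffer.json(FormatSniffer.bytes("[1, 2]")) gives SOLID_MATCH, while FormatSniffer.yaml(FormatSniffer.bytes("a: 1")) stays INCONCLUSIVE because there is no start marker to go by.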

One existing data format for which auto-detection does not yet work is CSV: this is mostly due to its inherent lack of a header of any kind. However, some heuristic support will likely be added soon.

4. Most useful for?

This feature was originally implemented to allow for automatic detection and parsing of content that would be in either JSON or a binary JSON (Smile) representation. For this use case, things work reliably and efficiently.

But fortunately the system was designed to be pluggable, so it should work for a variety of other cases as well. Ideally this should nicely complement the "universal data adapter" goal of the Jackson project, so that you could simply feed in a data file and, as long as it is in one of the supported formats, things would Just Work.

5. Caveats

Some things to note:

  1. The order of factories used for constructing DataFormatDetector matters: the first one that provides an optimal match is taken; and if no optimal match is found, the first of otherwise equal acceptable matches is chosen
  2. Some data formats require a specific ObjectMapper implementation (sub-class) to be used: for those formats, automatic parser creation needs to be coupled with choosing the right mapper (this may be improved in the future)
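The ordering rule in the first caveat can be sketched as a single loop over the candidates. This is a minimal stdlib-only illustration under stated assumptions, not Jackson's source: OrderedSelection and choose are hypothetical names, the local Strength enum stands in for MatchStrength, and the per-format strengths are passed in precomputed rather than obtained by calling each factory.

```java
/** Hypothetical sketch of order-sensitive selection among format candidates. */
final class OrderedSelection {
    enum Strength { NO_MATCH, INCONCLUSIVE, WEAK_MATCH, SOLID_MATCH, FULL_MATCH }

    /**
     * names[i] is a format name and strengths[i] is how well it matched.
     * Returns the chosen name, or null if nothing reaches the minimal strength.
     */
    static String choose(String[] names, Strength[] strengths,
                         Strength minimal, Strength optimal) {
        String best = null;
        Strength bestStrength = null;
        for (int i = 0; i < names.length; i++) {
            Strength s = strengths[i];
            if (s.compareTo(minimal) < 0) {
                continue;                   // below the acceptable threshold
            }
            if (best != null && s.compareTo(bestStrength) <= 0) {
                continue;                   // a tie keeps the earlier candidate
            }
            best = names[i];
            bestStrength = s;
            if (s.compareTo(optimal) >= 0) {
                break;                      // good enough: stop looking
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] names = { "JSON", "XML" };
        Strength[] strengths = { Strength.SOLID_MATCH, Strength.FULL_MATCH };
        // JSON is chosen although XML matched more strongly, because JSON comes
        // first and already reaches the optimal level
        System.out.println(choose(names, strengths,
                Strength.WEAK_MATCH, Strength.SOLID_MATCH));
    }
}
```

With these settings, a later candidate can only win if every earlier candidate fell short of the optimal level and was strictly weaker than it, which is why the order of the factories matters.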

About me

  • I am known as Cowtowncoder