Thursday, May 24, 2012

Doing actual non-blocking, incremental HTTP access with async-http-client

Async-http-client library, originally developed at Ning (by Jean-Francois, Tom, Brian and maybe others and since then by quite a few others) has been around for a while now.
Its main selling point is the claim for better scalability compared to alternatives like Jakarta HTTP Client (this is not the only selling points: its API also seems more intuitive).

But although library itself is capable of working well in non-blocking mode, most examples (and probably most users) use it in plain old blocking mode; or at most use Future to simply defer handling of respoonses, but without handling content incrementally when it becomes available.

While this lack of documentation is bit unfortunate just in itself, the bigger problem is that most usage as done by sample code requires reading the whole response in memory.
This may not be a big deal for small responses, but in cases where response size is in megabytes, this often becomes problematic.

1. Blocking, fully in-memory usage

The usual (and potentially problematic) usage pattern is something like:

  AsyncHttpClient asyncHttpClient = new AsyncHttpClient();
  Future<Response> f = asyncHttpClient.prepareGet("http://www.ning.com/ ").execute();
  Response r = f.get();
byte[] contents = r.getResponseBodyAsBytes();

which gets the whole response as a byte array; no surprises there.

2. Use InputStream to avoid buffering the whole entity?

The first obvious work around attempt is to have a look at Response object, and notice that there is method "getResponseBodyAsStream()". This would seemingly allow one to read response, piece by piece, and process it incrementally, by (for example) writing it to a file.

Unfortunately, this method is just a facade, implemented like so:

 public InputStream getResponseBodyAsStream() {
return new ByteArrayInputStream(getResponseBodyAsBytes());
}

which actually is no more efficient than accessing the whole content as a byte array. :-/

(why is it implemented that way? Mostly because underlying non-blocking I/O library, like Netty or Grizzly, provides content using "push" style interface, which makes it very hard to support "pull" style abstractions like java.io.InputStream -- so it is not really AHC's fault, but rather a consequence of NIO/async style of I/O processing)

3. Go fully async

So what can we do to actually process large response payloads (or large PUT/POST request payloads, for that matter)?

To do that, it is necessary to use following callback abstractions:

  1. To handle response payloads (for HTTP GETs), we need to implement AsyncCompletionHandler interface.
  2. To handle PUT/POST request payloads, we need to implement BodyGenerator (which is used for creating a Body instance, abstraction for feeding content)

Let's have a look at what is needed for the first case.

(note: there are existing default implementations for some of the pieces -- but here I will show how to do it from ground up)

4. A simple download-a-file example

Let's start with a simple case of downloading a large file into a file, without keeping more than a small chunk in memory at any given time. This can be done as follows:


public class SimpleFileHandler implements AsyncHandler<File>
{
 private File file;
 private final FileOutputStream out;
 private boolean failed = false;

 public SimpleFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   part.writeTo(out);
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }
}

Voila. Code is not very brief (event-based code seldom is), and it could use some more handling for error cases.
But it should at least show the general processing flow -- nothing very complicated there, beyond basic state machine style operation.

5. Booooring. Anything more complicated?

Downloading a large file is something useful, but while not a contriver example, it is rather plain. So let's consider the case where we not only want to download a piece of content, but also want uncompress it, in one fell swoop. This serves as an example of additional processing we may want to do, in incremental/streaming fashion -- as an alternative to having to store an intermediate copy in a file, then uncompress to another file.

But before showing the code, however, it is necessary to explain why this is bit tricky.

First, remember that we can't really use InputStream-based processing here: all content we get is "pushed" to use (without our code ever blocking with input); whereas InputStream would want to push content itself, possibly blocking the thread.

Second: most decompressors present either InputStream-based abstraction, or uncompress-the-whole-thing interface: neither works for us, since we are getting incremental chunks; so to use either, we would first have to buffer the whole content. Which is what we are trying to avoid.

As luck would have it, however, Ning Compress package (version 0.9.4, specifically) just happens to have a push-style uncompressor interface (aptly named as "com.ning.compress.Uncompressor"); and two implementations:

  1. com.ning.compress.lzf.LZFUncompressor
  2. com.ning.compress.gzip.GZIPUncompressor (uses JDK native zlib under the hood)

So why is that fortunate? Because interface they expose is push style:

 public abstract class Uncompressor
 {
  public abstract void feedCompressedData(byte[] comp, int offset, int len) throws IOException;
  public abstract void complete() throws IOException;
}

and is thereby usable to our needs here. Especially when we use additional class called "UncompressorOutputStream", which makes an OutputStream out of Uncompressor and target stream (which is needed for efficient access to content AHC exposes via HttpResponseBodyPart)

6. Show me the code

Here goes:


public class UncompressingFileHandler implements AsyncHandler<File>,
   DataHandler // for Uncompressor
{
 private File file;
 private final OutputStream out;
 private boolean failed = false;
 private final UncompressorOutputStream uncompressingStream;

 public UncompressingFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   // if compressed, pass through uncompressing stream
   if (uncompressingStream != null) {
    part.writeTo(uncompressingStream);
   } else { // otherwise write directly
    part.writeTo(out);
   }
   part.writeTo(out);
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  out.close();
  if (uncompressingStream != null) {
   uncompressingStream.close();
  }
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // must verify that we are getting compressed stuff here:
  String compression = h.getHeaders().getFirstValue("Content-Encoding");
  if (compression != null) {
   if ("lzf".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new LZFUncompressor(this));
   } else if ("gzip".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new GZIPUncompressor(this));
   }
  }
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }

 // DataHandler implementation for Uncompressor; called with uncompressed content:
 public void handleData(byte[] buffer, int offset, int len) throws IOException {
  out.write(buffer, offset, len);
 }
}

Handling gets bit more complicated here, since we have to handle both case where content is compressed; and case where it is not (since server is ultimately responsible for applying compression or not).

And to make call, you also need to indicate capability to accept compressed data. For example, we could define a helper method like:


public File download(String url) throws Exception
{
 AsyncHttpClient ahc = new AsyncHttpClient();
 Request req = ahc.prepareGet(url)
  .addHeader("Accept-Encoding", "lzf,gzip")
  .build();
 ListenableFuture<File> futurama = ahc.executeRequest(req,
new UncompressingFileHandler(new File("download.txt"))); try { // wait for 30 seconds to complete return futurama.get(30, TimeUnit.MILLISECONDS); } catch (TimeoutException e) { throw new IOException("Failed to download due to timeout"); } }

which would use handler defined above.

7. Easy enough?

I hope above shows that while doing incremental, "streaming" processing is bit more work, it is not super difficult to do.

Not even when you have bit of pipelining to do, like uncompressing (or compressing) data on the fly.

Thursday, May 03, 2012

Jackson Data-binding: Did I mention it can do YAML as well?

Note: as useful earlier articles, consider reading "Jackson 2.0: CSV-compatible as well" and "Jackson 2.0: now with XML, too!"

1. Inspiration

Before jumping into the actual beef -- the new module -- I want to mention my inspiration for this extension: the Greatest New Thing to hit Java World Since JAX-RS called DropWizard.

For those who have not yet tried it out and are unaware of its Kung-Fu Panda like Awesomeness, please go and check it out. You won't be disappointed.

DropWizard is a sort of mini-framework that combines great Java libraries (I may be biased, as it does use Jackson), starting with trusty JAX-RS/Jetty8 combination, building with Jackson for JSON, jDBI for DB/JDBC/SQL, Java Validation API (impl from Hibernate project) for data validation, and logback for logging; adding bit of Jersey-client for client-building and optional FreeMarker plug-in for UI, all bundled up in a nice, modular and easily understandable packet.
Most importantly, it "Just Works" and comes with intuitive configuration and bootstrapping system. It also builds easily into a single deployable jar file that contains all the code you need, with just a bit of Maven setup; all of which is well documented. Oh, and the documentation is very accessible, accurate and up-to-date. All in all, a very rare combination of things -- and something that would give RoR and other "easier than Java" frameworks good run for their money, if hipsters ever decided to check out the best that Java has to offer.

The most relevant part here is the configuration system. Configuration can use either basic JSON or full YAML. And as I mentioned earlier, I am beginning to appreciate YAML for configuring things.

1.1. The Specific inspirational nugget: YAML converter

The way DropWizard uses YAML is to parse it using SnakeYAML library, then convert resulting document into JSON tree and then using Jackson for data binding. This is useful since it allows one to use full power of Jackson configuration including annotations and polymorphic type handling.

But this got me thinking -- given that the whole converter implementation about dozen lines or so (to work to degree needed for configs), wouldn't it make sense to add "full support" for YAML into Jackson family of plug-ins?

I thought it would.

2. And Then There Was One More Backend for Jackson

Turns out that implementation was, indeed, quite easy. I was able to improve certain things -- for example, module can use lower level API to keep performance bit better; and output side also works, not just reader -- but in a way, there isn't all that much to do since all module has to do is to convert YAML events into JSON events, and maybe help with some conversions.

Some of more advanced things include:

  • Format auto-detection works, thanks to "---" document prefix (that generator also produces by default)
  • Although YAML itself exposes all scalars as text (unless type hints are enabled, which adds more noise in content), module uses heuristics to make parser implementation bit more natural; so although data-binding can also coerce types, this should usually not be needed
  • Configuration includes settings to change output style, to allow use of more aesthetically pleasing output (for those who prefer "wiki look", for example)

At this point, functionality has been tested with a broad if shallow set of unit tests; but because data-binding used is 100% same as with JSON, testing is actually sufficient to use module for some work.

3. Usage? So boring I tell you

Oh. And you might be interested in knowing how to use the module. This is the boring part, since.... there isn't really much to it.

You just use "YAMLFactory" wherever you would normally use "JsonFactory"; and then under the hood you get "YAMLParser" and "YAMLGenerator" instances, instead of JSON equivalents. And then you either use parser/generator directly, or, more commonly, construct an "ObjectMapper" with "YAMLFactory" like so (code snippet itself is from test "SimpleParseTest.java")


  ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
User user = mapper.readValue("firstName: Billy\n"
+"lastName: Baggins\n"
+"gender: MALE\n"
+"userImage: AQIDBAY=",
User.class);


and to get the functionality itself, Maven dependency is:

<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-yaml</artifactId>
  <version>2.0.0</version>
</dependency>

4. That's all Folks -- until you give us some Feedback!

That's it for now. I hope some of you will try out this new backend, and help us further make Jackson 2.0 the "Universal Java Data Processor"

Saturday, April 07, 2012

Java Type Erasure not a Total Loss -- use Java Classmate for resolving generic signatures

As I have written before ("Why 'java.lang.reflect.Type' Just Does Not Cut It"), Java's Type Erasure can be a royal PITA.

But things are actually not quite as bleak as one might think. But let's start with an actual somewhat unsolvable problem; and then proceed with another important, similar, yet solvable problem.

1. Actual Unsolvable problem: Java.util Collections

Here is piece of code that illustrates a problem that most Java developers either understand, or think they understand:

  List<String,Integer> stringsToInts = new ArrayList<String,Integer>();
List<byte[],Boolean> bytesToBools = new ArrayList<byte[], Boolean>();
assertSame(stringsToInts.getclass(), bytesToBools.getClass();

The problem is that although conceptually two collections seem to act different, at source code level, they are instances of the very same class (Java does not generate new classes for genericized types, unlike C++).

So while compiler helps in keeping typing straight, there is little runtime help to either enforce this, or allow other code to deduce expected type; there just isn't any difference from type perspective.

2. All Lost? Not at all

But let's look at another example. Starting with a simple interface


public interface Callable<IN, OUT> {
public OUT call(IN argument);
}

do you think following is true also?


public void compare(Callable<?,?> callable1, Callable<?,?> callable2) {
assertSame(callable1.getClass(), callable2.getClass());
}

Nope. Not necessarilly; classes may well be different. WTH?

The difference here is that since Callable is an interface (and you can not instantiate an interface), instances must be of some other type; and there is a good chance they are different.

But more importantly, if you use Java ClassMate library (more on this in just a bit), we can even figure out parameterization (unlike with earlier example, where all you could see is that parameters are "a subtype of java.lang.Object"), so for example we can do


// Assume 'callable1' was of type:
// class MyStringToIntList implements Callable<String, List<Integer>> { ... }
  TypeResolver resolver = new TypeResolver();
  ResolvedType type = resolver.resolve(callable1.getClass());
  List<ResolvedType> params = type.typeParametersFor(Callable.class);
// so we know it has 2 parameters; from above, 'String' and 'List<Integer>'
assertEquals(2, params.size()); assertSame(String.class, params.get(0).getErasedType();
// and second type is generic itself; in this case can directly access
ResolvedType resultType = params.get(1);
assertSame(List.class, resultType.getErasedType());
List<ResolvedType> listParams = resultType.getTypeParameters();
assertSame(Integer.class, listParams.get(0).getErasedType();
//or, just to see types visually, try:
String desc = type.getSignature(); // or 'getFullDescription'

How is THIS possible? (fun exercise: pick 5 of your favorite Java experts; ask if above is possible, observe how most of them would have said "nope, not a chance" :-) )

3. Long live generics -- hidden deep, deep within

Basically generic type information is actually stored in class definitions, in 3 places:

  1. When defining parent type information ("super type"); parameterization for base class and base interface(s) if any
  2. For generic field declarations
  3. For generic method declarations (return, parameter and exception types)

It is the first place where ClassMate finds its stuff. When resolving a Class, it will traverse the inheritance hierarchy, recomposing type parameterizations. This is a rather involved process, mostly due to type aliasing, ability for interfaces to use different signatures and so on. In fact, trying to do this manually first looks feasible, but if you try it via all wildcarding, you will soon realize why having a library do it for you is a nice thing...

So the important thing to learn is this: to retain run-time generic type information, you MUST pass concrete sub-types which resolve generic types via inheritance.

And this is where JDK collection types bring in the problem (wrt this particular issue): concerete types like ArrayList still take generic parameters; and this is why runtime instances do not have generic type available.

Another way to put this is that when using a subtype, say:


  MyStringList list = new ArrayList<String>() { }
// can use ClassMate now, a la:
ResolvedType type = resolver.resolve(list.getClass());
// type itself has no parameterization (concrete non-generic class); but it does implement List so: List<ResolvedType> params = type.typeParametersFor(List.class);
assertSame(String.class, params.get(0).getErasedType());

which once again would retain usable amount of generic type information.

4. Real world usage?

Above might seem as an academic exercise; but it is not. When designing typed APIs, many callbacks would actually benefit from proper generic typing. And of special interest are callbacks or handlers that need to do type conversions.

As an example, my favorite Database access library, jDBI, makes use of this functionality (using embedded ClassMate) to figure out data-binding information without requiring extra Class argument. That is, you could pass something like (not an actual code sample):

  MyPojo value = dbThingamabob.query(queryString, handler);

instead of what would more commonly requested:

  MyPojo value = dbThingamabob.query(queryString, handler, MyPojo.class);

and framework could still figure out what kind of thing 'handler' would handle, assuming it was a generic interface caller has to implement.

difference may seem minute, but this can actually help a lot by simplifying some aspects of type passing, and remove one particular mode of error.

5. More on ClassMate

Above actually barely scratch surface of what ClassMate provides. Although it is already tricky to find "simple" parameterization for main-level classes, there are much more trickier things. Specifically, resolving types of Fields and Methods (return types, parameters). Given classes like:

  public interface Base<T> {
    public T getStuff();
  }
  public class ListBase<T> implements Base<List<T>> {
protected T value;
protected ListBase(T v) { value = v; }
public T getstuff() { return value; }
} public class Actual implements ListBase<String> {
public Actual(List<String> value) { super(value; }
}

you might be interested in figuring out, exactly what is the type of return value of "getStuff()". By eyeballing, you know it should be "List<String>", but bytecode does not tell this -- in fact, it just tells it's "T", basically.

But with ClassMate you can resolve it:

  // start with ResolvedType; need MemberResolver
  ResolvedType classType = resolver.resolve(Actual.class);
MemberResolver mr = new MemberResolver(resolver);
ResolvedTypeWithMembers beanDesc = mr.resolve(classType, null, null);
ResolvedMethod[] members = bean.getMemberMethods();
ResolvedType returnType = null;
for (ResolvedMethod m : members) {
if ("getStuff".equals(m.getName())) {
returnType = m.getReturnType();
}
}
// so, we should get
assertSame(List.class, returnType.getErasedType());
ResolvedType elemType = returnType.getTypeParameters().get(0);
assertSame(String.class, elemType.getErasedType();

and get the information you need.

6. Why so complicated for nested types?

One thing that is obvious from code samples is that code that uses ClassMate is not as simple as one might hope. Handling of nested generic types, specifically, is bit verbose in some cases (specifically: when type we are resolving does not directly implement type we are interested in)
Why is that?

The reason is that there is a wide variety of interfaces that any class can (and often does) implement. Further, parameterizations may vary at different levels, due to co-variance (ability to override methods with more refined return types). This means that it is not practical to "just resolve it all" -- and even if this was done, it is not in general obvious what the "main type" would be. For these reasons, you need to manually request parameterization for specific generic classes and interfaces as you traverse type hierarchy: there is no other way to do it.

Friday, April 06, 2012

Notes on upgrading Jackson from 1.9 to 2.0

If you have existing code that uses Jackson version 1.x, and you would like to see how to upgrade to 2.0, there isn't much documentation around yet; although Jackson 2.0 release page does outline all the major changes that were made.

So let's try to see what kind of steps are typically needed (note: this is based on Jackson 2.0 upgrade experiences by @pamonrails -- thanks Pierre!)

0. Pre-requisite: start with 1.9

At this point, I assume code to upgrade works with Jackson 1.9, and does not use any deprecated interfaces (many methods and some classes were deprecated during course of 1.x; all deprecated things went away with 2.0). So if your code is using an older 1.x version, the first step is usually to upgrade to 1.9, as this simplifies later steps.

1. Update Maven / JAR dependencies

The first thing to do is to upgrade jars. Depending on your build system, you can either get jars from Jackson Download page, or update Maven dependencies. New Maven dependencies are:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-annotations</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-core</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.0.0</version> </dependency>

The main thing to note is that instead of 2 jars ("core", "mapper"), there are now 3: former core has been split into separate "annotations" package and remaining "core"; latter contains streaming/incremental parser/generator components. And "databind" is a direct replacement of "mapper" jar.

Similarly, you will need to update dependencies to supporting jars like:

  • Mr Bean: com.fasterxml.jackson.module / jackson-module-mrbean
  • Smile binary JSON format: com.fasterxml.jackson.dataformat / jackson-dataformat-smile
  • JAX-RS JSON provider: com.fasterxml.jackson.jaxrs / jackson-jaxrs-json-provider
  • JAXB annotation support ("xc"): com.fasterxml.jackson.module / jackson-module-jaxb-annotations

these, and many many more extension modules have their own project pages under FasterXML Git repo.

2. Import statements

Since Jackson 2.0 code lives in Java packages, you will need to change import statements. Although most changes are mechanical, there isn't strict set of mappings.

The way I have done this is to simply use an IDE like Eclipse, and remove all invalid import statements; and then use Eclipse functionality to find new packages. Typical import changes include:

  • Core types: org.codehaus.jackson.JsonFactory/JsonParser/JsonGenerator -> com.fasterxml.jackson.core.JsonFactory/JsonParser/JsonGenerator
  • Databind types: org.codehaus.jackson.map.ObjectMapper -> com.fasterxml.jackson.databind.ObjectMapper
  • Standard annotations: org.codehaus.jackson.annotate.JsonProperty -> com.fasterxml.jackson.annotation.JsonProperty

It is often convenient to just use wildcards imports for main categories (com.fasterxml.jackson.core.*, com.fasterxml.jackson.databind.*, com.fasterxml.jackson.annotation.*)

3. SerializationConfig.Feature, DeserializationConfig.Feature

The next biggest change was that of refactoring on/off Features, formerly defined as inner Enums of SerializationConfig and DeserializationConfig classes. For 2.0, enums were moved to separate stand-alone enums:

  1. DeserializationFeature contains most of entries from former DeserializationConfig.Feature
  2. SerializationFeature contains most of entries from former SerializationConfig.Feature

Entries that were NOT moved along are ones that were shared by both, and instead were added into new MapperFeature enumeration, for example:

  • SerializationConfig.Feature.DEFAULT_VIEW_INCLUSION became MapperFeature.DEFAULT_VIEW_INCLUSION

4. Tree model method name changes (JsonNode)

Although many methods (and some classes) were renamed here and there, mostly these were one-offs. But one area where major naming changes were done was with Tree Model -- this because 1.x names were found to be rather unwieldy and unnecessarily verbose. So we decided that it would make sense to try to do a "big bang" name change with 2.0, to get to a clean(er) baseline.

Changes made were mostly of following types:

  • getXxxValue() changes to xxValue(): getTextValue() -> textValue(), getFieldNames() -> fieldNames() and so on.
  • getXxxAsYyy() changes to asYyy(): getValueAsText() -> asText()

5. Miscellaneous

Some classes were removed:

  • CustomSerializerFactory, CustomDeserializerFactory: should instead use Module (like SimpleModule) for adding custom serializers, deserializers

6. What else?

This is definitely an incomplete list. Please let me know what I missed, when you try upgrading!

Thursday, March 29, 2012

Jackson 2.0: CSV-compatible as well

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released"; or, for XML support, see "Not just for JSON any more -- also in XML")

Now that I talked about XML, it is good to follow up with another commonly used, if somewhat humble data format: Comma-Separated Values ("CSV" for friends and foes).

As you may have guessed... Jackson 2.0 supports CSV as well, via jackson-dataformat-csv project, hosted at GitHub

For attention-span-challenged individuals, checkout Project Page: it contains tutorial that can get you started right away.
For others, let's have a slight detour talking through design, so that additional components involved make some sense.

1. In the beginning there was a prototype

After completing Jackson 1.8, I got to one of my wishlist projects: that of being able to process CSV using Jackson. The reason for this is simple: while simplistic and under-specified, CSV is very commonly used for exchanging tabular datasets.
In fact, it (in variant forms, "pipe-delimited", "tab-delimited" etc) may well be the most widely used data format for things like Map/Reduce (Hadoop) jobs, analytics processing pipelines, and all kinds of scripting systems running on Unix.

2. Problem: not "self-describing"

One immediate challenge is that of lacking information on meaning of data, beyond basic division between rows and columns for data. Compared to JSON, for example, one neither necessarily knows which "property" a value is for, nor actual expected type of the value. All you might know is that row 6 has 12 values, expressed as Strings that look vaguely like numbers or booleans.

But then again, sometimes you do have name mapping as the first row of the document: if so, it represents column names. You still don't have datatype declarations but at least it is a start.

Ideally any library that supports CSV reading and writing should support different commonly used variations; from optional header line (mentioned above) to different separators (while name implies just comma, other characters are commonly used, such as tabs and pipe symbol) and possibly quoting/escaping mechanisms (some variants allow backslash escaping).
And finally, it would be nice to expose both "raw" sequence and high-level data-binding to/from POJOs, similar to how Jackson works with JSON.

3. So expose basic "Schema" abstraction

To unify different ways of defining mapping between property names and columns, Jackson now supports general concept of a Schema. While interface itself is little more than a tag interface (to make it possible to pass an opaque type-specific Schema instance through factories), data-format specific subtypes can and do extend functionality as appropriate.

In case of CSV, Schema (use of which is optional -- more on "raw" access later on) defines:

  1. Names of columns, in order -- this is mandatory
  2. Scalar datatypes columns have: these are coarse types, and this information is optional

Note that the reason that type information is strictly optional is that when it is missing, all data is exposed as Strings; and Jackson databinding has extensive set of standard coercions, meaning that things like numbers are conveniently converted as necessary. Specifying type information, then, can help in validating contents and possibly improving performance.

4. Constructing "CSV Schema" objects

How does one get access to these Schema objects? Two ways: build manually, or construct from a type (Class).

Let's start with latter, using same POJO type as with earlier XML example:


  public enum Gender { MALE, FEMALE };
  // Note: MUST ensure a stable ordering; either alphabetic, or explicit
  // (JDK does not guarantee order of properties)
  @JsonPropertyOrder({ "name", "gender", "verified", "image" })
   public class User {
   public Gender gender;
   public String name;
   public boolean verified;
   public byte[] image;
  }
// note: we could use std ObjectMapper; but CsvMapper has convenience methods CsvMapper mapper = new CsvMapper(); CsvSchema schema = mapper.schemaFor(User.class);

or, if we wanted to do this manually, we would do (omitting types, for now):


  CsvSchema schema = CsvSchema.builder()
.addColumn("name") .addColumn("gender")
.addColumn("verified")
.addColumn("image")
.build();

And there is, in fact, the third source: reading it from the header line. I will leave that as an exercise for readers (check the project home page).

Usage is identical, regardless of the source. Schemas can be used for both reading and writing; for writing they are only mandatory if output of the header line is requested.

5. And databinding we go!

Let's consider the case of reading CSV data from file called "Users.csv", entry by entry. Further, we assume there is no header row to use or skip (if there is, the first entry would be bound from that -- there is no way for parser auto-detect a header row, since its structure is no different from rest of data).

One way to do this would be:


  MappingIterator<Entry> it = mapper
.reader(User.class)
.with(schema)
.readValues(new File("Users.csv"());
List<User> users = new ArrayList<User>();
while (it.hasNextValue()) {
User user = it.nextValue();
// do something?
list.add(user);
}
// done! (FileReader gets closed when we hit the end etc)

Assuming we wanted instead to write CSV, we would use something like this. Note that here we DO want to add the explicit header line for fun:


  // let's force use of Unix linefeeds:
ObjectWriter writer = mapper
.writer(schema.withLineSeparator("\n"));
writer.writeValue(new File("ModifiedUsers.csv"), users);

one feature that we took advantage of here is that CSV generator basically ignores any and all array markers; meaning that there is no difference whether we try writing an array, List or just basic sequence of objects.

6. Data-binding (POJOs) vs "Raw" access

Although full data binding is convenient, sometimes we might just want to deal with a sequence of arrays with String values. You can think of this as an alternative to "JSON Tree Model"; an untyped primitive but very flexible data structure.

All you really have to do is to omit definition of the schema (which will then change observe token sequence); and make sure not to enable handling of header line
For this, code to use (for reading) looks something like:


  CsvMapper mapper = new CsvMapper();
MappingIterator<Object[]> it = mapper
.reader(Object[].class)
.readValues( "1,null\nfoobar\n7,true\n");
Object[] data = it.nextValue();
assertEquals(2, data.length);
// since we have no schema, everything exposed as Strings, really
assertEquals("1", data[0]);
assertEquals("null", data[1]);

Finally, note that use of raw entries is the only way to deal with data that has arbitrary number of columns (unless you just want to add maximum number of bogus columns -- it is ok to have less data than columns).

7. Sequences vs Arrays

One potential inconvenience with access is that by default CSV is exposed as a sequence of "JSON" Objects. This works if you want to read entries one by one.

But you can also configure parser to expose data as an Array of Objects, to make it convenient to read all the data as a Java array or Collection (as mentioned earlier, this is NOT required when writing data, as array markers have no effect on generation).

I will not go into details, beyond pointing out that the configuration to enable addition "virtual array wrapper" is:


mapper.ensable(CsvParser.Feature.WRAP_AS_ARRAY);

and after this you can bind entries as if they came in as an array: both "raw" ones (Object[][]) and typed (List<User> and so on).

8. Limitations

Compared to JSON, CSV is more limited data format. So does this limit usage of Jackson CSV reader?

Yes. The main limitation is that column values need to essentially be scalar values (strings, numbers, booleans). If you do need more structured types, you will need to work around this, usually by adding custom serializers and deserializers: these can then convert structured types into scalar values and back. However, if you end up doing lots of this kind of work, you may consider whether CSV is the right format for you.

9. Test Drive!

As with all the other JSON alternatives, CSV extension is really looking forward to more users! Let us know how things work.

Saturday, December 10, 2011

Sorting large data sets in Java using Java-merge-sort

When sorting data sets in Java, life is easy if amount of data to process is not huge: JDK has the basic sorting covered well. But if your data is big enough not to fit in memory you are on your own.

This often means that developers use basic Unix 'sort' command line tool. But while it is a good package for basic textual sort -- and when combined with other Unix pipeline tools, on whole range of column-based alternatives -- it is limited in two sometimes crucial aspects:

  1. Defining custom sorting (collation) order is difficult
  2. Interacting with external tools (including 'sort') from within JVM is inherently difficult

But there is one less well-known alternative available: a relatively new Java Open Source library available from Github: java-merge-sort.

1. What is java-merge-sort

Java-merge-sort library implements basic external merge sort, sorting algorithm typically used for disk-backed sorting. Input and output are not limited to files; any java.io.InputStream / java.io.OutputStream implementation will work just fine.

Sorting library is designed to work as an ad-hoc tool (in fact, Jar itself can be used as 'sort' tool) as well as a component of bigger data processing systems.

Notable features include:

  • Fully customizable input and output handlers, used for reading external data into objects to be sorted and writing them back out (handlers defined by providing factories that create instances)
  • Optional custom comparators (if items read do not implement Comparable)
  • Configurable merge factor (number of inputs merged in each pass); max memory usage (which limits length of pre-sort segments -- more memory used, fewer rounds needed)
  • Configurable temporary file handling (defaults to using JDK default temp files, deletions)
  • Ability to cancel sorting jobs asynchronously

2. Using as command-line tool

A simple way to use the library is as a stand-alone command tool; while there are no specific benefits over standard 'sort' command (assuming one is available), it can be used to test functionality. Usage is as simple as:

  java -jar java-merge-sort-[VERSION].jar [input-file]

where 'input-file' is optional (if it is missing, will read from standard input); and sorted output will be displayed to standard output.
Commonly one will then redirect output to a file:

  java -jar java-merge-sort-[VERSION].jar unsorted.txt > sorted.txt

Under the hood, this will run code from class com.fasterxml.sort.std.TextFileSorter

which is both a concrete sorter implementation, and defines main() method to act as a command-line tool.
Sort will be done line-by-line, using basic lexicographic (~= alphabetic) sort which works for common encodings like ASCII, Latin-1 and UTF-8.
Command will limit memory usage to 50% of maximum heap.

3. Simple programmatic usage: textual file sort

More commonly java-merge-sort is used as a component of bigger processing system. So let's have a look at basic usage as 'sort' replacement, i.e. sorting text files.

Code to sort an input file into output file is:

  public void sort(InputFile in, OutputFile out) throws IOException
{
TextFileSorter sorter = new TextFileSorter(new SortConfig().withMaxMemoryUsage(20 * 1000 * 1000)); // use up to 20 megs
sorter.sort(new FileInputStream(in), new FileOutputStream(out));
// note: sort() will close InputStream, OutputStream after sorting
}

which uses default configuration except for maximum memory usage (default is 40 megs: which often works just fine)

4. Advanced usage: sort JSON files

Above example showed one benefit -- easy integration from Java code -- but the real power comes from the fact that we can change input and output handlers to deal with all kinds of data, to support advanced sorting behavior. To demonstrate this, let's consider case where input is a file that contains JSON entries: each line contains a JSON Object like:

{ "firstName" : "Joe", "lastName" : "Plumber", "age":58 }

and which we want to sort primary by age, from lowest to highest, and than by name, alphabetic, first by last name, then by first name.
We can bind this to a Java class like:


  public class Person implements Comparable<Person>
  {
    public int age;
    public String firstName, lastName;

    public int compareTo(Person other) {
     int diff = age - other.age;
     if (diff == 0) {
      diff = lastName.compareTo(other.lastName);
      if (diff == 0) {
       diff = firstName.compareTo(other.firstName);
      }
     }
     return diff;
    }
  }

using Jackson JSON processor, and then sort entries using java-merge-sort.

Code to do this is bit more complicated; let's start with Sorter implementation:


import java.io.*;

import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.type.JavaType;

import com.fasterxml.sort.std.StdComparator;

public class JsonPersonSorter extends Sorter<Person>
{
  public JsonFileSorter() throws IOException {
    this(entryType, new SortConfig(), new ObjectMapper());
  }

  public JsonFileSorter(SortConfig config, ObjectMapper mapper) throws IOException {
    this(mapper.constructType(Person.class), config, mapper);
  }

  public JsonFileSorter(JavaType entryType, SortConfig config, ObjectMapper mapper) throws IOException {
    super(config, new ReaderFactory(mapper.reader(entryType)),
      new WriterFactory(mapper),
      new StdComparator<Person>());
  }
}

and supporting reading-related classes are:
public class ReaderFactory extends DataReaderFactory<Person>
{
  private final ObjectReader _reader;
  public ReaderFactory(ObjectReader r) {
    _reader = r;
  }

  @Override
  public DataReader<Person> constructReader(InputStream in) throws IOException {
    MappingIterator<Person> it = _reader.readValues(in);
    return new Reader<Person>(it);
  }
}

public class Reader<E> extends DataReader<E>
{
  protected final MappingIterator<E> _iterator;
 
  public Reader(MappingIterator<E> it) {_i terator = it; }

  @Override
  public E readNext() throws IOException {
    if (_iterator.hasNext()) {
      return _iterator.nextValue();
    }
    return null;
  }

// not a good estimation, has to do for now (should count String lengths, estimate) @Override public int estimateSizeInBytes(E item) { return 100; } @Overridepu blic void close() throws IOException { } // auto-closes when we reach end }


and writing-related classes:
static class WriterFactory<W> extends DataWriterFactory<W>
{
  protected final ObjectMapper _mapper;

  public WriterFactory(ObjectMapper m) {
    _mapper = m;
  }

  @Override
  public DataWriter<W> constructWriter(OutputStream out) throws IOException {
    return new Writer<W>(_mapper, out);
  }
}

static class Writer<E> extends DataWriter<E>
{
  protected final ObjectMapper _mapper;
  protected final JsonGenerator _generator;

  public Writer(ObjectMapper mapper, OutputStream out) throws IOException {
    _mapper = mapper;
    _generator = _mapper.getJsonFactory().createJsonGenerator(out);
  }

  @Override
  public void writeEntry(E item) throws IOException {
    _mapper.writeValue(_generator, item);
    // not 100% necesary, but for readability, add linefeeds
    _generator.writeRaw('\n');
  }

  @Override
  public void close() throws IOException {
    _generator.close();
  }
}

So with all of above, we could sort a file using: JsonFileSorter sorter = new JsonFileSorter(); sorter.sort(inputFile, outputFile);

Which is pretty much identical to earlier code to sort a File; just with different reader+writer configuration.

5. Even more advanced: compress intermediate files?

There are many ways to customize processing; and one interesting idea is to actually compress intermediate files (results of pre-sort, inputs to later merge rounds); preferably using ultra-fast Java compressor like Ning LZF.

Code to do this would not be long -- it's just matter of changing DataReaderFactory and DataWriterFactory to read/write files -- but I will leave this up as an exercise to reader. :-)

6. More speed: configurations

There are two main configuration switches that can be used to improve speed:

  1. Amount of memory used for pre-sorting: more memory to use, fewer sorted segments are needed -- in fact, it may be possible to do the whole sort in memory. Default memory to use is 40 megabytes (to accomodate for default JDK max heap size of 64 megs)
  2. Number of inputs merged per round: default is 16 inputs, which should be enough; but you can increase this to reduce number of merge rounds needed (or reduce if you want to minimize number of open files, in case you encounter problems)

7. Future ideas

Looking at JSON sorting code, I realize that it would be easy to create a generic sorter that uses Jackson. And not only would this support sorting JSON files, but also files that use any other format Jackson supports, such as Smile (out of the box, with 'SmileFactory'), XML, CSV and BSON!

Wednesday, September 28, 2011

Advanced filtering with Jackson, Json Filters

I wrote a bit earlier on "filtering properties with Jackson". While it was comprehensive in that all main methods of filtering were covered, there wasn't much depth. Specifically, only very basic usage of Json Filters (@JsonFilter annotation, SimpleFilterProvider as provider) was considered. This approach does allow more dynamic filtering than, say, @JsonView, but it is still somewhat limited. So let's consider more advanced customizability.

1. Refresher on Json Filters

Ok, so the basic idea with Json Filters is that:

  1. Classes can have an associated Filter Id, which defines logical filter to use.
  2. A provider is needed to get the actual filter instance to use, given id: this will be configured by assigning a FilterProvider (such as 'SimpleFilterProvider') to ObjectMapper or ObjectWriter.
  3. Jackson will dynamically (and efficiently) resolve filter given class uses, dynamically, allowing per-call reconfiguration of filtering.

From this it is clear that there are 2 main things you can configure: mechanism that is used to find Filter id of a given class, and mechanism used for mapping this id to actual filter used (implementation of which can be as complicated as you want).

So let's have a look at both parts.

2. Configuring mapping from id to filter instance

Of mechanisms, latter one may be easier to understand and use: one just has to implement 'FilterProvider', which has but one method to implement:

  public abstract class FilterProvider {
    public abstract BeanPropertyFilter findFilter(Object filterId);
  }

given this, 'SimpleFilterProvider' is little more than a Map<String,BeanPropertyFilter>, except for adding couple of convenience factory methods that build 'SimpleBeanPropertyFilter' instances given property names, so you typically just instantiate one with calls like:

  SimpleBeanPropertyFilter filter = SimpleBeanPropertyFilter.filterOutAllExcept("a"));

which would out all properties except for one named "a". This filter is then configured with ObjectMapper like so:

  FilterProvider fp = new SimpleFilterProvider().addFilter("onlyAFilter", filter);
  objectMapper.writer(fp).writeValueAsString(pojo);

which would, then, apply to any Java type configured to use filter with id "onlyAFilter".

3. Configuring discovery of filter id

From above example we know we need to indicate classes that are to use our "onlyAFilter". The default mechanism is to use:

  @JsonFilter("onlyAFilter")
  public class FilteredPOJO {
    //...
  }

But this is just the default. How so? The way Jackson figures out its annotation-based configuration is actually indirect, and fully customizable: all interaction is through configured 'AnnotationIntrospector' object, which amongst other things defines this method:

  public Object findFilterId(AnnotatedClass ac);

which is called when serializer needs to determine id of the filter to apply (if any) for given class. Since the default implementation (org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector) has everything else working fine, what we can do is to sub-class it and override this method.
For example:

  public class MyFilteringIntrospector extends JacksonAnnotationIntrospector
  {
    @Override
    public Object findFilterId(AnnotatedClass ac) {
      // First, let's consider @JsonFilter by calling superclass
      Object id = super.findFilterId(ac);
      // but if not found, use our own heuristic; say, just use class name as filter id, if there's "Filter" in name:
      if (id == null) {
        String name = ac.getName();
        if (name.indexOf("Filter") >= 0) {
          id = name;
        }
      }
      return id;
    }
  }

Above functionality is just to show what is possible, not that it makes sense. Alternatively you could of course define your own annotations to check; or have List of known class names, check class definition or interfaces type implements. The main point is just that you are not limited to using @JsonFilter annotation, but can use pretty much any logic you want, within limits of your coding skills.

The only caveat is that the resolution from Class to matching id is only guaranteed to be called once per ObjectMapper; so any variation in filtering of specific class needs to happen at either mapping of id to filter, or within filter itself.

4. Don't be afraid of sub-classing (Jackson)AnnotationIntrospector

Actually, the key take away might as well be the fact that AnnotationIntrospector is designed to be customizable. It was initially created to allow easy reuse of JAXB annotations (via JAXBAnnotationIntrospector; combining things with AnnotationIntrospector.Pair); but it is also a very powerful general-purpose customization mechanism. But at this point quite underused one at that.

5. Addendum

Some additional notes based on feedback I received:

  • Custom BeanPropertyFilter implementations are obviously powerful too: not only can they completely change what (if anything) gets written for property, they can base this on all configuration accessible via SerializerProvider which is passed to serializeAsField(): for example, it can check to see what serialization view is available by calling 'provider.getSerializationView()'.

Monday, April 04, 2011

Introducing "jvm-compressor-benchmark" project

I recently started one new open source project; this time being inspired by success of another OS project I had been involved in, project is "jvm-serializers" benchmark originally started by Eishay and built by a community of java databinder/serializer experts. What has been great with this project has been amount of energy it seemed ot feed back to development of serializers: highly visible competition for best performance seems to have improved efficiency of libraries a lot. I only wish we had historical benchmark data to compare to see exactly how far have the fastest Java serializers come.

Anyway, I figured that there are other groups of libraries where high performance matters, but where there is lack of actual solid benchmarking information. So while there are a few compression performance benchmarks, they are often non-applicable for Java developers: partly because they just compare native compressor codecs, and partly because focus is more often only on space-efficiency (how much compression is achieved) with little consideration of performance of compression. The last part is particularly frustrating as in many use cases there is significant trade-off between space and time efficiency (compression rate vs time used for compression).

So, this is where the new project -- "jvm-compressor-benchmark" -- comes from. I hope it will allow fair comparison of compression codecs available on JVM, to be used by Java and other JVM lagnuages; and also bring in some friendly competition between developers of compression codecs.

First version compares half a dozen of compression formats and codecs, from the venerable deflate/gzip (which offers pretty good compression ratio with decent speed) to higher-compression-but-slower-operation alternatives (bzip2) and lower-compression-but-very-fast alternatives like lzf, quicklz and the new kid on the block, Snappy (via JNI).

And although the best way to evaluate results is to run tests on your machine, using data sets you care about (which I strongly encourage!), Project wiki does have some pretty diagrams for tests run on "standard" data sets gathered from the web.

Anyway: please check the project out -- at the very least it should give you an idea of how many options there are above and beyond basic JDK-provided gzip.

ps. Contributions are obviously also welcome -- anyone willing to tackle Java version of 7-zip's LZMA, for example, would be most welcome!

Saturday, March 12, 2011

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

  1. It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
  2. It is the only open-source Java parser that can do non-blocking parsing (aka asynchonous, or async, parsing)

Former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast for parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor real documentation for use of non-blocking part of Aalto; and also because this part of functionality has been only completed fairly recently (while some parts of functionality were written up to two years ago, last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. Non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

Basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of parser using a stream or reader to read content, and blocking the thread if none is available, content is rather "pushed" to parser; and parser will give out processed events if there is enough content available. This is similar to how many C parsers work; as well as operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater).

The main benefit of non-blocking operation is ability to process multiple XML input sources without having to allocate one thread per source, same benefit as that NIO has for basic web services. And in fact, having a non-blocking parser is something that could benefit non-blocking web services a lot: without such parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; exact number depends on platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, cand benefit from this mode of operation.
In addition, another related benefit is that memory usage of non-blocking parser can be more close bounded: since limited amount of input is buffered at any given point, amount of working memory can be more limited (at least when not forcing coalecing of XML text segments).

On downside, writing code to use non-blocking parsing can be slightly more complex to write: and given lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts doing initially. But it can be a very useful technique for subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into parser, single byte at a time. In actual production use one would typically feed content block at a time; either fully read blocks, or chunks of contents as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by input source (Netty), or be input source specific.


  byte[] XML = "<html>Very <b>simple</b> input document!</html>";
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(buf, inputPtr++, 1);
if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
feeder.endOfInput(); } } // and once we have full event, we just dump out event type (for now) System.out.println("Got event of type: "+type); // could also just copy event as is, using Stax, or do any other normal non-blocking handling: // xmlStreamWriter.copyEventFromReader(asyncReader, false); } while (type != END_DOCUMENT); asyncReader.close();

And that's it. There are actually just couple of additional things needed to do non-blocking parsing:

  1. Use of regular Stax API, with just a single extension, introduction of new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned if there isn't enough content buffered to fully construct a token to return
  2. Feeding of content using AsyncInputFeeder (instance of which is accessed via AsyncXMLStreamReader, extension of basic XMLStreamReader)
  3. Indicating end-of-content via feeder when all content has been read

Which makes operation bit more complicated than use of straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

  • Coalescing mode has not been implemented for non-blocking Stax. Since use of coalescing (of all adjacent text segments, as per Stax spec) is probably less important for non-blocking use cases than blocking ones (as it will increase need for buffering, possible increase latency), it was less as the last major piece to be completed.
  • There isn't yet non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to SAX API itself (one just has to call "XMLReader.parse()" multiple times; but as it is based on same parser core as Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, non-blocking mode is less well tested than blocking mode: blocking mode can use full basic StaxTest suite, used succesfully for years with Woodstox (and for Aalto for more than a year as well).

Monday, February 28, 2011

Jackson: not just for JSON, Smile or BSON any more -- Now With XML, too!

(NOTE: see the newer article on "Jackson 2.0 with XML")

One of first significant new Jackson extension projects (result of Jackson 1.7 release which made it much easier to provide modular extensions) is jackson-xml-databind, hosted at GitHub. Although this extension is still in its pre-1.0 development phase, the latest released version is fully usable as is and is even in some limited production use by some brave developers (running on Google AppEngine, of all things!).

So it is probably a good idea to now give a brief overview of what this project is all about.

1. What is jackson-xml-databind?

Jackson-xml-databind comes in a small package (jar is only about 55 kB) , and is used with Jackson data binding functionality (jackson-mapper jar). It provides basic replacement for JsonFactory, JsonParser and JsonGenerator components of Jackson Streaming API, and allows reading and writing of XML instead of JSON, in context of generic Jackson data binding functionality. In addition, core ObjectMapper is also sub-classed to provide customized versions of couple of other provider types, so typically all usage is done by creating com.fasterxml.jackson.xml.XmlMapper instead of ObjectMapper, and using it for data binding.

2. What is it used for?

This package is used to read XML and convert it to POJOs, as well as to write POJOs as XML. In this respect it is very similar to JAXB (javax.xml.bind) package; and an alternative for many other Java XML data binding packages such as XStream and JibX. Given Jackson support for JAXB annotations, it can be especially conveniently used as a JAXB replacement in many cases.

Functionality supported is in some ways a subset of JAXB, and in other ways a superset: XML-specific functionality is more limited (no explicit support for XML Schema), but general data binding functionality is arguably more powerful (since it is full set of Jackson functionality).

Two obvious benefits of this package compared to JAXB or other existing XML data binding solutions (like XStream) are superior performance -- with fast Stax XML parser, this is likely the fastest data binding solution on Java platform (see jvm-serializers for results) -- and extensive and customizable data POJO conversion functionality, using all existing Jackson annotations and configuration options. The main downside currently is potential immaturity of the package; however, this only applies to interaction between mature XML packages (stax implementation) and Jackson data binder (which is also fairly mature at this point).

3. So how do I use it?

If you know how to use Jackson with JSON, you know almost everything you need to use this package. The only other thing you need to know is that there has to be a Stax XML parser/generator implementation available. While JDK 1.6 provides one implementation, your best best is using something bit more efficient, such as Woodstox or Aalto. Both should work fine; Aalto is faster of two, but Woodstox is a more mature choice. So you will probably want to include one of these Stax implementations when using jackson-xml-databind.

Other than this, all you need to do is to construct XmlMapper:

  XmlMapper mapper = new XmlMapper(); // can also specify XmlFactory to 
  use, to override Stax factories used

and use it like you would any other ObjectMapper, like so:

  User user = new User(); // from Jackson-in-five-minutes sample
String xml = mapper.writeValueAsString(user);

and what you would get is something like:

<User>
  <name>
    <first>Joe</first>
    <last>Sixpack</last>
  </name>
  <verified>true</verified>
  <gender>MALE</gender>
  <userImage>AQIDBAU=</userImage>
</User>

which is equivalent of JSON serialization that would look like:

{
  "name":{
    "first":"Joe",
    "last":"Sixpack"
  },
  "verified":true,
  "gender":"MALE",
  "userImage":"AQIDBAU="
}

Pretty neat eh?

Oh, and reverse direction obviously works similarly:

  User user = mapper.readValue(xml, User.class);

There is really nothing extra-ordinary in it usage; just another way to use Jackson for slicing and dicing your POJOs.

4. Limitations

While existing version works pretty well in general, there are some limitations. These mostly stem from the basic difference between XML and JSON logical models; and specifically affect handling of Lists/arrays. XmlMapper for example only allows so-called "wrapped" lists (for now); meaning that there is one wrapper XML element for each List or array property, and separate element for each List item.

Compared to JAXB (and related to JAXB annotation support), no DOM support is included; meaning, it is not possible to use converters that take or produce DOM Elements.

With respect to Jackson functionality, while polymorphic type information does work, some combinations of settings may not work as expected.

And given project's pre-1.0 status, testing is not yet as complete as it needs to be, so other rough edges may also be found. But with help of user community I am sure we can polish these up pretty quickly.

5. Feedback time!

So what is needed most at this point? Users, usage, and resulting bug (or, possibly, success) reports! Seriously, more usage there is, faster we can get the project up to 1.0 release.

Happy hacking!

Related Blogs

(by Author (topics))

Powered By

Powered by Thingamablog,
Blogger Templates and Discus comments.

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.