Saturday, August 18, 2012

Replacing standard JDK serialization using Jackson (JSON/Smile), java.io.Externalizable

1. Background

The default Java serialization provided by the JDK is a two-edged sword: on one hand, it is a simple, convenient way to "freeze and thaw" Objects you have, handling just about any kind of Java object graph. It is possibly the most powerful serialization mechanism on the Java platform, bar none.

But on the other hand, its shortcomings are well-documented (and I hope, well-known) at this point. Problems include:

  • Poor space-efficiency (especially for small data), due to inclusion of all class metadata: that is, size of output can be huge, larger than about any alternative, including XML
  • Poor performance (especially for small data), partly due to size inefficiency
  • Brittleness: smallest changes to class definitions may break compatibility, preventing deserialization. This makes it a poor choice for both data exchange between (Java) systems as well as long-term storage

Still, the convenience factor has led many systems to use JDK serialization as their default serialization method.

Is there anything we could do to address the downsides listed above? Plenty, actually. Although there is not much more to be done for the default implementation (the JDK serialization implementation is in fact ridiculously well optimized for what it tries to achieve -- it's just that the goal is very ambitious), one can customize what gets used by making objects implement the java.io.Externalizable interface. If so, the JDK will happily use the alternate implementation under the hood.
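To make the hook concrete before bringing Jackson into the picture, here is a minimal sketch of a hand-written Externalizable class, using nothing but the JDK. The names ExternalizableDemo and Point are hypothetical, chosen just for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class ExternalizableDemo {
    // Hypothetical example class; name and fields are for illustration only
    static final class Point implements Externalizable {
        int x;
        int y;

        // Externalizable requires a public no-argument constructor
        public Point() { }

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            // we decide what the payload is: two ints, no per-field metadata
            out.writeInt(x);
            out.writeInt(y);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            // fills in the instance the JDK already created for us
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Round-trips a Point through ordinary ObjectOutputStream/ObjectInputStream
    static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Point) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // prints 3,4
    }
}
```

The calling code never mentions Externalizable: it just uses writeObject() and readObject(), and the JDK routes the payload through the two custom methods.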

Now: although writing custom serializers may be fun sometimes -- and for a specific case, you can actually write a very efficient solution as well, given enough time -- it would be nice if you could use an existing component to address the listed shortcomings.

And that's what we'll do! Here's one possible way to improve on all problems listed above:

  1. Use an efficient Jackson serializer (to produce either JSON, or perhaps more interestingly, Smile binary data)
  2. Wrap it in nice java.io.Externalizable, to make it transparent to code using JDK serialization (albeit not transparent for maintainers of the class -- but we will try minimizing amount of intrusive code)

2. Challenges with java.io.Externalizable

First things first: while conceptually simple, there are a couple of rather odd design decisions that make use of java.io.Externalizable a bit tricky:

  1. Instead of instances of java.io.InputStream and java.io.OutputStream, java.io.ObjectInput and java.io.ObjectOutput are passed; and they do NOT extend the stream versions (even though they define mostly the same methods!). This means additional wrapping is needed
  2. Externalizable.readExternal() requires updating the object itself, rather than constructing a new instance: most serialization frameworks do not support such an operation
  3. How to access external serialization library, as no context is passed to either of methods?

These are not fundamental problems for Jackson: the first one requires use of adapter classes (see below), and the second means we need to use the "updating reader" approach that Jackson has supported for a while (yay!). And to solve the third, we have at least two choices: use of ThreadLocal for passing an ObjectMapper; or use of a static helper class (approach shown below).
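For the curious, the ThreadLocal option could look something like the following sketch. It is plain JDK code: SerializerContext is a hypothetical name, and the type parameter T stands in for whatever serializer you want to pass along (an ObjectMapper, in the approach of this post).

```java
import java.util.function.Supplier;

// Hypothetical holder: T stands in for the serializer object you need
// (for the approach in this post, that would be an ObjectMapper).
final class SerializerContext<T> {
    private final ThreadLocal<T> current = new ThreadLocal<>();

    // Runs 'action' with 'serializer' visible to code on this thread,
    // restoring the previous value afterwards (so calls can nest).
    <R> R with(T serializer, Supplier<R> action) {
        T previous = current.get();
        current.set(serializer);
        try {
            return action.get();
        } finally {
            if (previous == null) {
                current.remove();
            } else {
                current.set(previous);
            }
        }
    }

    // Would be called from readExternal()/writeExternal() to find the serializer
    T get() {
        T value = current.get();
        if (value == null) {
            throw new IllegalStateException("no serializer set for this thread");
        }
        return value;
    }
}
```

The caller would wrap its writeObject() or readObject() call in with(...), and the Externalizable methods would call get(). This buys per-call configuration at the cost of touching the calling code, which is why the rest of this post sticks with the static helper.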

So here are the helper classes we need:

final static class ExternalizableInput extends InputStream
{
  private final ObjectInput in;

  public ExternalizableInput(ObjectInput in) {
    this.in = in;
  }

  @Override
  public int available() throws IOException {
    return in.available();
  }

  @Override
  public void close() throws IOException {
    in.close();
  }

  @Override
  public boolean markSupported() {
    return false;
  }

  @Override
  public int read() throws IOException {
    return in.read();
  }

  @Override
  public int read(byte[] buffer) throws IOException {
    return in.read(buffer);
  }

  @Override
  public int read(byte[] buffer, int offset, int len) throws IOException {
    return in.read(buffer, offset, len);
  }

  @Override
  public long skip(long n) throws IOException {
    return in.skip(n);
  }
}

final static class ExternalizableOutput extends OutputStream
{
  private final ObjectOutput out;

  public ExternalizableOutput(ObjectOutput out) {
    this.out = out;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }

  @Override
  public void write(int ch) throws IOException {
    out.write(ch);
  }

  @Override
  public void write(byte[] data) throws IOException {
    out.write(data);
  }

  @Override
  public void write(byte[] data, int offset, int len) throws IOException {
    out.write(data, offset, len);
  }
}

/* Use of helper class here is unfortunate, but necessary; alternative would
* be to use ThreadLocal, and set instance before calling serialization.
* Benefit of that approach would be dynamic configuration; however, this
* approach is easier to demonstrate.
*/
class MapperHolder
{
  private final ObjectMapper mapper = new ObjectMapper();

  private final static MapperHolder instance = new MapperHolder();

  public static ObjectMapper mapper() {
    return instance.mapper;
  }
}

Given these classes, we can implement the Jackson-for-default-serialization solution.

3. Let's Do a Serialization!

So with that, here's a class that is serializable using Jackson JSON serializer:


  static class MyPojo implements Externalizable
  {
        public int id;
        public String name;
        public int[] values;

        public MyPojo() { } // for deserialization
        public MyPojo(int id, String name, int[] values)
        {
            this.id = id;
            this.name = name;
            this.values = values;
        }

        public void readExternal(ObjectInput in) throws IOException {
            MapperHolder.mapper().readerForUpdating(this).readValue(new ExternalizableInput(in));
        }

        public void writeExternal(ObjectOutput oo) throws IOException {
            MapperHolder.mapper().writeValue(new ExternalizableOutput(oo), this);
        }
  }

To use that class, use JDK serialization normally:


  // serialize as bytes (to demonstrate):
  MyPojo input = new MyPojo(13, "Foobar", new int[] { 1, 2, 3 } );
  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  ObjectOutputStream obs = new ObjectOutputStream(bytes);
  obs.writeObject(input);
  obs.close();
  byte[] ser = bytes.toByteArray();

  // and to get it back:
  ObjectInputStream ins = new ObjectInputStream(new ByteArrayInputStream(ser));
  MyPojo output = (MyPojo) ins.readObject();
  ins.close();

And that's it.

4. So what's the benefit?

At this point, you may be wondering if and how this would actually help you. Since JDK serialization uses a binary format, and since (allegedly!) textual formats are generally more verbose than binary formats, how could this possibly help with size or performance?

Turns out that if you test out code above and compare it with the case where class does NOT implement Externalizable, sizes are:

  • Default JDK serialization: 186 bytes
  • Serialization as embedded JSON: 130 bytes

Whoa! Quite an unexpected result? The JSON-based alternative is 30% SMALLER than JDK serialization!

Actually, not really. The problem with JDK serialization is not the way data is stored, but rather the fact that in addition to (compact) data, much of Class definition metadata is included. This metadata is needed to guard against Class incompatibilities (which it can do pretty well), but it comes with a cost. And that cost is particularly high for small data.
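If you want to see that metadata for yourself, a sketch along these lines should do, using only the JDK. JdkSizeDemo and PlainPojo are hypothetical names; PlainPojo has the same shape as MyPojo but relies on default serialization. The exact byte count depends on the class and package names, so it will not necessarily match the 186 bytes quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

final class JdkSizeDemo {
    // Same shape as MyPojo, but relying on default JDK serialization
    static final class PlainPojo implements Serializable {
        private static final long serialVersionUID = 1L;
        int id = 13;
        String name = "Foobar";
        int[] values = { 1, 2, 3 };
    }

    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] ser = serialize(new PlainPojo());
        // Raw payload: one int, six characters, three ints
        int payload = 4 + 6 + 3 * 4;
        System.out.println("serialized: " + ser.length + " bytes, payload: " + payload + " bytes");
        // Class and field names are written into the stream as text
        String asText = new String(ser, StandardCharsets.ISO_8859_1);
        System.out.println("contains class name: " + asText.contains("PlainPojo"));
        System.out.println("contains field name: " + asText.contains("values"));
    }
}
```

The actual data is only 22 bytes; everything beyond that is stream header, class descriptors and field names.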

Similarly, performance typically follows data size: while I don't have publishable results (I may do that for a future post), I expect embedded-JSON to also perform significantly better for single-object serialization use cases.

5. Further ideas: Smile!

But perhaps you think we should be able to do better, size-wise (and perhaps performance) than using JSON?

Absolutely. Since the results are not exactly readable (to use Externalizable, bit of binary data will be used to indicate class name, and little bit of stream metadata), we probably do not greatly care what the actual underlying format is.
With this, an obvious choice would be to use Smile data format, binary counterpart to JSON, a format that Jackson supports 100% with Smile Module.

The only change that is needed is to replace the first line of "MapperHolder" with:

private final ObjectMapper mapper = new ObjectMapper(new SmileFactory());

and we will see even reduced size, as well as faster reading and writing -- Smile is typically 30-40% smaller in size, and 30-50% faster to process than JSON.

6. Even More compact? Consider Jackson 2.1, "POJO as array!"

But wait! In the very near future, we may be able to do EVEN BETTER! Jackson 2.1 (see the Sneak Peek) will introduce one interesting feature that will further reduce the size of JSON/Smile Object serialization. By using the following annotation:

@JsonFormat(shape=JsonFormat.Shape.ARRAY)

you can further reduce the size: this occurs as the property names are excluded from serialization (think of output similar to CSV, just using JSON Arrays).

For our toy use case, size is reduced further from 130 bytes to 109: a further reduction of about 16%. But wait! It gets better -- the same will be true for Smile as well, since while it can reduce space in general, it still normally has to retain some amount of name information; but with POJO-as-Arrays it will use the same exclusion!
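The 21-byte difference is easy to check by hand. The sketch below does not call Jackson at all: the two strings are hand-written, assuming compact output and properties in declaration order (id, name, values), and PojoAsArraySizes is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

final class PojoAsArraySizes {
    // Hand-written JSON for MyPojo(13, "Foobar", {1, 2, 3}); not produced by Jackson here
    static final String AS_OBJECT = "{\"id\":13,\"name\":\"Foobar\",\"values\":[1,2,3]}";
    static final String AS_ARRAY = "[13,\"Foobar\",[1,2,3]]";

    static int utf8Length(String json) {
        return json.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int objectSize = utf8Length(AS_OBJECT); // 42
        int arraySize = utf8Length(AS_ARRAY);   // 21
        // prints "42 - 21 = 21", matching the drop from 130 to 109 bytes
        System.out.println(objectSize + " - " + arraySize + " = " + (objectSize - arraySize));
    }
}
```

The saving is exactly the three quoted property names plus their colons. The flip side is that the reader now relies on property order, so reordering fields changes the format.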

7. But how about actual real-life results?

At this point I am actually planning on doing something based on code I showed above. But planning is in early stages so I do not yet have results from "real data"; meaning objects of more realistic sizes. But I hope to get that soon: the use case is that of storing entities (data for which is read from DB) in memcache. Existing system is getting CPU-bound both from basic serialization/deserialization activity, but especially from higher number of GCs. I fully expect the new approach to help with this; and most importantly, be quite easy to deploy: this because I do not have to change any of code that actually serializes/deserializes Beans -- I just have to modify Beans themselves a bit.

Forcing escaping of HTML characters (less-than, ampersand) in JSON using Jackson

1. The problem

Jackson handles escaping of JSON String values in minimal way using escaping where absolutely necessary: it escapes two characters by default -- double quotes and backslash -- as well as non-visible control characters. But it does not escape other characters, since this is not required for producing valid JSON documents.

There are systems, however, that may run into problems with some characters that are valid in JSON documents. There are also use cases where you might prefer to add more escaping. For example, if you are to enclose a JSON fragment in an XML attribute (or Javascript code), you might want to use apostrophe (') as the quote character in XML, and force escaping of all apostrophes in JSON content; this allows you to simply embed the encoded JSON value without other transformations.

Another specific use case is that of escaping "HTML funny characters", like less-than, greater-than, ampersand and apostrophe characters (double quotes are escaped by default).

Let's see how you can do that with Jackson.

2. Not as easy to change as you might think

Your first thought may be that of "I'll just do it myself". The problem is two-fold:

  1. When using API via data-binding, or regular Streaming generator, you must pass unescaped String, and it will get escaped using Jackson's escaping mechanism -- you can not pre-process it (*)
  2. If you decide to post-process content after JSON gets written, you need to be careful with replacements, and this will have negative impact on performance (i.e. it is likely to double time serialization takes)

(*) actually, there is method 'JsonGenerator.writeRaw(...)' which you can use to force exact details, but its use is cumbersome and you can easily break things if you are not careful. Plus it is only applicable via Streaming API

3. Jackson (1.8) has you covered

Luckily, there is no need for you to write custom post-processing code to change details of content escaping.

Version 1.8 of Jackson added a feature to let users customize details of escaping of characters in JSON String values.
This is done by defining a CharacterEscapes object to be used by JsonGenerator; it is registered on JsonFactory. If you use data-binding, you can set this by using ObjectMapper.getJsonFactory() first, then define CharacterEscapes to use.

Functionality is handled at low-level, during writing of JSON String values; and CharacterEscapes abstract class is designed in a way to minimize performance overhead.
While there is some performance overhead (little bit of additional processing is required), it should not have significant impact unless significant portion of content requires escaping.
As usual, if you care a lot about performance, you may want to measure impact of the change with test data.

4. The Code

Here is a way to force escaping of HTML "funny characters", using functionality that Jackson 1.8 (and above) has.


import org.codehaus.jackson.SerializableString;
import org.codehaus.jackson.io.CharacterEscapes;

// First, definition of what to escape
public class HTMLCharacterEscapes extends CharacterEscapes
{
    private final int[] asciiEscapes;

    public HTMLCharacterEscapes()
    {
        // start with set of characters known to require escaping (double-quote, backslash etc)
        int[] esc = CharacterEscapes.standardAsciiEscapesForJSON();
        // and force escaping of a few others:
        esc['<'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['>'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['&'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['\''] = CharacterEscapes.ESCAPE_STANDARD;
        asciiEscapes = esc;
    }

    // this method gets called for character codes 0 - 127
    @Override public int[] getEscapeCodesForAscii() {
        return asciiEscapes;
    }

    // and this for others; we don't need anything special here
    @Override public SerializableString getEscapeSequence(int ch) {
        // no further escaping (beyond ASCII chars) needed:
        return null;
    }
}

// and then an example of how to apply it
public ObjectMapper getEscapingMapper() {
ObjectMapper mapper = new ObjectMapper();
mapper.getJsonFactory().setCharacterEscapes(new HTMLCharacterEscapes());
return mapper;
}

// so we could do:
public byte[] serializeWithEscapes(Object ob) throws IOException
{
return getEscapingMapper().writeValueAsBytes(ob);
}
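
To give an idea of what the resulting String values look like, here is a hand-rolled stand-in that does not use Jackson at all. It assumes that ESCAPE_STANDARD falls back to a six-character \uXXXX sequence with upper-case hex digits for these four characters (none of them has a short escape in JSON); HtmlEscapeIllustration and escapeHtmlChars are hypothetical names.

```java
final class HtmlEscapeIllustration {
    // Hand-rolled stand-in for the effect of ESCAPE_STANDARD on the four extra characters;
    // it does NOT handle quotes, backslashes or control characters like Jackson would
    static String escapeHtmlChars(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '<':
                case '>':
                case '&':
                case '\'':
                    // six-character Unicode escape, e.g. backslash-u-0-0-3-C for '<'
                    sb.append(String.format("\\u%04X", (int) c));
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtmlChars("<b>Tom & Jerry's</b>"));
    }
}
```

Running this prints \u003Cb\u003ETom \u0026 Jerry\u0027s\u003C/b\u003E, which any JSON parser will decode back to the original text.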


And that's it.



About me

  • I am known as Cowtowncoder