Saturday, August 18, 2012

Replacing standard JDK serialization using Jackson (JSON/Smile), java.io.Externalizable

1. Background

The default Java serialization provided by the JDK is a double-edged sword: on one hand, it is a simple, convenient way to "freeze and thaw" the Objects you have, handling just about any kind of Java object graph. It is possibly the most powerful serialization mechanism on the Java platform, bar none.

But on the other hand, its shortcomings are well-documented (and, I hope, well-known) at this point. Problems include:

  • Poor space-efficiency (especially for small data), due to inclusion of all class metadata: that is, the size of output can be huge, larger than just about any alternative, including XML
  • Poor performance (especially for small data), partly due to size inefficiency
  • Brittleness: the smallest changes to class definitions may break compatibility, preventing deserialization. This makes it a poor choice both for data exchange between (Java) systems and for long-term storage

Still, the convenience factor has led many systems to use JDK serialization as their default serialization method.

Is there anything we could do to address the downsides listed above? Plenty, actually. Although there is no way to do much more for the default implementation (the JDK serialization implementation is in fact ridiculously well optimized for what it tries to achieve -- it's just that the goal is very ambitious), one can customize what gets used by making objects implement the java.io.Externalizable interface. If they do, the JDK will happily use the alternate implementation under the hood.

Now: although writing custom serializers may be fun sometimes -- and for a specific case you can actually write a very efficient solution, given enough time -- it would be nice if you could use an existing component to address the listed shortcomings.

And that's what we'll do! Here's one possible way to improve on all problems listed above:

  1. Use an efficient Jackson serializer (to produce either JSON, or perhaps more interestingly, Smile binary data)
  2. Wrap it in a nice java.io.Externalizable, to make it transparent to code using JDK serialization (albeit not transparent for maintainers of the class -- but we will try to minimize the amount of intrusive code)

2. Challenges with java.io.Externalizable

First things first: while conceptually simple, there are a couple of rather odd design decisions that make use of java.io.Externalizable a bit tricky:

  1. Instead of passing instances of java.io.InputStream and java.io.OutputStream, java.io.ObjectInput and java.io.ObjectOutput are used; and these do NOT extend the stream versions (even though they define mostly the same methods!). This means additional wrapping is needed
  2. Externalizable.readExternal() requires updating the object itself, rather than constructing a new instance: most serialization frameworks do not support such an operation
  3. How do we access the external serialization library, given that no context is passed to either method?

These are not fundamental problems for Jackson: the first one just requires adapter classes (see below); the second means we need to use the "updating reader" approach that Jackson has supported for a while (yay!). And to solve the third part, we have at least two choices: use of a ThreadLocal for passing an ObjectMapper; or use of a static helper class (the approach shown below).

So here are the helper classes we need:

final static class ExternalizableInput extends InputStream
{
  private final ObjectInput in;

  public ExternalizableInput(ObjectInput in) {
   this.in = in;
  }

  @Override
  public int available() throws IOException {
    return in.available();
  }

  @Override
  public void close() throws IOException {
    in.close();
  }

  @Override
  public boolean  markSupported() {
    return false;
  }

  @Override
  public int read() throws IOException {
   return in.read();
  }

  @Override
  public int read(byte[] buffer) throws IOException {
    return in.read(buffer);
  }

  @Override
  public int read(byte[] buffer, int offset, int len) throws IOException {
    return in.read(buffer, offset, len);
  }

  @Override
  public long skip(long n) throws IOException {
   return in.skip(n);
  }
}

final static class ExternalizableOutput extends OutputStream
{
  private final ObjectOutput out;

  public ExternalizableOutput(ObjectOutput out) {
    this.out = out;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }

  @Override
  public void write(int ch) throws IOException {
    out.write(ch);
  }

  @Override
  public void write(byte[] data) throws IOException {
    out.write(data);
  }

  @Override
  public void write(byte[] data, int offset, int len) throws IOException {
    out.write(data, offset, len);
  }
}

/* Use of helper class here is unfortunate, but necessary; alternative would
* be to use ThreadLocal, and set instance before calling serialization.
* Benefit of that approach would be dynamic configuration; however, this
* approach is easier to demonstrate.
*/
class MapperHolder {
  private final ObjectMapper mapper = new ObjectMapper();
  private final static MapperHolder instance = new MapperHolder();

  public static ObjectMapper mapper() { return instance.mapper; }
}

And given these classes, we can implement the Jackson-for-default-serialization solution.
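(As an aside, the ThreadLocal alternative mentioned earlier might look roughly like the sketch below; this is not code from this post, and the class name is made up for illustration.)

class MapperContext {
  // caller installs a (possibly differently configured) ObjectMapper before triggering
  // JDK serialization; Externalizable methods then fetch it from the calling thread
  private final static ThreadLocal<ObjectMapper> CURRENT = new ThreadLocal<ObjectMapper>();

  public static void set(ObjectMapper mapper) { CURRENT.set(mapper); }

  public static ObjectMapper mapper() {
    ObjectMapper m = CURRENT.get();
    if (m == null) { // fall back to a shared default if nothing was set
      m = MapperHolder.mapper();
    }
    return m;
  }

  public static void clear() { CURRENT.remove(); }
}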

3. Let's Do a Serialization!

So with that, here's a class that is serializable using Jackson JSON serializer:


  static class MyPojo implements Externalizable
  {
        public int id;
        public String name;
        public int[] values;

        public MyPojo() { } // for deserialization
        public MyPojo(int id, String name, int[] values)
        {
            this.id = id;
            this.name = name;
            this.values = values;
        }

        public void readExternal(ObjectInput in) throws IOException {
            MapperHolder.mapper().readerForUpdating(this).readValue(new ExternalizableInput(in));
        }

        public void writeExternal(ObjectOutput oo) throws IOException {
            MapperHolder.mapper().writeValue(new ExternalizableOutput(oo), this);
        }
    }

To use that class, use JDK serialization normally:


  // serialize as bytes (to demonstrate):
  MyPojo input = new MyPojo(13, "Foobar", new int[] { 1, 2, 3 });
  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  ObjectOutputStream obs = new ObjectOutputStream(bytes);
  obs.writeObject(input);
  obs.close();
  byte[] ser = bytes.toByteArray();

  // and to get it back:
  ObjectInputStream ins = new ObjectInputStream(new ByteArrayInputStream(ser));
  MyPojo output = (MyPojo) ins.readObject();
  ins.close();

And that's it.

4. So what's the benefit?

At this point, you may be wondering if and how this would actually help you. JDK serialization uses a binary format, and (allegedly!) textual formats are generally more verbose than binary formats -- so how could this possibly help with size or performance?

Turns out that if you test the code above and compare it with the case where the class does NOT implement Externalizable, the sizes are:

  • Default JDK serialization: 186 bytes
  • Serialization as embedded JSON: 130 bytes

Whoa! Quite an unexpected result? The JSON-based alternative is 30% SMALLER than JDK serialization!

Actually, not really. The problem with JDK serialization is not the way data is stored, but rather the fact that in addition to the (compact) data, a lot of class definition metadata is included. This metadata is needed to guard against class incompatibilities (which it can do pretty well), but it comes with a cost. And that cost is particularly high for small data.

Similarly, performance typically follows data size: while I don't have publishable results (I may do that for a future post), I expect embedded-JSON to also perform significantly better for single-object serialization use cases.

5. Further ideas: Smile!

But perhaps you think we should be able to do better, size-wise (and perhaps performance) than using JSON?

Absolutely. Since the results are not exactly human-readable anyway (to use Externalizable, a bit of binary data is used to indicate the class name, plus a little stream metadata), we probably do not greatly care what the actual underlying format is.
With this in mind, an obvious choice would be to use the Smile data format, the binary counterpart to JSON, a format that Jackson supports 100% via the Smile module.

The only change needed is to modify the first line of "MapperHolder" to read:

private final ObjectMapper mapper = new ObjectMapper(new SmileFactory());

and we will see even smaller sizes, as well as faster reading and writing -- Smile is typically 30-40% smaller in size, and 30-50% faster to process, than JSON.
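For completeness, the revised holder might look like the following (just a sketch; it assumes the Jackson 2.x Smile module, which provides com.fasterxml.jackson.dataformat.smile.SmileFactory, is on the classpath):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

class MapperHolder {
  // same singleton holder as before, just constructing the mapper with a SmileFactory
  private final ObjectMapper mapper = new ObjectMapper(new SmileFactory());
  private final static MapperHolder instance = new MapperHolder();

  public static ObjectMapper mapper() { return instance.mapper; }
}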

6. Even More compact? Consider Jackson 2.1, "POJO as array!"

But wait! In the very near future we may be able to do EVEN BETTER! Jackson 2.1 (see the Sneak Peek) will introduce one interesting feature that further reduces the size of JSON/Smile object serialization. By using the following annotation:

@JsonFormat(shape=JsonFormat.Shape.ARRAY)

you can further reduce the size: this works because property names are then excluded from serialization (think of output similar to CSV, just using JSON arrays).

For our toy use case, size is reduced further, from 130 bytes to 109 -- a further reduction of almost 20%. But wait! It gets better -- the same holds for Smile as well: while Smile can reduce space in general, it normally still has to retain some amount of name information; with POJO-as-array it benefits from the same exclusion.
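Applied to the earlier example, this would look roughly like the sketch below (assuming Jackson 2.1 annotations; the explicit property ordering is my addition, to make the array layout deterministic):

@JsonFormat(shape=JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "id", "name", "values" })
static class MyPojo implements Externalizable
{
    public int id;
    public String name;
    public int[] values;

    public MyPojo() { } // for deserialization
    public MyPojo(int id, String name, int[] values) {
        this.id = id;
        this.name = name;
        this.values = values;
    }

    public void readExternal(ObjectInput in) throws IOException {
        MapperHolder.mapper().readerForUpdating(this).readValue(new ExternalizableInput(in));
    }

    public void writeExternal(ObjectOutput oo) throws IOException {
        MapperHolder.mapper().writeValue(new ExternalizableOutput(oo), this);
    }
}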

7. But how about actual real-life results?

At this point I am actually planning on doing something based on the code shown above. But planning is in its early stages, so I do not yet have results from "real" data, meaning objects of more realistic sizes. I hope to get that soon: the use case is storing entities (whose data is read from a DB) in memcache. The existing system is getting CPU-bound, both from basic serialization/deserialization activity and especially from the higher number of GCs. I fully expect the new approach to help with this and, most importantly, to be quite easy to deploy: I do not have to change any of the code that actually serializes/deserializes the beans -- I just have to modify the beans themselves a bit.

Thursday, May 24, 2012

Doing actual non-blocking, incremental HTTP access with async-http-client

The async-http-client library, originally developed at Ning (by Jean-Francois, Tom, Brian and maybe others, and maintained since then by quite a few contributors), has been around for a while now.
Its main selling point is the claim of better scalability compared to alternatives like Jakarta HTTP Client (that is not the only selling point: its API also seems more intuitive).

But although the library itself is capable of working well in non-blocking mode, most examples (and probably most users) use it in plain old blocking mode; or at most use a Future to simply defer handling of responses, without handling content incrementally as it becomes available.

While this lack of documentation is a bit unfortunate in itself, the bigger problem is that usage patterned after the sample code requires reading the whole response into memory.
This may not be a big deal for small responses, but when the response size is in megabytes, it often becomes problematic.

1. Blocking, fully in-memory usage

The usual (and potentially problematic) usage pattern is something like:

  AsyncHttpClient asyncHttpClient = new AsyncHttpClient();
  Future<Response> f = asyncHttpClient.prepareGet("http://www.ning.com/").execute();
  Response r = f.get();
  byte[] contents = r.getResponseBodyAsBytes();

which gets the whole response as a byte array; no surprises there.

2. Use InputStream to avoid buffering the whole entity?

The first obvious workaround attempt is to have a look at the Response object, and notice that there is a method "getResponseBodyAsStream()". This would seemingly allow one to read the response piece by piece and process it incrementally, by (for example) writing it to a file.

Unfortunately, this method is just a facade, implemented like so:

 public InputStream getResponseBodyAsStream() {
   return new ByteArrayInputStream(getResponseBodyAsBytes());
 }

which actually is no more efficient than accessing the whole content as a byte array. :-/

(Why is it implemented that way? Mostly because the underlying non-blocking I/O library, such as Netty or Grizzly, provides content using a "push" style interface, which makes it very hard to support "pull" style abstractions like java.io.InputStream -- so it is not really AHC's fault, but rather a consequence of the NIO/async style of I/O processing.)

3. Go fully async

So what can we do to actually process large response payloads (or large PUT/POST request payloads, for that matter)?

To do that, it is necessary to use the following callback abstractions:

  1. To handle response payloads (for HTTP GETs), we need to implement the AsyncHandler interface (or extend a helper base class such as AsyncCompletionHandler).
  2. To handle PUT/POST request payloads, we need to implement BodyGenerator (which is used for creating a Body instance, the abstraction for feeding content)

Let's have a look at what is needed for the first case.

(note: there are existing default implementations for some of the pieces -- but here I will show how to do it from ground up)

4. A simple download-a-file example

Let's start with the simple case of downloading a large piece of content into a file, without keeping more than a small chunk in memory at any given time. This can be done as follows:


public class SimpleFileHandler implements AsyncHandler<File>
{
 private File file;
 private final FileOutputStream out;
 private boolean failed = false;

 public SimpleFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   part.writeTo(out);
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }
}

Voila. The code is not very brief (event-based code seldom is), and it could use some more handling for error cases.
But it should at least show the general processing flow -- nothing very complicated there, beyond basic state-machine style operation.
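For instance, a download call using this handler could look roughly like the following sketch (the URL and file name are made up for illustration):

  AsyncHttpClient ahc = new AsyncHttpClient();
  // pass the handler to execute(); the returned Future resolves to whatever onCompleted() returns
  ListenableFuture<File> future = ahc.prepareGet("http://example.com/some/large-file.bin")
      .execute(new SimpleFileHandler(new File("large-file.bin")));
  File result = future.get(); // this get() blocks, but the download itself streamed straight to disk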

5. Booooring. Anything more complicated?

Downloading a large file is useful, and while not a contrived example, it is rather plain. So let's consider the case where we not only want to download a piece of content, but also want to uncompress it, in one fell swoop. This serves as an example of additional processing we may want to do in incremental/streaming fashion -- as an alternative to having to store an intermediate copy in a file, then uncompress it to another file.

Before showing the code, however, it is necessary to explain why this is a bit tricky.

First, remember that we can't really use InputStream-based processing here: all content we get is "pushed" to us (without our code ever blocking on input); whereas an InputStream would require us to pull content from it, possibly blocking the thread.

Second: most decompressors expose either an InputStream-based abstraction or an uncompress-the-whole-thing interface. Neither works for us, since we are getting incremental chunks; to use either, we would first have to buffer the whole content -- which is exactly what we are trying to avoid.

As luck would have it, however, the Ning Compress package (version 0.9.4, specifically) just happens to have a push-style uncompressor interface (aptly named "com.ning.compress.Uncompressor"), with two implementations:

  1. com.ning.compress.lzf.LZFUncompressor
  2. com.ning.compress.gzip.GZIPUncompressor (uses JDK native zlib under the hood)

So why is that fortunate? Because the interface they expose is push style:

 public abstract class Uncompressor
 {
  public abstract void feedCompressedData(byte[] comp, int offset, int len) throws IOException;
  public abstract void complete() throws IOException;
}

and is thereby usable for our needs here, especially when combined with an additional class called "UncompressorOutputStream", which wraps an Uncompressor as an OutputStream (needed for efficient access to the content AHC exposes via HttpResponseBodyPart).

6. Show me the code

Here goes:


public class UncompressingFileHandler implements AsyncHandler<File>,
   DataHandler // for Uncompressor
{
 private File file;
 private final OutputStream out;
 private boolean failed = false;
 // assigned if and when we learn that content is compressed, hence not final
 private UncompressorOutputStream uncompressingStream;

 public UncompressingFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   // if compressed, pass through uncompressing stream
   if (uncompressingStream != null) {
    part.writeTo(uncompressingStream);
   } else { // otherwise write directly
    part.writeTo(out);
   }
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  // close the uncompressing stream first, so it can flush remaining output
  if (uncompressingStream != null) {
   uncompressingStream.close();
  }
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // must verify that we are getting compressed stuff here:
  String compression = h.getHeaders().getFirstValue("Content-Encoding");
  if (compression != null) {
   if ("lzf".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new LZFUncompressor(this));
   } else if ("gzip".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new GZIPUncompressor(this));
   }
  }
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }

 // DataHandler implementation for Uncompressor; called with uncompressed content:
 public void handleData(byte[] buffer, int offset, int len) throws IOException {
  out.write(buffer, offset, len);
 }
}

Handling gets a bit more complicated here, since we have to handle both the case where content is compressed and the case where it is not (the server is ultimately responsible for deciding whether to apply compression).

And to make the call, you also need to indicate the capability to accept compressed data. For example, we could define a helper method like:


public File download(String url) throws Exception
{
 AsyncHttpClient ahc = new AsyncHttpClient();
 Request req = ahc.prepareGet(url)
  .addHeader("Accept-Encoding", "lzf,gzip")
  .build();
 ListenableFuture<File> futurama = ahc.executeRequest(req,
     new UncompressingFileHandler(new File("download.txt")));
 try {
  // wait for up to 30 seconds for the download to complete
  return futurama.get(30, TimeUnit.SECONDS);
 } catch (TimeoutException e) {
  throw new IOException("Failed to download due to timeout");
 }
}

which would use handler defined above.

7. Easy enough?

I hope the above shows that while doing incremental, "streaming" processing is a bit more work, it is not super difficult to do.

Not even when you have a bit of pipelining to do, like uncompressing (or compressing) data on the fly.

Friday, April 06, 2012

Take your JSON processing to Mach 3 with Jackson 2.0, Afterburner

(this is part of the on-going "Jackson 2.0" series, starting with "Jackson 2.0 released")

1. Performance overhead of databinding

When using the automatic data-binding Jackson offers, there is some amount of overhead compared to manually writing equivalent code that uses the Jackson streaming/incremental parser and generator. But how much overhead is there? The answer depends on multiple factors, including exactly how good your hand-written code is (there are a few non-obvious ways to optimize things, compared to data-binding, where there is little performance-related configurability).

But looking at benchmarks such as jvm-serializers, one could estimate that it may take anywhere between 35% and 50% more time to serialize and deserialize POJOs, compared to a highly tuned hand-written alternative. This is usually not enough to matter a lot, considering that JSON processing overhead is typically only a small portion of all processing done.

2. Where does the overhead come from?

There are multiple things that automatic data-binding has to do that hand-written alternatives do not. But at a high level, there are really two main areas:

  1. Configurability to produce/consume alternative representations; code that has to support multiple ways of doing things can not be as aggressively optimized by JVM and may need to keep more state around.
  2. Data access to POJOs is done dynamically using Reflection, instead of directly accessing field values or calling setters/getters

While there isn't much that can be done about the former in a general sense (especially since configurability and convenience are major reasons for the popularity of data-binding), the latter overhead is something that could theoretically be eliminated.

How? By generating bytecode that does direct access to fields and calls to getters/setters (as well as for constructing new instances).

3. Project Afterburner

And this is where Project Afterburner comes in. What it does is really as simple as generating bytecode, dynamically, to mostly eliminate Reflection overhead. The implementation uses the well-known lightweight bytecode library ASM.

Bytecode is generated to:

  1. Replace "Class.newInstance()" calls with an equivalent call to the zero-argument constructor (currently the same is not done for multi-argument Creator methods)
  2. Replace Reflection-based field access (Field.set() / Field.get()) with equivalent direct field dereferencing
  3. Replace Reflection-based method calls (Method.invoke(...)) with equivalent direct calls
  4. For a small subset of simple types (int, long, String, boolean), further streamline handling of serializers/deserializers to avoid auto-boxing

It is worth noting that there are certain limitations to such access: for example, unlike with Reflection, it is not possible to avoid visibility checks, which means that access to private fields and methods must still be done using Reflection.

4. Engage the Afterburner!

Using Afterburner is about as easy as it can be: you just create and register a module, and then use databinding as usual:


ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new AfterburnerModule());
String json = mapper.writeValueAsString(value);
Value result = mapper.readValue(json, Value.class);

Absolutely nothing special there (note: for the Maven dependency and downloads, see the project page).

5. How much faster?

Earlier I mentioned that Reflection is just one of the overhead areas. In addition to general complexity from configurability, there are cases where general data-binding has to use loops whereas manual code can use linear (unrolled) constructs. Given this, how much overhead remains after enabling Afterburner?

As per jvm-serializers, more than 50% of the speed difference between data-binding and the manual variant is eliminated. That is, data-binding with Afterburner is closer to the manual variant than to "vanilla" data-binding. There is still something like 20-25% additional time spent compared to the most highly optimized cases, but the results are definitely closer to optimal.

Given that all you really have to do is to just add the module, register it, and see what happens, it just might make sense to take Afterburner for a test ride.

6. Disclaimer

While Afterburner has been used by a few Jackson users, it is still not very widely used -- after all, while it has been available in some form since 1.8, it has not been advertised to users. This article can be considered an announcement of sorts.

Because of this, there may be rough edges; and if you are unlucky you might run into one of two possible problems:

  • Get no performance improvement (which is likely due to Afterburner not covering some specific code path(s)), or
  • Get a bytecode verification problem when a serializer/deserializer is being loaded

The latter case is obviously nastier. But on the plus side, it should be obvious right away (and NOT after running for an hour); nor should there be a way for it to cause data loss or corruption, since JVMs are rather good at verifying bytecode when trying to load it.

Monday, April 04, 2011

Introducing "jvm-compressor-benchmark" project

I recently started a new open source project, this time inspired by the success of another OS project I have been involved in: the "jvm-serializers" benchmark, originally started by Eishay and built by a community of Java databinder/serializer experts. What has been great about that project is the amount of energy it seemed to feed back into the development of serializers: highly visible competition for best performance seems to have improved the efficiency of libraries a lot. I only wish we had historical benchmark data to compare, to see exactly how far the fastest Java serializers have come.

Anyway, I figured that there are other groups of libraries where high performance matters, but where solid benchmarking information is lacking. While there are a few compression performance benchmarks, they are often not applicable for Java developers: partly because they only compare native compressor codecs, and partly because the focus is usually only on space-efficiency (how much compression is achieved), with little consideration for the speed of compression. The last part is particularly frustrating, as in many use cases there is a significant trade-off between space and time efficiency (compression rate vs time used for compression).

So, this is where the new project -- "jvm-compressor-benchmark" -- comes from. I hope it will allow fair comparison of compression codecs available on the JVM, usable from Java and other JVM languages; and also bring some friendly competition between developers of compression codecs.

The first version compares half a dozen compression formats and codecs, from the venerable deflate/gzip (which offers a pretty good compression ratio with decent speed) to higher-compression-but-slower alternatives (bzip2) and lower-compression-but-very-fast alternatives like lzf, quicklz and the new kid on the block, Snappy (via JNI).

And although the best way to evaluate results is to run the tests on your own machine, using data sets you care about (which I strongly encourage!), the project wiki does have some pretty diagrams for tests run on "standard" data sets gathered from the web.

Anyway: please check the project out -- at the very least it should give you an idea of how many options there are above and beyond basic JDK-provided gzip.

ps. Contributions are obviously also welcome -- anyone willing to tackle Java version of 7-zip's LZMA, for example, would be most welcome!

Tuesday, January 04, 2011

Annual Update on State of Java JSON data binding performance

Yes, 'tis the season for performance measurements again: I last covered this subject about a year ago with "JSON data binding performance (again!)".
During the past 12 months many of the tested libraries have released new versions, some with high hopes for performance improvements (Gson 1.6, for example, was rumored to have a faster streaming parser). So it seems prudent to have another look at how performant the currently available Java JSON data binding libraries are.

1. Libraries tested

Libraries tested are the same as last time; with versions:

  1. Jackson 1.6 (was 1.2)
  2. Gson 1.6 (was 1.4)
  3. Json-tools (core, 1.7) (no change)
  4. Flex-json 2.1 (was 1.9.1)

(for what it's worth, I was also hoping to include tests for "json-marshaller", but the lack of documentation, coupled with the seeming inability to parse directly from a stream or even a String, suggested that it is not yet mature enough to be included)

2. Test system

I switched over to the light side at home (replaced my old Athlon/Linux box with a small spunky Mac Mini, yay!), so the test box has a 2.53 GHz Intel Core 2 Duo CPU. This is a somewhat faster machine; it seems to be about 4x as fast for this particular task. The test is single-threaded, so it would be possible to roughly double the throughput with a different Japex test setup; however, test threads are CPU-bound and independent, so there seems to be little point in adding more concurrency.

Test code is available from the Woodstox SVN repository (under the "staxbind" folder) and runs on Japex. Converters for the libraries are rather simple; data consists of medium-sized (20k) documents, for anyone interested in replicating the results.

3. Pretty Graphs

Ok, here is the main graph:

Data Binding Performance Graph

and for more data, check out the full results.

4. "But what does it MEAN?"

As with previous tests, the upper part of the double graph just indicates the amount of data read and/or written (which is identical for the first three, but flex-json seems to insist on including some additional type metadata) and can be ignored; the lower graph indicates throughput (a higher bar means faster processing) and is the interesting part.

There are three tests: read JSON (into data object(s)), write JSON (from data object(s)), and read-then-write, which combines the two operations. I use the last result, since it gives a reasonable approximation of common use with web services, where a request is read, some processing is done, and a response is written.

From the graph it looks like results have not changed a lot; here are the revised numbers, using the slowest implementation (Gson) as the baseline:

  Impl                  Read (TPS)   Write (TPS)   Read+Write (TPS)   R+W, times baseline
  Jackson (automatic)     7240.202      9161.873           4023.464                 14.75
  FlexJson                 721.743      1402.848            462.594                  1.69
  Json-tools               524.119      1007.068            341.123                  1.25
  GSON                     714.106       462.935            272.637                  1

5. Thoughts?

Not much has changed; Jackson is still an order of magnitude faster than the rest, and relative ranking of the other libraries hasn't changed either.

Friday, October 29, 2010

Faster compression on Java platform with LZF

One open source project that I have worked on a bit lately is compress-lzf, a simple Java library that implements a block- and stream-accessible version of the LZF compression algorithm. The LZF algorithm itself is very simple, and mostly (only?) documented here, in the form of a working reference implementation that doubles as documentation.

Our implementation is a simple rewrite of the C version, fully compatible with the reference implementation including the file header and block signatures (this is different from the other existing Java version that projects like H2 include). While a simple implementation, it is not a naive translation; some effort has been spent on optimizing its performance.

And as should be the case for such low-level libraries, there are no external dependencies beyond JDK 1.5, meaning it should work on a variety of platforms, including Android (in fact, it could probably be compiled on any 1.x JDK).

1. What is LZF?

The LZF compression algorithm is basic Lempel-Ziv with little additional processing, so one way to think of it is as "gzip without Huffman encoding" (gzip uses Deflate, which is Lempel-Ziv followed by Huffman encoding with variable bit-lengths).
And Lempel-Ziv is just a method whereby one finds repeating sub-sequences in the input, replacing sequences of three or more "already seen" bytes with a back-reference; candidate matches are typically tracked in a hash table indexed by a hash of the first 3 bytes, with the value being the offset of the last instance of a potentially matching sequence.

It is worth noting that many other similar (and sometimes similarly named) algorithms likewise use basic LZ; "FastLZ", for example, is very similar.

2. Gzip exists, why bother with LZF?

But why does this matter? If gzip is this and more, why would you want to use a "lesser" algorithm?

The key benefit is speed: LZF can be up to twice as fast to uncompress; but where it REALLY shines is compression speed -- here it can be 2x or 3x as fast (writing a stand-alone benchmark is on my todo list, so hopefully I will get to talk a bit more about this in the future). The performance difference is mostly due to the relatively high overhead that deflate incurs for non-byte-aligned output (which is needed for optimal Huffman encoding of codewords) -- while it definitely helps compression, it is expensive in terms of CPU usage.

The LZF compression rate is lower than gzip's, but the compression profile is similar: things that compress well with gzip also compress well with LZF, and those that compress poorly do so with LZF as well. This makes sense, given that most of the compression usually comes from the LZ algorithm, which is implemented similarly.

So it all comes down to this: gzip compression is very CPU-intensive (read: slow) -- so much so that it typically takes 5x as long to compress things like JSON data as it does to write the content itself. With fast networks and I/O, this often becomes the bottleneck.
LZF can be a good alternative, since its CPU overhead is much more modest, and it can still compress "fluffy" content pretty well -- where gzip can compress something by, say, 90%, LZF usually gets to a 75-80% ratio (this obviously varies by type of content; it basically depends on which part of the deflate algorithm is more effective).

3. Usage

Assuming the above sounds interesting, you can get the compress-lzf library either from GitHub or via Maven. In the latter case, the information is:

  • <groupId>com.ning</groupId>
  • <artifactId>compress-lzf</artifactId>
  • current version at time of this writing is "0.6.0"

And from there you can use classes from the com.ning.compress.lzf package:

  • LZF is a simple command-line utility you can use for manual compression/uncompression (jar defines this class as the Main-Class so you can just invoke it with "java -jar compress-lzf.jar" to get usage information)
  • LZFInputStream and LZFOutputStream are compressing input (read LZF compressed data) and output (write LZF compressed data) streams, similar to JDK's GZIPInputStream and GZIPOutputStream
  • LZFEncoder and LZFDecoder are block-based codecs that can be used to encode (compress) and decode (uncompress) chunks (byte[]) of data

Both streaming and block-based interfaces have their uses; for example, when storing BLOBs in a database, the block API works better, and when reading data over a network or from a filesystem, streams make more sense.
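To give a rough idea, basic usage of the two APIs might look like the following sketch (written against the 0.6 API as I recall it, so double-check method signatures against the version you actually use):

  // Block API: compress and uncompress a byte[] chunk
  byte[] raw = "some JSON or other fluffy content...".getBytes("UTF-8");
  byte[] compressed = LZFEncoder.encode(raw);
  byte[] restored = LZFDecoder.decode(compressed); // may throw IOException on corrupt input

  // Stream API: wrap output/input streams, much like GZIPOutputStream/GZIPInputStream
  OutputStream out = new LZFOutputStream(new FileOutputStream("data.lzf"));
  out.write(raw);
  out.close();
  InputStream in = new LZFInputStream(new FileInputStream("data.lzf"));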

3.1 Note: avoid LZFOutputStream.flush() ("just let it float...")

One note on LZFOutputStream: be careful with flush()ing -- just like other block compressors (including gzip), LZF can only make back-references within a single block; and since LZFOutputStream.flush() forces the current block to be closed, it can dramatically reduce the compression rate. As such, it is best to avoid flush()ing if at all possible, unless you actually want to create a logical boundary at some point.

4. Future plans for the library

While there are always improvements to make for any Open Source project, I think the natural scope of the LZF library is quite small. Similar to Java UUID Generator, it is quite possible that there won't be a need for major changes to the implementation or API, which means that future improvements are most likely in the area of integration: helping other projects use LZF.

For example, given the unique CPU/IO/bandwidth trade-off LZF can offer, it could make a really nice component for large-scale data transfer and processing, for example:

  • Data processing with map/reduce (Hadoop)
  • Data storage (with frameworks like jDBI, Hibernate; or with NoSQL projects like Voldemort, Cassandra)
  • Data aggregation (logging, log processing, Scribe)
  • Caching: use LZF with memcached?
  • Combination of high-performance data formats (Jackson Smile, binary JSON serialization) and compression -- can be combined with other use cases too.

So it will be interesting to see where things go in near future.

5. On benefits of working for a company that is comfortable with Open Source

Unlike with many earlier Open Source projects that I have worked on, much of the work that went into this library was done on company time. It also was a team effort (Jon wrote the streaming handling, I wrote the block codec), compared to most other open source projects where I have been the main author. So I have come a long way, having worked for a semi-OS-hostile company, then a bit for a neutral (agnostic?) company, and now working for one that is actually engaged with the community (Ning). And I must admit that I really like that Ning is openly committed to working with open source projects, not just using open source artifacts that others make. This is not solely meant as an ad for Ning (although we are indeed hiring...); I just think it is important to give credit where it is due.

Beyond this, the Ning section at GitHub already has some public projects, and there are literally a dozen more being cleaned up to be ready for publishing. Within just a few more months I think Ning will be more recognized for its contributions in the Open Source space, and eventually become a significant player. I see this as a major benefit, making it a bit less difficult to hire the best developers -- especially for a company that is still relatively small, and thus has less name recognition than the big boys.

Thursday, October 14, 2010

More on Java UUID Generator (JUG), a word on performance

Ok, so Java UUID Generator version 3 is out. Beyond the question of what is new with this version (which I outlined in the previous entry), the next question asked was "how is performance?".
My first thought was that it should be about the same as version 2.0; but after some more thinking I realized that this is not a very useful answer to give.

So I wrote a simple micro-benchmark (it's in 'src/main/java/perf/MeasurePerformance.java') to find out more about the performance aspects. The test uses the three main generation methods (random-, time- and name-based) for JUG, and the matching alternatives that java.util.UUID has (random-based and a limited name-based variant).

1. Performance of UUID generation, general observations

High-level summary would be:

  1. Equivalent methods in the JDK and JUG have similar performance (no surprise) -- the JDK just does not provide the time-based variant (the one most commonly used by non-Java systems), nor the ability to configure the random-based variant
  2. Performance between the three main generation types varies a lot.

2. Actual numbers

So how big are the differences? Here is an actual snapshot from running the test on my MacBook: each run generates 250,000 UUIDs, and the time taken (after the test has run for about a minute) is reported in milliseconds:

  Test 'JDK, random' -> 974 msecs
  Test 'JDK, name' -> 230 msecs
  Test 'Jug, SecureRandom' -> 971 msecs
  Test 'Jug, java.util.Random' -> 33 msecs
  Test 'Jug, time-based' -> 39 msecs 
  Test 'Jug, name-based' -> 256 msecs

Since these values fluctuate a bit, exact values are not as important as ratios. Random-based variants are the slowest for both implementations (except when using 'java.util.Random' -- more on this in a bit); name-hash methods (MD5 by default; SHA-1 also available with JUG) are a bit faster; and the time/location-based variant is rather fast, close to 50% of the maximum theoretical per-node generation rate.

So generation rates for this system would be:

  • Random-based method (using SecureRandom) generates about 250k UUIDs per second
  • Name-based variant (MD5) without namespace is almost 4 times as fast, generating about 1 million UUIDs per second
  • Time-based variant (Ethernet address plus timestamp) is much faster -- almost 20 times as fast as Random-based default variant -- generating about 5 million UUIDs per second. This is also close to theoretical maximum rate for the method, which is 10 million UUIDs per second (sustained), because timestamp granularity (100 nanoseconds) prevents higher throughput (to work around this, one can generate parallel generators using different Ethernet address)

Quite a difference! It is worth noting, too, that the speed of the name-based method depends both on the existence of a namespace (this test did not use one; adding one slows things down slightly) and on the length of the name used, the latter because the hash is calculated over the name. And finally, the hash method used also matters (SHA-1 is slower than MD5).

But what is up with the random variant? And didn't I forget one impressive number from my analysis above?

3. Why is Random-based generation so slow?

Constructing "secure" pseudo-random numbers is not easy, and this is the main reason for slowness. Looking at stack traces, it looks like JDK java.security.SecureRandom actually uses random device provided by the underlying operating system; and this is part of the slowness (due to use of JNI). But generation itself is fundamentally slow, even if no system calls were needed. It is quite possible that faster machines might not even speed up performance significantly.

4. Couldn't we just use 'java.util.Random' to speed things up?

But if the "secure" (i.e. "good", meaning one providing good level of randomness) variant is slow, couldn't we use a faster one? Like "java.util.Random"? In fact, didn't numbers above show using it resulting in version as fast as using time-based variant?

Yes, this is possible: JUG allows you to pass any java.util.Random instance you want (it just defaults to SecureRandom).
But speed is not everything: the main point of UUIDs is uniqueness, and this is why SecureRandom is used as the default by both JUG and the JDK. And while the default pseudo-random number generator that java.util.Random uses is fine for many purposes, I am not sure I would trust its simple algorithm for generating UUIDs, even if initialized using the system clock (System.currentTimeMillis()). The underlying algorithm just does not have any actual source of randomness beyond the initial seed value, and I do not trust that its random number distribution is good enough to give strong guarantees of uniqueness (I would love pointers to someone discussing actual measurements or research on this!).

So to me, downgrading to simple textbook pseudo-random algorithms does not seem tempting, even if it speeds things up by an order of magnitude.

5. So what would make more sense?

Well, looking at the numbers, how about considering the time-based variant? The name-based variant is not quite as useful -- not so much due to performance, but because it actually requires you to hand it a source of uniqueness (the name to hash).

6. Thoughts on relevance of above

I was a bit surprised by how big the difference actually is. But whether it really matters depends on the use case -- oftentimes you don't need even thousands of UUIDs per second (most services cannot really serve even one thousand requests per second), so a rate of a quarter million per second is just fine. After all, it only takes about four microseconds (on average) to generate one UUID.

But there are use cases where this matters more; especially if UUIDs are used liberally, for example using UUIDs as identifiers for all business objects generated during processing.
So if your use case truly depends on (or at least benefits from) fast generation of UUIDs, you probably want to use time-based generation method from Jug.

7. How did I do that again?

Ok; since I haven't yet gotten Javadocs set up, here is the way to generate time-based UUIDs.

  // need to pass an Ethernet address; can either use a real one (shown here)
  EthernetAddress nic = EthernetAddress.fromInterface();
  // or a bogus one, obtained with: EthernetAddress.constructMulticastAddress()
  TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(nic);
  // also: we don't specify synchronizer, getting an intra-JVM syncer; there is
  // also external file-locking-based synchronizer if multiple JVMs run JUG
  UUID uuid = uuidGenerator.generate();

And that's it. Other variants are generated similarly, by creating a generator instance with the Generators factory.
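For the record, the other variants follow the same pattern; here is a quick sketch (method names per the JUG 3.x Generators factory as I recall them -- double-check against the sources until Javadocs are up):

  // random-based; defaults to SecureRandom, but any java.util.Random can be passed in
  RandomBasedGenerator randomGen = Generators.randomBasedGenerator();
  UUID u1 = randomGen.generate();

  // name-based (MD5 by default), hashing the given name
  NameBasedGenerator nameGen = Generators.nameBasedGenerator();
  UUID u2 = nameGen.generate("www.cowtowncoder.com");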

Monday, June 28, 2010

Async HTTP Client 1.0 released

Ok, this announcement went out last week, but it is important to re-iterate: version 1.0 of Ning async HTTP Client is now out!

This is good news for multiple reasons: obviously, the ability to do asynchronous (read: non-blocking, i.e. no need to use one thread for each and every open connection) HTTP communication is valuable in itself. But I also hope that this spurs some friendly competition in improving all HTTP-based communication on the Java platform -- there are still features that are not implemented, and due to the complexity of efficient connection handling, there should still be room for improvement in performance as well. And finally, this should lead to wider adoption of this relatively new library, so that it gets properly battle-tested and proven.

I am actually planning on using this client for cases where a regular blocking client would work, to see how well it performs and exactly how easy it is to use a non-blocking API, compared to existing blocking alternatives.

Tuesday, June 08, 2010

Introducing My Newest Open Source project: Tr13 (that is so 1337!)

I have briefly mentioned that I have been working on a new little low-level data structure package, called (Ning) Tr13, hosted at GitHub. Now that the code is getting ready and tested, it is time to "formally" introduce it.

Tr13 is an interesting project from multiple angles: for one, it is my first "corporate-sponsored" Open Source project (i.e. related to my day-to-day work -- although not the focus of it, just a useful thing for me and hopefully my colleagues at Ning). This alone is worth celebrating; the whole process has been remarkably smooth, from a 15-minute end-to-end approval process (take note, Amazon developer community...) to steady if somewhat slow development itself (there is nothing groundbreaking from a CS perspective -- more on this below).
This is also a rare case of my already having a reasonably clear idea of where I would be using this package. It may sound surprising, but truthfully, most open source libraries I have written I have not absolutely needed before writing them. :-)
(there are exceptions: Java UUID Generator is something I wrote because I needed it -- but not so for Woodstox, or even Jackson to some degree, both of which I have used extensively only after writing them).
And finally, well, it is a change from my more common data format parsing / data binding work. Although I was thinking of "climbing up the stack" to write something higher level, I ended up going the other way -- a slight detour.

Anyway, I digress. So...

What exactly is tr13?

It is a space-efficient ("compact") read-only implementation of the trie data structure, with a slight modification to use a Patricia/radix structure for leaves (but not for branches, since that did not seem useful for my use cases).

The current implementation supports two kinds of tries:

  1. Byte[] key, variable integer (~= long) value
  2. Byte[] key, byte[] value

In both cases, the byte[] key is typically a String, but it can be any other serializable thing; for example, ints and longs could be trivially converted to 4/8-byte keys. The separate variant for variable-length integers is a further optimization for my initial use case, where values are generally small ints that fit in a single byte, but where it is convenient not to be bound by arbitrary limits (like 256).
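For example, turning a long into an 8-byte key is a one-liner using plain JDK classes (nothing Tr13-specific here):

  byte[] key = ByteBuffer.allocate(8).putLong(value).array();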

Internally, the structure is actually just a regular byte array (or, alternatively, a ByteBuffer, to allow for "GC-free" tries!). The data format needs to be documented, but it is a rather simple implementation of basic trie ideas -- nothing to write a dissertation about.

Why does it Rock?!

Given how good the plain old JDK HashMap can be -- you could just use HashMap<String,Integer> for the first case -- what is so special about Tr13?

Simply put: memory usage, or lack thereof (ditto for GC activity). This is because:

  1. Overall memory usage is pretty tight -- for the test data I have (30 million entries, typically with a 13-byte key and a small integer value == 600 meg input file), the trie structure actually uses less memory than its file representation (by about 35%). And,
  2. Since it is all in just TWO objects (the raw byte array and the wrapper trie object), it should keep the GC rather happy.

This is not coincidental, since the use case we have is that of sizable routing tables that should ideally be kept fully in memory. Hence compactness of the data structure, with reasonable lookup times (i.e. no linear search), was the design criterion. Even other somewhat frugal data structures (like b-trees) are not anywhere near as space efficient.

Downsides?

The first obvious limitation is that it is a read-only data structure: you must have all the data at hand when building it. The data also has to be ordered -- not necessarily alphabetically (although that is the usual way), but in a stable lexicographic ordering, so that the build process has the invariants it needs. This is not as big a limitation as it might seem, at least once merging functionality (more on this later) is added. For now it may be necessary to rebuild the data structure regularly, and keep an additional "short-term" lookup data structure at hand for additions that are needed between rebuilds.

As to performance, the resulting data structure is not the fastest way to find things: from a complexity standpoint, lookup is linear in key length, compared to theoretically constant time for hash-based alternatives. This may or may not matter; it should not matter a lot with respect to key length (a HashMap has to calculate the hash of the key and compare key bytes anyway, unless keys can be canonicalized). Another potential performance concern is that the trie data structure does not have much locality; but this too is unlikely to be much worse than for typical alternatives. In fact, due to high memory efficiency and the use of a plain byte array, it is not guaranteed that HashMap has any better locality, even if a single lookup results in a match (this due to the Java GC's habit of "shuffling" object data around).

All in all, Tr13 does much more work when finding values, since it traverses the data structure once per key byte (except for the Patricia-optimized tail, which may be a multi-byte linear match). So I would not expect it to be as fast as direct hash lookups.

But in the end, these are speculations, and the real test would be use.

... or, how about a performance test?

Performance Testing

Ok: the question of efficiency is worth looking into more deeply. So I wrote a benchmark that compares VIntTrieLookup (the byte[]-to-vint trie) against HashMap<String,Integer> (for the curious, it's under 'src/java/test/manual/' in the source tree). The test loads all the entries into memory, and then performs randomized lookups (one lookup for every key, but with the order shuffled so as to differ from the 'natural' ordering of the trie).

When given a subset of the whole data file (it turns out HashMap hogs so much memory that I just can't test the whole data set!) -- a 32 meg, 2 million entry chunk -- here are the results:

Memory usage:

  1. Tr13: 20.6 megs (reduction of about one third)
  2. HashMap<String,Integer>: 314 megs

So in this case, HashMap<String,Integer> uses about 15 times more memory than Tr13 (note: Integer values are canonicalized using Integer.valueOf -- without this, usage would be 360 megs).

And what about performance? What is the cost of compactness?

For 3 million lookups, time taken was:

  1. Tr13: about 5000 milliseconds (about 400,000 lookups per second)
  2. HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000 lookups per second)

so HashMap is about 4 times faster than Tr13 for lookups, on my system (2.33 GHz Mac Mini, JDK 1.6), for the given data set.

Since my first need is simply to be able to do lookups in memory, the performance difference is not of particular importance (there might be one or two lookups per request, so anything in the tens of thousands per second would work), but it is good to know the rough order-of-magnitude difference.

What next?

At this point Tr13 should be ready for use, and I plan to release version 0.5 soon. The only thing I am waiting for is to get a Maven repository set up at the Sonatype OSS repository (kudos to Sonatype for the free hosting!), so that Maven-based (and, I assume, Ivy-based) projects can use it. This is actually what I need for my daytime use.

But beyond Maven deployment, there are some additional features I want to implement in near future, such as:

  • Ability to traverse through entries (Map.Entry<> style)
  • Ability to merge two tries efficiently -- could be built quite easily given iterator functionality
  • ... document the data structure (ok ok, not a development task, but "hard" (read: tedious) yet necessary work) -- it is very simple, but not so simple that reading code would make it obvious.
  • Current implementations are limited to 2 gigs, due to Java byte[] and ByteBuffer limitations. But fundamentally there is nothing particularly hard about writing segmented versions where only segments would have this limitation (well, probably values too, but I don't see any need for 2gb+ values in my tries...)

Merging functionality specifically would be useful for "semi-mutable" lookup data structures. For data that evolves slowly (which is the case for the routing table use case I have), it is possible to keep track of needed modifications (in, say, a HashMap), do the primary lookup against that data structure, and only if no match is found, look in the immutable trie. And when this short-term data structure grows enough, or enough time passes, the trie can be rebuilt with the existing and new data. Rinse and repeat.

And perhaps it might even be possible to work some more on the fundamental data structure itself, and find ways to make it mutable. Insertion (and deletion) operations would be rather costly, due to the global changes needed (another downside of compactness is that changes tend to spill far). But even slow mutability would be better than no mutability; and perhaps there could be ways to trade some space (intentional unused regions within the trie, to be used for insertions?) for a lower likelihood of having to move all the data around.

Wednesday, April 28, 2010

Making Java Reflection field access 40 times faster

Think it can not be done? Think again!

Here is a useful nugget of information that should be common knowledge among performance-conscious Java developers -- but based on code I've seen in the wild it might not yet be, hence this entry.

Basically, if you just do this:

  Field f = MyBean.class.getDeclaredField("id");
  f.set(beanInstance, valueForField);

performance stinks -- setting the value can be hundreds of times slower than direct field access. But if you add this little call:

  f.setAccessible(true);

in there, the overhead is drastically reduced. In a simple micro-benchmark running on my old desktop the speed difference is 40x, and while YMMV, it will be an order of magnitude or more; this ratio does not seem to improve with newer JVMs (it has been steady across 1.4, 1.5 and 1.6). This actually takes the overhead down enough to make reflection-based access "fast enough" for most use cases: Jackson, for example, just uses it and does not bother with bytecode generation to create the most optimal access (but who knows? maybe it will one day).

So what is the difference? The access checks that are done by default are heavyweight, as can be seen from profiler stack traces and JDK sources. All the setAccessible() method does is set a boolean flag that bypasses these checks. This works for Methods and Constructors as well, although with slightly less pronounced results.
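If you want to verify the effect yourself, a minimal micro-benchmark could look something like the sketch below (not the benchmark referenced above; and as with any micro-benchmark, let the JIT warm up and repeat runs before trusting the numbers):

import java.lang.reflect.Field;

public class FieldAccessBench
{
    public static class MyBean { public int id; }

    public static void main(String[] args) throws Exception {
        MyBean bean = new MyBean();
        Field checked = MyBean.class.getDeclaredField("id");
        Field unchecked = MyBean.class.getDeclaredField("id");
        unchecked.setAccessible(true); // bypasses per-call access checks

        final int REPS = 10000000;
        long start = System.nanoTime();
        for (int i = 0; i < REPS; ++i) { checked.set(bean, i); }
        long checkedNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < REPS; ++i) { unchecked.set(bean, i); }
        long uncheckedNanos = System.nanoTime() - start;

        System.out.println("default: " + (checkedNanos / 1000000) + " ms, setAccessible(true): "
                + (uncheckedNanos / 1000000) + " ms");
    }
}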

Anyway, thought others might find this useful.

