Thursday, December 17, 2009

On good, efficient data formats

Two fairly recent additions to the category of "good binary data formats that work nicely with Java" are Avro and Kryo. I have meant to write something about both for a while.

1. Avro

Avro is a simple and efficient general-purpose data format, developed as part of the Hadoop project. Due to its background, it should work very well with Hadoop (and map/reduce systems in general). It is also quite similar to what I had been thinking of implementing for my own (well, my employer's, rather) large-scale data processing needs, back when there was no Avro. From my shallow understanding of Avro, it seems to nicely fit the bill as a data format for huge sequences of records, but with the self-describing property that is sadly lacking from other contestants like Google's protobuf.

I will hopefully have some more tangible notes to add in the future: for now it's enough to note that Avro's performance seems to be pretty good, at least in the "thrift-protobuf" benchmark.

2. Kryo

Strictly speaking, Kryo is not a data format, but rather a Java object serialization framework that happens to define a data format to handle its main task. But since sending POJOs back and forth over the wire is a very common (perhaps the most common) task for data formats in the Java world, this is not a big difference.

The first thing I noticed was its good performance (on the above-mentioned benchmark). This is nice, since there is already JDK default serialization that performs adequately for most tasks; so anything else that does binary serialization should be able to meet and beat that performance baseline, to be of interest. But just as importantly, the API seems straightforward, simple, and adequately customizable.
So I have reasonably high expectations for this library -- it could be a nice complement to something like, say, Dirmi (an RMI alternative to the JDK default one -- should be coupled with an alternative, similarly improved, serialization mechanism, n'est-ce pas?).
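
To make the "JDK default serialization" baseline concrete, here is a minimal sketch of a round trip through plain java.io serialization. The Point class and the JdkSerializationBaseline helper are hypothetical names made up for illustration; this shows only the baseline that something like Kryo would be measured against, not Kryo's own API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class JdkSerializationBaseline {
    // Serialize any Serializable object to a byte array using JDK default serialization
    static byte[] toBytes(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the object back from the byte array
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = toBytes(new Point(3, 4));
        Point copy = (Point) fromBytes(data);
        System.out.println(data.length + " bytes; x=" + copy.x + ", y=" + copy.y);
    }
}

// Hypothetical POJO used only for this illustration
final class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}
```

Timing these two methods in a loop over representative objects gives the number any alternative serializer has to beat.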

3. Disclaimer

Alas, I have not yet come up with a compelling use case for these two projects. But given the promise they hold, I should be able to test them out come next year.

Tuesday, December 08, 2009

JSON data binding performance (again!): Jackson / Google-gson / JSON Tools... and FlexJSON too

(note: this is a follow-up to earlier measurements)

1. A New Contestant: FlexJson

After realizing that FlexJson is actually capable of both serialization and deserialization (somehow I thought it would only serialize things), I decided to add it as the fourth contestant in the "full service Java/JSON data binding" category of tests.

Initially I was a bit discouraged to find that it makes one rookie mistake: it assumes that JSON somehow comes in (and goes out) as Java Strings. But aside from this glitch, the package actually looks quite solid -- and its exclusion/inclusion mechanism looks interesting. Maybe not exactly my cup of joe (if it was, after all, the Jackson API would look more like it), but a viable alternative. And I can see how the ability to prevent deep copy would come in handy sometimes. And finally, some of the features actually exceed what Jackson can currently do, regarding polymorphic deserialization (since FJ includes the class name by default, I assume it can do it) and some level of cyclic-dependency handling (ignoring serialization of cyclic references at least).

So let's see how the "rookie" (yes, I know, it's not exactly a new package, just a new addition to the test) fares...
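
The cost of a String-only interface shows up as extra conversion steps on both sides. Here is a minimal sketch of what the adapter code has to do when data really arrives and leaves as byte streams. The String-based library call is represented by a plain Function<String, String>, which is an assumption made for illustration and not FlexJson's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

final class StringApiAdapter {
    // Step 1: decode the whole byte stream into a String (extra copy on the way in)
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Step 2: call the String-based API; step 3: re-encode the result as bytes (extra copy on the way out)
    static void process(InputStream in, OutputStream out, Function<String, String> stringApi) {
        String json = readAll(in);
        String result = stringApi.apply(json);
        try {
            out.write(result.getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "{\"name\":\"test\"}".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Identity function stands in for "parse and serialize again"
        process(new ByteArrayInputStream(input), out, s -> s);
        System.out.println(out.size() + " bytes written");
    }
}
```

A library that reads from and writes to streams directly can skip both intermediate Strings.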

2. Test setup

Tests are run using the nice Japex performance test framework, running on my somewhat old AMD workstation (~1700 MHz Athlon -- someone needs to click on those right-hand-side ads to get me a new performance-testing workstation! :-) ).

Input data used consists of serialization of tabular data (database dump, good old "db100.xml" used by countless xml tests), converted to Java POJOs, and then to individual data formats (here as JSON, but can be tested as XML and whatnot). Document size is 20k in XML, and slightly less in JSON (about 16k). It would be easy to run using other data sets, but in the past, performance ratios for 2k, 20k and 200k documents have not had radical differences, so 20k one seems like a reasonable choice (but note that the earlier benchmark did in fact use 2k documents, so actual numbers do differ).

Test project itself, "StaxBind" is still in Woodstox SVN repository, accessible via Codehaus SVN page. (one of these days I should just create a Github project -- but not today).

Versions of JSON processing packages are as follows:

  • Jackson 1.2.0
  • Google-gson 1.4
  • Json-tools-core 1.7
  • Flexjson-1.9.1

Code for each library is using default settings, and using what appears as the most efficient interface, for cases where transformations are from byte streams on server side (byte streams in, byte streams out).

3. Results

First things first: here's the money shot:

Data Binding Performance Graph

(or check out the full results for details)

Another way to represent results is by showing performance ratios, using the slowest implementation as base line (TPS == transactions per second; number of times a 20k document is read, written, or both):

(note: Jackson/manual is omitted since it is hand-written (if simple) serializer/deserializer, and there are no direct counterparts for other packages -- while it would give even bigger faster-than-thou ratio, it wouldn't be a fair comparison)

Impl                 Read (TPS)  Write (TPS)  Read+Write (TPS)  R+W, times baseline
Jackson (automatic)    1599.272     2463.097          1033.809                 25.6
FlexJson                125.277      125.277            94.904                 2.35
Json-tools               94.051      126.954            49.008                 1.2
GSON                     56.58       112.455            40.38                  1
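
The last column is simply each implementation's read+write throughput divided by that of the slowest one. As a small sketch, using the numbers from the table above (the class and method names are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BaselineRatios {
    // Divide each throughput by the smallest one, so the slowest implementation gets 1.0
    static Map<String, Double> timesBaseline(Map<String, Double> tps) {
        double slowest = Double.MAX_VALUE;
        for (double v : tps.values()) {
            slowest = Math.min(slowest, v);
        }
        Map<String, Double> ratios = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : tps.entrySet()) {
            ratios.put(e.getKey(), e.getValue() / slowest);
        }
        return ratios;
    }

    public static void main(String[] args) {
        // Read+Write (TPS) column from the table
        Map<String, Double> readWriteTps = new LinkedHashMap<>();
        readWriteTps.put("Jackson (automatic)", 1033.809);
        readWriteTps.put("FlexJson", 94.904);
        readWriteTps.put("Json-tools", 49.008);
        readWriteTps.put("GSON", 40.38);

        for (Map.Entry<String, Double> e : timesBaseline(readWriteTps).entrySet()) {
            System.out.printf("%-20s %.2f%n", e.getKey(), e.getValue());
        }
    }
}
```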

So it looks like our "new kid on the block" manages to outperform the other two non-Jackson JSON processors here. And at least gets within an order of magnitude of Jackson... :-)

4. Musings

So it turns out that despite its interfacing (those String/byte conversions), Flexjson package manages to work more efficiently than some other packages that claim "simplicity and performance". And this without actually claiming to be particularly performant, but rather focusing on design of API and ease-of-use aspects. Pretty neat, I respect that.

5. Next?

My current main interest (with respect to performance issues) lies in the area of compressing data for transfer: after all, most of the time there is a relative abundance of CPU power compared to available network and I/O bandwidth. This means that trading some CPU (needed for compression and decompression) seems like a bargain for many use cases.

But on the other hand, as we saw earlier, the question is "how much is too much". And that's where my new favorite simple-and-fast algorithm, LZF, comes in. But that's a different story.

Wednesday, November 11, 2009

Just like Java RMI, Just Better: Dirmi

Here is another interesting, new, and hopefully useful project for developers of distributed Java systems: Dirmi. It is essentially a replacement and upgrade for plain old Java RMI: it both addresses most existing issues with vanilla RMI and extends the set of functionality. And it does all of that with a few usability improvements. Sounds pretty good to me.

So why do I think this might be a very good replacement? Beyond reading its feature set, I have not yet really used it. But I have confidence knowing its author, who has written such solid packages as Carbonado (ORM to use with BDB, amongst other backends) and Cojen (code generator). I hope Dirmi will get a bit more exposure in the near future (as well as Carbonado, which seems to be somewhat of a well-kept secret and deserves to be more widely known). I will try to write a follow-up if and when I get to play with Dirmi a bit.

Wednesday, September 23, 2009

JSON data binding performance: Jackson vs Google-gson vs BerliOS JSON Tools

UPDATED: see a more up-to-date version here

Earlier I published some results on performance of "simple" JSON parsing -- simple meaning that processing is manual, to allow for processing JSON using the wide variety of Java+JSON tools available. This includes processors from ultra-fast streaming processors (like Jackson) all the way to the "good old JSON.org" parser. But it also excluded at least one potentially good tool (google-gson), since the test requires "untyped" access: the ability to traverse arbitrary JSON structure.

Also: more and more access is nowadays done using a more convenient class of tools, called data binding (or mapping; or sometimes serialization) tools (libraries, packages). In such cases application just asks library to convert JSON to a Java Object (or vice versa), and that's about it. Very convenient; especially for strongly typed web services.

So, with that background, let's see what the performance characteristics of available tools are.

1. JSON Data Binding: Contestants

Now, the list of tools that allow doing this is somewhat limited. I am aware of the following:

  • Jackson
  • Google-gson
  • JSON Tools (from BerliOS)

Given that all of them can do conversions with similar ease (at least for simple Java types), is there much difference in performance? To figure this out, I will be using somewhat incorrectly named StaxBind (really, it should be renamed PojoBind or something) sub-project of Woodstox. Data to bind is a simple rendition of tabular data, with List of beans that contain personal information (name, address and so on); document size (for this test) being about 2 kilobytes.

2. Results!

And yes, indeed, results look vaguely familiar (see here, for example). Considering the "bigger is better" aspect -- value measured, "tps", is number of documents read, written, or read-modify-written per second -- difference from slowest (google-gson) to fastest (Jackson) is a solid order of magnitude.

Data Binding Performance Graph

Looks like Jackson is still the King of JSON, regarding processing speed -- and by a ridiculously high margin too... If you are already a Jackson user, you may want to congratulate yourself on choosing a very efficient (even green! save those cycles!) tool. A pat on your back might be warranted as well. To put performance in perspective: being able to read ten thousand 2k documents per second (throughput of about 20 megabytes per second), on an almost obsolete AMD Athlon based PC (my home PC), is not too shabby; and all this with little if any glue code.

Actually, as you can see, there is one (and only one!) thing faster than Jackson Data Mapper: "raw" hand-written data mapper. And even that is just a bit faster; probably only worth the extra hand-written code for high-volume use cases, or where number of POJO types is very limited.

3. Some details

Given the big difference in perceived performance, avid readers might be interested in reproducing results, or at least perusing source code. All code is within the "staxbind" module in the primary Codehaus Woodstox SVN repository, and the author (me!) can be contacted for more details (for some reason the Codehaus interface sometimes makes access a bit harder than it needs to be), questions and suggestions.

But there is nothing particularly complicated about the code; here's what the core methods for tested packages actually look like (interfaces are defined by the StaxBind package itself; template T translates to "DbData" (POJO type)).

3.1 Jackson test code

Jackson code is the simplest of the alternatives, as it supports direct streaming access.

public class StdJacksonConverter<T> extends StdConverter<T>
{
    final ObjectMapper _mapper = new ObjectMapper();
    //...

    // Bind JSON read from the byte stream directly to a POJO
    public T readData(InputStream in) throws IOException {
        return _mapper.readValue(in, _itemClass);
    }

    // Serialize the POJO as UTF-8 encoded JSON straight to the byte stream
    public int writeData(OutputStream out, T data) throws Exception {
        JsonGenerator jg = _jsonFactory.createJsonGenerator(out, JsonEncoding.UTF8);
        _mapper.writeValue(jg, data);
        jg.close();
        return -1;
    }
}

3.2 Json-tools test code

Test code here needs a couple of more lines, since there is no way to directly go from POJOs to stream/String and back. But nothing excessive.

public class StdJsonToolsConverter<T> extends StdConverter<T>
{
    final JSONMapper _mapper = new JSONMapper();
    //...

    public T readData(InputStream in) throws Exception {
        // two-step process: parse to JSON value, bind to POJO
        JSONParser jp = new JSONParser(in);
        JSONValue v = jp.nextValue();
        return (T) _mapper.toJava(v, _itemClass);
    }

    public int writeData(OutputStream out, T data) throws Exception {
        // two-step process again: map POJO to JSON value, render as String
        JSONValue v = _mapper.toJSON(data);
        String jsonStr = v.render(false);
        OutputStreamWriter w = new OutputStreamWriter(out, "UTF-8");
        w.write(jsonStr);
        w.flush();
        return -1;
    }
}

3.3 Google-gson test code

This test code is a bit shorter than the Json-tools one, since the package does not use an intermediate tree form. Surprisingly, this does not seem to translate to better performance, as the package ends up taking its time doing conversions. On a positive note, there should be plenty of room for improvement in this area...

public class StdGsonConverter<T> extends StdConverter<T>
{
    final Gson _gson = new Gson();

    // Gson works on character streams, so bytes are decoded as UTF-8 first
    public T readData(InputStream in) throws IOException {
        return _gson.fromJson(new InputStreamReader(in, "UTF-8"), _itemClass);
    }

    public int writeData(OutputStream out, T data) throws Exception {
        OutputStreamWriter w = new OutputStreamWriter(out, "UTF-8");
        _gson.toJson(data, w);
        w.flush();
        return -1;
    }
}

Monday, May 11, 2009

Jackson JSON-processor turns 1.0.0

Ok: it is now official: the official Jackson JSON-processor version 1.0.0 has just been released. Get it while it's Hot!

Wednesday, May 06, 2009

json+gzip nicely packed, but has it Got Speed?

One commonly occurring theme in discussions on the merits (or lack thereof) of textual data formats is the question "but does the size matter?" That is: while textual formats are verbose, they can be efficiently compressed using common everyday algorithms like Deflate (the compression algorithm that gzip uses). From an information theory standpoint, equivalent information should compress to the same size, given an optimal compressor, regardless of how big the uncompressed message is. And this is quite apparent if you actually test it out in practice: even if message sizes between, say, xml, json and binary xml (such as Fast Infoset) vary a lot, gzipping each gives roughly the same compressed file size.

But what is less often measured is how much actual overhead compression incurs, especially relative to other encoding/decoding and parsing/serializing overhead. Given all the advances in parsing techniques and parser implementations, this can be significant overhead: compression is a much more heavy-weight process than regular streaming parsing; and even decompression has its costs, especially for non-byte-aligned formats.

So: I decided to check the "cost of gzipping" with Jackson-based json processing. Using the same test suite as my earlier JSON performance benchmarks, I got the following results.

First, processing small (1.4k) messages (database dumps) gives us the following results:
(full results here)

and medium sized (16k): (full results)

(just to save time -- results using bigger files were very similar to the medium ones, regarding processing speed)

So what is the verdict?

1. Yes, redundancies are compressed away by gzip

Hardly surprising is the fact that JSON messages in this test compressed very nicely -- result data (converted from ubiquitous "db10.xml" etc test data) is highly redundant, and thereby highly compressible.

And even for less optimal cases, just gzipping generally reduces message sizes by at least 50%, similar to compression ratios for normal text files. This is usually slightly better than what binary formats achieve; oftentimes even including binary formats that omit some of the non-redundant data (like Google Protocol Buffers which, for example, requires the schema to contain field names and does not include this metadata in the message itself).

2. Overhead is significant, 3x-4x for reading, 4x-6x for writing
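
How well redundant JSON compresses is easy to check with the JDK's own java.util.zip classes. A minimal sketch follows; the sample document is made up for illustration and is far more repetitive than most real data, so the ratio it prints says nothing about the test documents used here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class GzipSizeCheck {
    // Compress a byte array with gzip (Deflate) and return the compressed bytes
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Build a made-up, highly repetitive JSON array of "rows"
    static byte[] sampleJson(int rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < rows; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"id\":").append(i)
              .append(",\"firstName\":\"John\",\"lastName\":\"Smith\",\"city\":\"Denver\"}");
        }
        return sb.append(']').toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleJson(100);
        byte[] packed = gzip(raw);
        System.out.println("raw=" + raw.length + " bytes, gzipped=" + packed.length + " bytes");
    }
}
```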

But it all comes at a high cost: overhead is highest for the smallest messages, due to significant fixed initialization cost (buffer allocations, construction of Huffman tables etc). But even for larger files, reading takes about three times as long as without compression, if we ignore possible reading speed improvements due to reduced size. And the real killer is the writing side: compression is the bottleneck, and you'll be lucky if it takes less than five times as long as writing regular uncompressed data.
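
A rough way to see this overhead on one's own machine is to time plain writes against gzipped writes and gunzipped reads. This is only a crude sketch (no proper warm-up, made-up payload, hypothetical class name), not the Japex-based benchmark used for the numbers here, so absolute figures will differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipOverheadSketch {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Uncompressed "write" used as the point of comparison
    static byte[] plainCopy(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        out.write(raw, 0, raw.length);
        return out.toByteArray();
    }

    // Average nanoseconds per call of the given task (rounds must be positive)
    static long nanosPerCall(Runnable task, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / rounds;
    }

    public static void main(String[] args) {
        // Made-up payload of roughly 16k, standing in for a medium-sized message
        StringBuilder sb = new StringBuilder();
        for (int i = 0; sb.length() < 16 * 1024; i++) {
            sb.append("{\"id\":").append(i).append(",\"name\":\"row").append(i).append("\"},");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        int rounds = 2000;

        System.out.println("plain write: " + nanosPerCall(() -> plainCopy(raw), rounds) + " ns");
        System.out.println("gzip write:  " + nanosPerCall(() -> gzip(raw), rounds) + " ns");
        System.out.println("gunzip read: " + nanosPerCall(() -> gunzip(packed), rounds) + " ns");
    }
}
```

Since this times only the (de)compression step, without any JSON parsing or generation, the ratios it shows are not directly comparable to the 3x-6x figures above.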

3. Is it worth it?

Depends: how much is your bandwidth (or storage space) worth, relative to the CPU cycles your program spends?
For optimal speed, the trade-off does not seem worth it, but for distributed systems costs may be more on the networking/storage side, and if so compression may still pay off. Especially so for large-scale distributed data crunching, like doing big Map/Reduce (Hadoop) runs.

Or how about this: for "small" message (1.4k uncompressed), you can STILL read 22,000, write 12,000, or read+write 8,000 messages PER SECOND (per CPU). That is, what, about 7900 messages more processed per second than what your database can deal with, in all likelihood. Without compression, you could process perhaps 14,000 more messages for which no work could be done due to contention at DB server, or some other external service... speed only matters if the road is clear.

Yes, it may well make sense even if it costs quite a bit. :-)

4. How about XML?

If I have time, I would like to verify how XML+GZIP combination fares: I would expect same ratios to apply to xml as well. The only difference should be that due to somewhat higher basic overhead, relative additional overhead should be just slightly lower. But only slightly.

Sunday, February 22, 2009

Update I on Update of Json-parsing performance

After writing the entry about parsing performance measurements, I got feedback leading to a bit more complete test. Specifically, one of the packages (json.simple) actually does offer a streaming API as well. So I ended up adding one more test case. Turns out that the package in question gets some measurable boost from this (throughput +15-25%), see the full updated results. And here's the "quick pic" as well:

Performance Graph

Also, one thing the original entry did not cover was how to interpret the results. Here's a brief summary:

  • 'results' marked with 'KB' just indicate size of the parsed document (same for all parsers)
  • actual results are in 'tps' (transactions per second), and "bigger is better": transaction here is a single parse through the doc and accumulation of field counts.

(and for more, you may want to check out how Japex works in general).

Hope this helps.

Tuesday, February 17, 2009

Update on State of Json-parsing Performance

(22-Feb-2009, NOTE: there is an update to this update with even more up-to-date results!)

It has been good year and a half since I blogged about Json performance ("More on JSON performance in Java (or lack thereof)" ).
So it is about time to revisit the question and see what is the state of the art with Java Json processing today.

This time I will be using a somewhat more full-featured performance benchmark framework (codename "StaxBind"), which is based on Japex and allows for easy comparison of different data format / library combinations for different tasks. Initially aimed at comparing data binding performance for xml processing (hence the name), it is growing into a more general-purpose framework for testing data format processing performance.
For now the module is available from the Woodstox repository, and contains a few test cases including the one used here.

Since benchmarks are run using Japex, the results should be more informative as well as reproducible; plus, we get some pretty graphs to look at.

1. Test case: "json-field-count"

The specific test used from StaxBind is "json-count", a test designed to allow testing a wide selection of available Java Json parsers. Test code essentially traverses through given Json documents, counting instances of field names; results are verified before each test to ensure that all parsers (or rather, test drivers for parsers) see the same data.
This traversal operation is not overly meaningful in itself, but it is easy to implement for most parsers (see below for exceptions), and should be reasonably fair and representative regarding expected processing performance. The other more obvious choice would be a data binding test -- I hope to cover that later on -- but that will mean writing much more test code for packages that do not support automatic data binding.
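
The counting itself is simple. As a sketch of the idea only: assuming the document has already been parsed into plain java.util Maps and Lists (a stand-in for whatever tree model a given parser exposes; the actual StaxBind test drivers use each parser's own API), the traversal looks roughly like this:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FieldCounter {
    // Recursively walk the tree, incrementing a counter for every object field name seen
    static void count(Object node, Map<String, Integer> counts) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                counts.merge(String.valueOf(e.getKey()), 1, Integer::sum);
                count(e.getValue(), counts);
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                count(child, counts);
            }
        }
        // scalars (strings, numbers, booleans, null) contain no fields
    }

    // Convenience wrapper returning counts sorted by field name
    static Map<String, Integer> countFields(Object root) {
        Map<String, Integer> counts = new TreeMap<>();
        count(root, counts);
        return counts;
    }

    public static void main(String[] args) {
        // Hand-built equivalent of {"rows":[{"id":1,"name":"a"},{"id":2,"name":"b"}]}
        Object doc = Map.of("rows", List.of(
                Map.of("id", 1, "name", "a"),
                Map.of("id", 2, "name", "b")));
        System.out.println(countFields(doc));
    }
}
```

Comparing the resulting maps across parsers is what makes it possible to verify that every test driver saw the same data.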

2. Sample documents

For testing I chose 3 different Json documents:

Document sizes vary from 3 to 15 kB; fairly small, but enough to show the trend in parsing performance. This is not a great set of documents to use, but since there is no generally accepted set of Json test documents available (or if there is, please let me know!), it will have to do.

3. Parsers compared

I decided to choose parsers to test from json.org's java parser implementation list. I think that is the most likely starting point for developers; and it is reasonably complete list as well.

Not all libraries on the list qualify. Specifically:

  • Some libraries included use another Json parser: for example, both XStream and Jettison use the "json.org" reference implementation as the underlying parser
  • Some libraries can only generate Json, not parse it (such as flex-json)
  • One otherwise decent-looking candidate (Google-gson) only seems to implement data binding functionality, but not a streaming or tree-based alternative (I am hoping to include it in the data-binding tests)

Given this, here are the contestants:

  • Json.org reference implementation: the "standard" choice most developers start with
  • Json Tools from Berlios (full-featured, well-documented)
  • Json-lib (another fairly full-featured package)
  • Json-simple from Google code
  • StringTree JSON (delightfully compact code; alas very simplistic regarding well-formedness checks)
  • Jackson (the reigning champion from the last test), version 0.9.8

Of these, all implement a tree-model; some also implement data binding (json-tools and Jackson at least), but only Jackson appears to implement pure streaming interface. For this reason, there are 2 tests for Jackson: one using Tree model, the other streaming API.

4. Results

After letting Japex churn through the test for almost an hour, we get the actual results. Full result data and graphs can be found here (also: here are results from another test run, this one on a more modern dual-core system). Both runs are on a Linux desktop machine, using a recent JVM (1.6.0 update 10 or 12).

But here is the main graph (from the first test run) that summarizes results:

Performance Graph

It looks like Jackson is still rather more efficient at parsing than the rest: not only is the core streaming parser very fast, even the tree-based alternative does quite well. In fact, graph readability suffers a little bit from Jackson's dominance.

As for the rest, StringTree parser performs a bit better than the others. But the biggest surprise may be the fact the reference implementation is faster than most alternatives; despite the claims made for these alternatives (I have yet to find a library that doesn't claim to be light-weight and fast :) ). In a way that's good -- at least most developers are not using the slowest available parsers.

Monday, February 09, 2009

Fast Object Serialization with Jackson (json), Aalto (xml)

More interesting performance benchmarking:

http://technotes.blogs.sapo.pt/1708.html

Looks like Aalto is not the only thing that serializes objects to text very fast: Jackson does some ultra-sonic processing as well!
It will be interesting to see how the other side (deserialization, ie. parsing) performs: my experiences suggest that this is where Jackson+json really shines (as well as Aalto, relative to sjsxp). But it is good to know that even serialization is faster with these 2 libraries than with any other textual alternative, bar none.

About me

  • I am known as Cowtowncoder