Monday, April 04, 2011

Introducing "jvm-compressor-benchmark" project

I recently started one new open source project; this time being inspired by success of another OS project I had been involved in, project is "jvm-serializers" benchmark originally started by Eishay and built by a community of java databinder/serializer experts. What has been great with this project has been amount of energy it seemed ot feed back to development of serializers: highly visible competition for best performance seems to have improved efficiency of libraries a lot. I only wish we had historical benchmark data to compare to see exactly how far have the fastest Java serializers come.

Anyway, I figured that there are other groups of libraries where high performance matters, but where there is lack of actual solid benchmarking information. So while there are a few compression performance benchmarks, they are often non-applicable for Java developers: partly because they just compare native compressor codecs, and partly because focus is more often only on space-efficiency (how much compression is achieved) with little consideration of performance of compression. The last part is particularly frustrating as in many use cases there is significant trade-off between space and time efficiency (compression rate vs time used for compression).

So, this is where the new project -- "jvm-compressor-benchmark" -- comes from. I hope it will allow fair comparison of compression codecs available on JVM, to be used by Java and other JVM lagnuages; and also bring in some friendly competition between developers of compression codecs.

First version compares half a dozen of compression formats and codecs, from the venerable deflate/gzip (which offers pretty good compression ratio with decent speed) to higher-compression-but-slower-operation alternatives (bzip2) and lower-compression-but-very-fast alternatives like lzf, quicklz and the new kid on the block, Snappy (via JNI).

And although the best way to evaluate results is to run tests on your machine, using data sets you care about (which I strongly encourage!), Project wiki does have some pretty diagrams for tests run on "standard" data sets gathered from the web.

Anyway: please check the project out -- at the very least it should give you an idea of how many options there are above and beyond basic JDK-provided gzip.

ps. Contributions are obviously also welcome -- anyone willing to tackle Java version of 7-zip's LZMA, for example, would be most welcome!

Tuesday, January 04, 2011

Annual Update on State of Java JSON data binding performance

Yes, 'tis the season for performance measurements again: last time I covered this subject about a year ago with "JSON data binding performance (again!)".
During past 12 months many of the tested libraries have released new versions; some with high hopes for performance improvements (Gson 1.6, for example, was rumored to have faster streaming parser). So it seems prudent to have another look at how performant are Java JSON data binding libraries currently available.

1. Libraries tested

Libraries tested are the same as last time; with versions:

  1. Jackson 1.6 (was 1.2)
  2. Gson 1.6 (was 1.4)
  3. Json-tools (core, 1.7) (no change)
  4. Flex-json 2.1 (was 1.9.1)

(for what it's worth, I was also hoping to include tests for "json-marshaller", but lack of documentation coupled with seeming inability to parse directly from stream or even String suggested that it's not yet mature enough to be included)

2. Test system

I switched over to the light side at home (replaced my old Athlon/Linux box to a small spunky Mini-Mac, yay!), so the test box has a 2.53 GHz Intel Core 2 Duo CPU. This is somewhat box; it seems to be about 4x as fast for this particular task. Test is single-threaded, so it would be possible to roughly double the throughput with different Japex test setup; however, test threads are CPU-bound and independent so seems to be little point in adding more concurrency.

Test code is available from Woodstox SVN repository (under "staxbind" folder), and runs on Japex. Converters for libraries are rather simple; data consists of medium-sized (20k) documents for anyone interested in replicating results.

3. Pretty Graphs

Ok, here is the main graph:

Data Binding Performance Graph

and for more data, check out the full results.

4. "But what does it MEAN?"

As with previous tests, upper part of double-graph just indicates amount of data read and/or written (which is identical for first three, but flex-json seems to insist including some additional type metadata), which can be ignored; the lower-graph indicates through-put (higher bar means faster processing) and is the interesting part.

There are three tests; read JSON (into data object(s)), write JSON (from data object(s)) and read-then-write which combines two operations. I use last result, since it gives reasonable approximation for common use with web services where requests are read, some processing done, and a response written.

From graph it looks like results have not changed a lot; here is the revised speed ratio, using the slowest implementation (Gson) as the baseline:

Impl Read (TPS) Write (TPS) Read+Write (TPS) R+W, times baseline
Jackson (automatic) 7240.202 9161.873 4023.464 14.75
FlexJson 721.743 1402.848 462.594 1.69
Json-tools 524.119 1007.068 341.123 1.25
GSON 714.106 462.935 272.637 1

5. Thoughts?

Not much has changed; Jackson is still an order of magnitude faster than the rest, and relative ranking of the other libraries hasn't changed either.

Friday, October 29, 2010

Faster compression on Java platform with LZF

One open source project that I have worked on a bit lately is compress-lzf, a simple Java library that implements block- and stream-accessible version of LZF compression algorithm. LZF algorithm itself is very simple, and mostly (only?) documented here, in form of working reference implementation that doubles as documentation.

Our implementation is a simple rewrite of C version, and fully compatible with the reference implementation including file header and block signature (this is different from the other existing Java version that projects like H2 include). While simple implementation, it is not naive translation; and some effort has been spent on optimizing its performance.

And as should be the case for such low-level libraries, there are no external dependencies beyond JDK 1.5; meaning it should work on variety of platforms, including Android (in fact it could probably be compiled on any 1.x JDK)

1. What is LZF?

LZF compression algorithm is basic Lempel-Ziv, with little additional processing; so one way to think of it is that it is "gzip without Huffman encoding" (gzip uses Deflate, which is Lempel-Ziv followed by Huffman encoding, with variable bit-length).
And Lempel-Ziv is just method whereby one finds repeating sub-sequences in output, replacing sequences of three or more "already seen bytes" with a back-reference; matches are typically stored in a hash table indexed by hash of first 3 bytes, value being offset of last instance of potentially matching sequence.

It is worth noting that many other similar (and sometimes similarly named) algoritms likewise use basic LZ; "Fastlz" for example is very similar.

2. Gzip exists, why bother with LZF?

But why does this matter? If gzip is this and more, why would you want to use a "lesser" algorithm?

Key benefit is that of speed: LZF can be up to twice as fast to uncompress; but where it REALLY shines is compression speed -- here it can be 2x or 3x as fast (writing a stand-alone benchmark is on my todo list, so hopefully I will get to talk a bit more about this in future). Performance difference is mostly due to relatively high overhead that deflate uses for non-byte-aligned output (which is needed for optimal Huffman-encoding of codewords) -- while it definitely helps compression, it is expensive in terms of CPU usage.

LZF compression rate is lower than with gzip, but compression profile is similar: things that compress well with gzip also compress well with LZF; and those that compress less do so with LZF as well. This makes sense given that most compression often comes from LZ algorithm, which is implemented similarly.

So it all comes down to this: gzip compression is very CPU-intensive (read: slow); so much so that you typically use 5x as long to compress things like JSON data than writing content itself. With fast networks and I/O this often becomes bottleneck.
LZF can be a good alternative since its CPU overhead is much more modest; and it can still compress "fluffy" content pretty well -- where gzip can compress something by, say, 90%, LZF usually gets to 75-80% ratio (this obviously varies by types of content; basically depends on which part of deflate algorithm is more effective)

3. Usage

Assuming that above sounds interesting, you can get compress-lzf library either from GitHub; or via Maven. In latter case, information is:

  • <groupId>com.ning</groupId>
  • <artifactId>compress-lzf</artifactId>
  • current version at time of this writing is "0.6.0"

And from there you can use class from com.ning.compress.lzf package:

  • LZF is a simple command-line utility you can use for manual compression/uncompression (jar defines this class as the Main-Class so you can just invoke it with "java -jar compress-lzf.jar" to get usage information)
  • LZFInputStream and LZFOutputStream are compressing input (read LZF compressed data) and output (write LZF compressed data) streams, similar to JDK's GZIPInputStream and GZIPOutputStream
  • LZFEncoder and LZFDecoder are block-based codecs that can be used to encode (compress) and decode (uncompress) chunks (byte[]) of data

Both streaming and block-based interfaces have their uses; for example, when storing BLOBs in database, block API works better; and when reading data over network or from filesystem, streams make more sense.

3.1 Note: avoid LZFOutputStream.flush() ("just let it float...")

One note on LZFOutputStream: be careful with flush()ing -- just like other block compressors (including gzip), LZF can only do back-references within single block; and since LZFOutputStream.flush() forces closing of an open block, it can dramatically reduce compression rate. As such it is best to avoid flush()ing if at all possible, unless you actually want to create a logical boundary at some point.

4. Future plans for the library

While there are always improvements for any Open Source projects, I think the natural scope for LZF library is quite small. Similar to Java Uuid Generator, it is quite possible that there won't be need for major changes to implementation or API; which means that future improvements are most likely in areas of integration; helping other projects use LZF.

For example, given unique CPU/io/bandwidth trade-off LZF can offer, it could make a really nice component for large-scale data transfer and processing, for example:

  • Data processing with map/reduce (Hadoop)
  • Data storage (with frameworks like jDBI, Hibernate; or with NoSQL projects like Voldemort, Cassandra)
  • Data aggregation (logging, log processing, Scribe)
  • Caching: use LFZ with memcached?
  • Combination of high-performance data formats (Jackson Smile, binary JSON serialization) and compression -- can be combined with other use cases too.

So it will be interesting to see where things go in near future.

5. On benefits of working for a company that is comfortable with Open Source

Unlike with many earlier Open Source projects that I have worked on, much of work that went into this open source library was done on company time. It also was a team effort (Jon wrote the streaming handling, I wrote block codec); compared to most other open source projects where I have been the main author. So I have come a long way, having worked for a semi-OS-hostile company, then a bit for a neutral (agnostic?) company; and now working for one hat is actually engaged with community (Ning). And I must admit that I really like that Ning is openly committed to working with open source projects, not just using open source artifacts others make. This is not solely meant as an ad for Ning (although we are indeed hiring...), I just think it is important to give credit where it is due.

Beyond this, Ning section at Github already has some public projects, and there are literally a dozen more being cleaned up to be ready for publishing. Within just a few more months I think Ning will be more recognized for its contributions in Open Source space; and eventually become a signifant player. I see this as a major benefit, making it bit less difficult to hire the best developers; and especially so for a company that is still relatively small, and thus having less name recognition than the big boys.

Thursday, October 14, 2010

More on Java UUID Generator (JUG), a word on performance

Ok, so Java UUID Generator version 3 is out. Beyond question of what is new with this version (which I outlined in the previous entry), the next asked was "how is performance?".
My first thought was that it should be about the same as what version 2.0 has; but after some more thinking I realized that this is not a very useful answer to give.

So I wrote a simple micro-benchmark to find out (it's in 'src/main/java/perf/MeasurePerformance.java') more about performance aspects. Test uses three main generation methods (random-, time- and name-based) for JUG, and matching alternatives that java.util.UUID has (random- and limited named-based variant).

1. Performance of UUID generation, general observations

High-level summary would be:

  1. Performance between equivalent methods between JDK and JUG have similar performance (no surprise) -- JDK just does not provide time-based variant (one most commonly used by non-Java systems); nor ability to configure Random-based variant
  2. Performance between three main generation types varies a lot.

2. Actual numbers

So how big are differences? Here is actual snapshot from running the test on my MacBook: each run generates 250,000 UUIDs, and time taken (after test ran for about a minute) is reported as milliseconds:

  Test 'JDK, random' -> 974 msecs
  Test 'JDK, name' -> 230 msecs
  Test 'Jug, SecureRandom' -> 971 msecs
  Test 'Jug, java.util.Random' -> 33 msecs
  Test 'Jug, time-based' -> 39 msecs 
  Test 'Jug, name-based' -> 256 msecs

Since these values fluctuate a bit, exact values are not as important as ratios. Random-based variants are slowest for both implementations (except for case of using 'java.util.Random' -- more on this in a bit); name-hash (MD5 by default; SHA-1 available with Jug too) methods are bit faster; and time/location-based variant is rather fast, close to 50% of maximum theoretical per-node generation rate.

So generation rates for this system would be:

  • Random-based method (using SecureRandom) generates about 250k UUIDs per second
  • Name-based variant (MD5) without namespace is almost 4 times as fast, generating about 1 million UUIDs per second
  • Time-based variant (Ethernet address plus timestamp) is much faster -- almost 20 times as fast as Random-based default variant -- generating about 5 million UUIDs per second. This is also close to theoretical maximum rate for the method, which is 10 million UUIDs per second (sustained), because timestamp granularity (100 nanoseconds) prevents higher throughput (to work around this, one can generate parallel generators using different Ethernet address)

Quite a difference! It is worth noting, too, that speed of name-based method is related to both existence of namespace (this test did not use one; adding one slows things down slightly) and length of name used; latter because hash is calculated on name used. And finally, hash method used also matters (SHA-1 is slower than MD5).

But what is up with the random variant? And didn't I forget one impressive number from my analysis above?

3. Why is Random-based generation so slow?

Constructing "secure" pseudo-random numbers is not easy, and this is the main reason for slowness. Looking at stack traces, it looks like JDK java.security.SecureRandom actually uses random device provided by the underlying operating system; and this is part of the slowness (due to use of JNI). But generation itself is fundamentally slow, even if no system calls were needed. It is quite possible that faster machines might not even speed up performance significantly.

4. Couldn't we just use 'java.util.Random' to speed things up?

But if the "secure" (i.e. "good", meaning one providing good level of randomness) variant is slow, couldn't we use a faster one? Like "java.util.Random"? In fact, didn't numbers above show using it resulting in version as fast as using time-based variant?

Yes, this is possible: and JUG allows you to pass any java.util.Random instance you want (it just defaults to SecureRandom).
But speed is not everything: the main point of UUIDs is the uniqueness, and this is why SecureRandom is used as the default by both JUG and JDK. And while the default pseudo-random number generator that java.util.Random uses is fine for many uses, I am not sure I would trust its simple algorithm for generating UUIDs; even if initialized using system clock (System.currentTimeMillis()). The underlying algorithm just does not have any actual source of randomness beyond initial seed value and I do not trust that its random number distribution, then, is good enough to give strong guarantees of uniqueness (I would love pointers to someone discussing actual measurements or research on this!).

So to me downgrading to using simple text-book pseudo-number algorithms does not seem tempting, even if it speeds things up by an order of magnitude.

5. So what would make more sense?

Well, looking at numbers, how about consider time-based variant? Name-based variant is not quite as useful; not so much due to performance aspects, but since it actually requires you to hand source of uniqueness (name to hash).

6. Thoughts on relevance of above

I was bit surprised by how big the difference actually is. But whether it really matters depends on use case -- oftentimes you don't need even thousands of UUIDs per second (most services can not really serve even one thousands requests per second), so the rate of a quarter million per second is just fine. After all, it only takes about four microseconds (on average) to generate one UUID.

But there are use cases where this matters more; especially if UUIDs are used liberally, for example using UUIDs as identifiers for all business objects generated during processing.
So if your use case truly depends on (or at least benefits from) fast generation of UUIDs, you probably want to use time-based generation method from Jug.

7. How did I do that again?

Ok; since I haven't yet gotten Javadocs set up, here is the way to generate time-based UUIDs.

  // need to pass Ethernet address; can either use real one (shown here)
  EthernetAddress nic = EthernetAddress.fromInterface();
  // or bogus which would be gotten with: EthernetAddress.constructMulticastAddress()
  TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(nic);
  // also: we don't specify synchronizer, getting an intra-JVM syncer; there is
  // also external file-locking-based synchronizer if multiple JVMs run JUG
  UUID uuid = uuidGenerator.generate();

And that's it. Other variants are generated similarly, by creating generator instance with Generators factory.

Monday, June 28, 2010

Async HTTP Client 1.0 released

Ok, this announcement went out last week, but it is important to re-iterate: version 1.0 of Ning async HTTP Client is now out!

This is good news for multiple reasons: obviously ability to do asynchronous (read: non-blocking, i.e. no need to use one thread for each and every open connection) HTTP communication is valuable in itself. But I also hope that this spurs some friendly competition in improving all HTTP-based communication on Java platform -- there are still features that are not implemented, and due to complexity of efficient connection handling, there should be still room for improvement with performance as well. And finally this should lead to wider adoption of this relatively new library, so that it gets properly battle-tested and proven.

I am actually planning on using this client for cases where regular blocking client could work, to see how well it performs and exactly how easy it is to use non-blocking API, compared to existing blocking alternatives.

Tuesday, June 08, 2010

Introducing My Newest Open Source project: Tr13 (that is so 1337!)

I have briefly mentioned that I have been working on a new little low-level data structure package, called (Ning) Tr13, started at GitHub. Now that code is getting ready and tested, it is time to "formally" introduce it.

Tr13 is an interesting project form multiple angles: for one, it is my first "corporate-sponsored" Open Source project (i.e. related to may day-to-day work -- although not focus of it, just a useful thing for me and hopefully my colleagues at Ning). This alone is worth celebrating; the whole process has been remarkably smooth from 15-minute end-to-end approval process (take note Amazon developer community...), tosteady if somewhat slow development itself (there is nothing groundbreaking from CS perspective -- more on this below).
This is also a rare case of my already having reasonably clear idea as to where I would be using this package. It may sound surprising, but truthfully most open source libraries I have written, I have not actually absolutely needed them before writing. :-)
(there are exceptions: Java UUID Generator is something I wrote since I needed it -- but not so for Woodstox, or even Jackson, to some degree, both of which I have extensively used after I wrote them).
And finally, well, it is a change from my more common data format parsing / data binding work. Although I was thinking of "climbing up the stack" to write something higher level, I ended up going the other way, a slight detour.

Anyway, I digress. So...

What exactly is tr13?

It is a space-efficient ("compact") read-only implementation of Trie data structure, with slight modification to use Patricia/Radix structure for leaves (but not for branches, since that did not seem useful for my use cases).

Current implementation supports two kinds of Tries:

  1. Byte[] key, variable integer (~= long) value
  2. Byte[] key, byte[] value

In both cases, byte[] is typically a String, but can be any other serializable thing; for example, ints and longs could be trivially converted to 4/8 byte keys. Separate variant for variable-length integers is further optimization for my initial use case, where values are generally small ints, which can fit in a single byte, but where it is convenient not to be bound by arbitrary limits (like 256).

Internally structure is actually just a regular byte array (or, alternatively, ByteBuffer, to allow for "GC-free" tries!). Data format needs to be documented, but is a rather simple implementation of basic trie ideas, nothing to write a dissertation about.

Why does it Rock?!

Given how good plain old JDK HashMap can be -- you could just use HashMap<String,Integer> for the first case -- what is so special about tr13?

Simply put: memory usage, or lack thereof (ditto for GC activity). This because:

  1. Overall memory usage is pretty tight -- for test data I have (30 million entries of typically 13-byte key, small integer number == 600 meg input file), trie structure actually uses less memory than its file representation (by about 35% less). And,
  2. Since it is all in just TWO objects (raw byte array, wrapper Trie object), it should keep GC rather happy.

This is not coincidental, since the use case we have is that of sizable routing tabls that should ideally be kept fully in-memory. Hence compactness of data structure, with reasonable lookup times (i.e. no linear search), was design criteria. Even other somewhat frugal data structures (like b-trees) are not anywhere as space efficient.

Downsides?

First obvious limitation is that it is a read-only data structure: you must have all the data at hand when building it. Data also has to be ordered, not necessarily alphabetically (although that is the usual way), but in a stable lexicographic ordering, such that build process has invariants it needs. This is not as big a limitation as it might seem, at least when merging functionality (more on this later on) is added. For now it may be necessary to re-build data structure regularly, and keep an additional "short-term" lookup data structure at hand for additions that are needed between re-builds.

As to performancem resulting data structure is not the fastest way to find things: from complexity standpoint it is a linear algorithm, compared to theoretically constant time for hash-based alternatives. This may or may not matter; it should not matter a lot with respect to key length (HashMap has to calculate hash for key and compare key bytes, unless keys can be canonicalized). Another potential performance concern is that trie data structure does not have much locality; but this too is unlikely to be much worse than for typical alternatives. In fact, due to high memory efficiency, and use of plain byte array, it is not guaranteed that HashMap has any better locality, even if a single lookup results in a match (this due Java GC's habit of "shuffling" object data around).

All in all, Tr13 does much more work when finding values, since it traverses data structure once per each key byte (except for Patricia-optimized tail, which may be multi-byte linear match). So I would not expect it to be as fast as direct hash lookups.

But in the end, these are speculations, and the real test would be use.

... or, how about a performance test?

Performance Testing

Ok: the question of efficiency is something worth looking deeper into. And so I wrote a benchmark that compares VIntTrieLookup (byte[]-to-vint trie) against HashMap<String,Integer> (for curious, it's under 'src/java/test/manual/' in source tree). Test loads all the entries in memory, and then performs randomized lookups (one lookup for every key, but order shuffled so as to be different from 'natural' ordering of trie).

When given a subset of the whole data file (turns out HashMap hogs so much memory that I just can't test the whole data set!) -- a 32 meg, 2 million entry chunk -- here are the results:

Memory usage:

  1. Tr13: 20.6 megs (reduction of about one third)
  2. HashMap<String,Integer>: 314 megs

So in this case, HashMap<String,Integer> uses about 15 times more memory than Tr13 (note: Integer values are canonicalized using Integer.valueOf -- without this, usage would be 360 megs).

And what about performance? What is the cost of compactness?

For 3 million lookups, time taken was:

  1. Tr13: about 5000 milliseconds (about 400,000 lookups per second)
  2. HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000 lookups per second)

so HashMap is about 4 times faster than Tr13 for lookups, on my system (2.33 ghz mini-mac, JDK 1.6), for given data set.

Since my first need is to be able to do lookups in memory, performance difference is not of particular importance (there might be one or two lookups per request, so anything in tens of thousands per second would work), but it is good to know rough order of magnitude difference.

What next?

At this point Tr13 should be ready for use, and I plan to release version 0.5 soon. The only thing I am waiting for is to get a Maven repository set up at Sonatype OSS repository (kudos to Sonatype for this free hosting!), so that Maven-based (and I assume, Ivy-based) projects can use it. This is actually what I need for my daytime use.

But beyond Maven deployment, there are some additional features I want to implement in near future, such as:

  • Ability to traverse through entries (Map.Entry<> style)
  • Ability to merge two tries efficiently -- could be built quite easily given iterator functionality
  • ... document the data structure (ok ok, not a development task, but "hard" (read: tedious) yet necessary work) -- it is very simple, but not so simple that reading code would make it obvious.
  • Current implementations are limited to 2 gigs, due to Java byte[] and ByteBuffer limitations. But fundamentally there is nothing particularly hard about writing segmented versions where only segments would have this limitation (well, probably values too, but I don't see any need for 2gb+ values in my tries...)

Merging functionality specifically would be useful for "semi-mutable" lookup data structures. For data that evolves slowly (which is the case for routing table use case I have), it is possible to keep track of needed modifications (in a, say, HashMap), and do primary lookup against that data structure, and only if no match is found, against immutable trie. And when this short-term data structure goes enough, or enough time passes, trie could be re-built with existing and new data. Repeat and rinse.

And perhaps it might even be possible to work some more on fundamental data structure itself, and find ways to make it mutable. Insertions (and deletion) operations would be rather costly, due to global changes needed (another downside of compactness is that changes tend to spill far). But even slow mutability would be better than no mutability; and perhaps there could be ways to trade some space (intentional unused regions within trie, to be used for insertions?) for lesser likelihood of having move all the data around.

Wednesday, April 28, 2010

Making Java Reflection field access 40 times faster

Think it can not be done? Think again!

Here is a useful nugget of information that should be common knowledge among performance-conscious Java developers:but based on code I've seen in the wild it might not yet be, hence this entry.

Basically, if you just do this:

  Field f = MyBean.class.getDeclaredField("id");
  f.set(beanInstance, valueForField);

performance stinks -- setting value is about hundreds of times slower than direct field access. But if you add this little call:

  f.setAccessible(true);

in there, overhead is drastically reduced. In a simple micro-benchmark running on my old desktop speed difference is 40x, and while YMMV, it will be an order of magnitude or more and this ratio does not seem to improve with new JVMs (has been steady with 1.4, 1.5 and 1.6). This actually takes overhead down enough so as to make reflection-based access "fast enough" for most use cases: Jackson for example just uses it and does not bother with byte-code generation to create most optimal access (but who knows? maybe it will one day).

So what is the difference? Access checks that would be done by default are heavy-weight, as can be seen from profiler stack traces and JDK sources. And all setAccessible() method does is set a boolean flag that will by-pass these checks. Works for Methods and Constructors as well, although with slightly less pronounced results.

Anyway, thought others might find this useful.

Sunday, March 21, 2010

Naively Simple, and Very Powerful: WebSockets!

I don't know how I have managed to miss the next "Simple Big Thing", but I noticed that something called WebSockets has been around for months. And yet I only now bumped into (via announcement for Jetty 8, which boasts support for Servlet 3 API as well as WebSockets). Better late than never I guess.

The idea with WS is very simple, and even somewhat elegant: start up with a regular HTTP connect with hand shake, and then convert it to what amounts to a socket. And use simple properties of UTF-8 encoding to provide simple and inexpensive framing mechanism. Start and end markers used are bytes 0x00 and 0xFF which can not be included in UTF-8 encoded textual data (along with couple of other bytes); and this mostly guarantees that you can nicely included structured textual data using XML or JSON. In fact, elegance of the solution reminds me of UTF-8 encoding which also is surprisingly well designed and simple. In fact, ability of WebSockets to do what it is doing is yet another testament for pure goodness of UTF-8 encoding.

Anyway, this is just a heads up, just in case some of you may have missed this important development. Believe me, this is going to be a big thing very soon. Comet begone; WebSockets is the way forward for efficient low-latency stateful interaction!

Thursday, December 17, 2009

On good, efficient data formats

There are 2 fairly recent additions to category of "good binary data formats that work nice with Java" category: Avro and Kryo. I have meant to write something about both for a while.

1. Avro

Avro is a simple and efficient general-purpose data format, developed as part of Hadoop project. Due to its background, it should work very well with Hadoop (and map/reduce systems in general). It is also quite similar to what I had been thinking of implementing for my own (well, my employer's, rather) large-scale data processing needs, when there was no Avro. From my shallow understanding of Avro, it seems to nicely fit the bill for data format for huge sequence of records; but with self-describing property that is sadly lacking from other contestants like Google's protobuf.

I will hopefully have some more tangible notes to add in future: for now it's enough to note that Avro's performance seems to be pretty good, at least in the "thrift-protobuf" benchmark.

2. Kryo

Strictly speaking Kryo is not a data format, but rather Java object serialization framework that happens to define a data format to handle its main task. But since sending POJOs back and forth over the wire is a very common (perhaps the most common) task for data formats, in Java world, this is not a big difference.

First thing I noticed was its good performance (on above-mentioned benchmark). This is nice, since there is already JDK default serialization that performs adequately for most tasks; so anything else that does binary serialization should be able to meet and beat that performance baseline, to be of interest. But as importantly, API seems straight-forward, simple, and adequately customizable.
So I have reasonably high expectations for this library -- it could be nice complement to something like, say, Dirmi (RMI alternative for JDK default one -- should be coupled with alternative, similarly improved, serialization mechanism, n'est pas?).

3. Disclaimer

Alas, I have not made up compelling use case for using these two projects, yet. But given promise they hold, I should be able to test them out come next year.

Tuesday, December 08, 2009

JSON data binding performance (again!): Jackson / Google-gson / JSON Tools... and FlexJSON too

(note: this is a follow-up on an earlier measurements)

1. A New Contestant: FlexJson

After realizing that FlexJson is actually capable of both serialization and deserialization (somehow I thought it would only serialize things), I decided to add it as the fourth contestant in the "full service Java/JSON data binding" category of tests.

Initially I was bit discouraged to find that it makes one rookie mistake: assumes that somehow JSON comes in (and goes out) as Java Strings. But aside from this glitch, package actually looks quite solid -- and its exclusion/inclusion mechanism looks interesting. Maybe not exactly my cup of joe (if it was, after all, Jackson API would look more like it does), but a viable alternative. And I can see how ability to prevent deep copy would come in handy sometimes. And finally, some of the features actually exceed what Jackson can currently do, regarding polymorphic deserialization (since FJ includes class name by default, I assume it can do it) and some level of cyclic-dependency handling (ignoring serialization of cyclic references at least).

So let's see how "rookie" (yes, I know, it's not exactly a new package, just new addition to the test) fares...

2. Test setup

Tests are run using nice Japex performance test framework, running on my somewhat old AMD work station (~1700 Ghz Athlon -- someone needs to click on those right-hand-side ads to get me a new performance-testing work station! :-) ).

Input data used consists of serialization of tabular data (database dump, good old "db100.xml" used by countless xml tests), converted to Java POJOs, and then to individual data formats (here as JSON, but can be tested as XML and whatnot). Document size is 20k in XML, and slightly less in JSON (about 16k). It would be easy to run using other data sets, but in the past, performance ratios for 2k, 20k and 200k documents have not had radical differences, so 20k one seems like a reasonable choice (but note that the earlier benchmark did in fact use 2k documents, so actual numbers do differ).

Test project itself, "StaxBind" is still in Woodstox SVN repository, accessible via Codehaus SVN page. (one of these days I should just create a Github project -- but not today).

Versions of JSON processing packages are as follows:

  • Jackson 1.2.0
  • Google-gson 1.4
  • Json-tools-core 1.7
  • Flexjson-1.9.1

Code for each library is using default settings, and using what appears as the most efficient interface, for cases where transformations are from byte streams on server side (byte streams in, byte streams out).

3. Results

First things first: here's the money shot:

Data Binding Performance Graph

(or check out the full results for details)

Another way to represent results is by showing performance ratios, using the slowest implementation as base line (TPS == transactions per second; number of times a 20k document is read, written, or both):

(note: Jackson/manual is omitted since it is hand-written (if simple) serializer/deserializer, and there are no direct counterparts for other packages -- while it would give even bigger faster-than-thou ratio, it wouldn't be a fair comparison)

Impl Read (TPS) Write (TPS) Read+Write (TPS) R+W, times baseline
Jackson (automatic) 1599.272 2463.097 1033.809 25.6
FlexJson 125.277 125.277 94.904 2.35
Json-tools 94.051 126.954 49.008 1.2
GSON 56.58 112.455 40.38 1

So looks like our "new kid on the block" manages to outperform the other two non-Jackson JSON processors here. And at least get within an order-of-magnitude with Jackson... :-)

4. Musings

So it turns out that despite its interfacing (those String/byte conversions), Flexjson package manages to work more efficiently than some other packages that claim "simplicity and performance". And this without actually claiming to be particularly performant, but rather focusing on design of API and ease-of-use aspects. Pretty neat, I respect that.

5. Next?

My current main interest (with respect to performance issues) lie in the area of compressing data for transfer: after all, most of the time there is relative abundance of CPU power compared to available network and I/O bandwidth. This means that trading some CPU (needed for compression and decompression) seems like a bargain for many use cases.

But on the other hand, as we saw earlier, the question is "how much is too much". And that's where my new favorite simple-and-fast algorithm, LZF, comes in. But that's a different story.

Related Blogs

(by Author (topics))

Powered By

Powered by Thingamablog,
Blogger Templates and Discus comments.

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.