Thursday, December 17, 2009

On good, efficient data formats

There are 2 fairly recent additions to category of "good binary data formats that work nice with Java" category: Avro and Kryo. I have meant to write something about both for a while.

1. Avro

Avro is a simple and efficient general-purpose data format, developed as part of Hadoop project. Due to its background, it should work very well with Hadoop (and map/reduce systems in general). It is also quite similar to what I had been thinking of implementing for my own (well, my employer's, rather) large-scale data processing needs, when there was no Avro. From my shallow understanding of Avro, it seems to nicely fit the bill for data format for huge sequence of records; but with self-describing property that is sadly lacking from other contestants like Google's protobuf.

I will hopefully have some more tangible notes to add in future: for now it's enough to note that Avro's performance seems to be pretty good, at least in the "thrift-protobuf" benchmark.

2. Kryo

Strictly speaking Kryo is not a data format, but rather Java object serialization framework that happens to define a data format to handle its main task. But since sending POJOs back and forth over the wire is a very common (perhaps the most common) task for data formats, in Java world, this is not a big difference.

First thing I noticed was its good performance (on above-mentioned benchmark). This is nice, since there is already JDK default serialization that performs adequately for most tasks; so anything else that does binary serialization should be able to meet and beat that performance baseline, to be of interest. But as importantly, API seems straight-forward, simple, and adequately customizable.
So I have reasonably high expectations for this library -- it could be nice complement to something like, say, Dirmi (RMI alternative for JDK default one -- should be coupled with alternative, similarly improved, serialization mechanism, n'est pas?).

3. Disclaimer

Alas, I have not made up compelling use case for using these two projects, yet. But given promise they hold, I should be able to test them out come next year.

blog comments powered by Disqus

Sponsored By


Related Blogs

(by Author (topics))

Powered By

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.