Tuesday, February 15, 2011

Basic flaw with most binary formats: missing identifiable prefix (protobuf, Thrift, BSON, Avro, MsgPack)

OK: I admit that I have many reservations about existing binary data formats, and this is a major reason why I worked on the Smile format specification -- to develop a format that addresses the various deficiencies I have observed.

But while the full list of grievances would be long, I realized today that there is one basic design problem common to pretty much all of these formats -- at least Thrift, protobuf, BSON and MsgPack -- namely the lack of any kind of reliable, identifiable prefix. Commonly used techniques like the "magic number", which allows reliable type detection for things like image formats, appear to be unknown to binary data format designers. This is a shame.

1. The Problem

Given a piece of data (file, web resource), one important piece of metadata is its structure. While this is often available explicitly from the context, that is not always the case; and even where it could be added, there are benefits to being able to detect the type automatically: it can significantly simplify systems, or extend functionality by accepting multiple kinds of formats. Various graphics programs, for example, can operate on different image storage formats without necessarily having any metadata available beyond the actual data.

So why does this matter? It helps in verifying basic correctness of interaction in many cases: if you can detect what is and what is not a valid piece of data in a format, life is much easier. You have a chance to know immediately when a piece of data is completely corrupt, or when you are being fed data in some format other than the one you expect. Or, if you support multiple formats, you can add automatic handling of differences.
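To make the "magic number" idea concrete, here is a minimal sketch of how a program might sniff an image type from nothing but the leading bytes. The function name `sniff_image_type` and the table layout are my own illustration; the byte signatures themselves are the well-known ones for PNG, GIF and JPEG.

```python
# Well-known leading byte signatures ("magic numbers") of common image formats.
IMAGE_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),  # 8-byte PNG signature
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),      # JPEG start-of-image marker plus next marker byte
]


def sniff_image_type(data: bytes):
    """Return the image type implied by the leading bytes, or None if unknown."""
    for signature, name in IMAGE_SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

No file name, MIME type or other context is needed: a caller can pass in the first few bytes of any blob and either get a format name back or learn right away that the data is not what it expected.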

2. Textual formats do it well

But let's go back to commonly used textual data formats: XML and JSON. Of these, XML specifies an "XML declaration" which can be used to determine not only the text encoding used (UTF-8 etc.) but also the fact that the data is XML. It is cleanly designed and simple to implement. As if it was designed by people who knew what they were doing.

JSON does not define such a prefix, but the specification does give exact rules for detecting valid JSON, as well as the encodings that can be used; so in practice JSON auto-detection is as easy to implement as that for XML.
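As a rough sketch of how little code this takes: the JSON specification (RFC 4627, section 3) notes that the first two characters of a JSON text are always ASCII, so the pattern of zero bytes in the first four octets reveals the Unicode encoding. The function names below are hypothetical, and the XML check is deliberately simplified to ASCII-compatible encodings.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from the zero-byte pattern of its first 4 bytes."""
    if len(data) < 4:
        return "utf-8"  # simplification: too short to apply the 4-byte rule
    b0, b1, b2, b3 = data[:4]
    if b0 == 0 and b1 == 0 and b2 == 0:
        return "utf-32-be"  # 00 00 00 xx
    if b0 == 0 and b2 == 0:
        return "utf-16-be"  # 00 xx 00 xx
    if b1 == 0 and b2 == 0 and b3 == 0:
        return "utf-32-le"  # xx 00 00 00
    if b1 == 0 and b3 == 0:
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx


def has_xml_declaration(data: bytes) -> bool:
    """True if the data starts with an XML declaration (ASCII-compatible encodings only)."""
    return data.startswith(b"<?xml")
```

A fuller detector would also handle byte-order marks and the 16- and 32-bit forms of the XML declaration, but the principle is the same: a handful of byte comparisons on the start of the document.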

3. But most new binary formats don't

Now, the task of defining a unique (enough) header for binary formats would be even easier than for textual formats, because structurally there is less variance: no need to allow variable text encodings, arbitrary white space, or other lexical sugar. It took me very little time to figure out the simple scheme used by Smile to indicate its type (which was itself inspired by the design of the PNG image format, an example of very good data format design).
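For illustration, here is a sketch of what checking the Smile header could look like. It assumes the layout as I read the Smile specification: three fixed bytes (`:`, `)`, linefeed) followed by one byte whose upper four bits carry the version and whose low bits are feature flags. The function name and dictionary keys are made up for this example, and the flag meanings should be checked against the specification before relying on them.

```python
SMILE_MAGIC = b":)\n"  # bytes 0x3A 0x29 0x0A


def parse_smile_header(data: bytes):
    """Return header info if data starts with a Smile header, else None."""
    if len(data) < 4 or not data.startswith(SMILE_MAGIC):
        return None
    info = data[3]
    return {
        "version": info >> 4,                         # upper 4 bits
        "raw_binary": bool(info & 0x04),              # assumed flag bit
        "shared_string_values": bool(info & 0x02),    # assumed flag bit
        "shared_property_names": bool(info & 0x01),   # assumed flag bit
    }
```

Like the PNG signature, the prefix mixes printable characters with a linefeed, so it is both easy to recognize in a hex dump and unlikely to appear at the start of ordinary text.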

So you might think that binary formats would excel in this area. Unfortunately, you would be wrong.

As far as I can see, the following binary data formats have little or no support for type detection:

  • Thrift does not seem to have a type identifier at its format layer. There is actually a small amount of metadata at the RPC level (there is a message-start structure of some kind), but this only helps if you want/need to use Thrift's RPC layer. Another odd thing is that the internal API actually exposes hooks that could be used to handle type identifiers; it is as if the designers were at least aware of the possibility of using some markers to enclose main-level data entities.
  • protobuf does not seem to have anything to allow type detection of a given blob of protobuf data. I guess protobuf never claimed to be useful for anything beyond tightly coupled low-level system integration (although some clueless companies are apparently using it for data storage... which is just a plain old Bad Idea), so maybe I could buy the argument that this is just not needed, that there is never any "arbitrary protobuf data" around. Still... adding a tiny bit of redundancy would make sense for diagnostic purposes; and given that protobuf already has some redundancy (field ids, instead of relying on ordering), it would seem acceptable to use the first 2 or 4 bytes for this.
  • MsgPack and BSON both just define a "raw" encoding, without any format identifier that I can see. This is especially puzzling since, unlike protobuf and Thrift, they do not require a schema to be used; that is, they already carry plenty of other metadata (types, names of struct members; even length prefixes). So why make these data formats completely unidentifiable?
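To show what is left when there is no identifier, here is a sketch of the best a detector can do for BSON: a plausibility check built on the structural facts that a BSON document starts with its total length as a little-endian 32-bit integer and ends with a zero byte. The function name `plausibly_bson` is hypothetical, and this is a heuristic, not real detection.

```python
import struct


def plausibly_bson(data: bytes) -> bool:
    """Heuristic only: does the length prefix match and does the data end with 0x00?"""
    if len(data) < 5:  # the smallest document is 4 length bytes plus the terminator
        return False
    (declared_length,) = struct.unpack_from("<i", data, 0)
    return declared_length == len(data) and data[-1] == 0


# An empty BSON document passes, as it should...
print(plausibly_bson(b"\x05\x00\x00\x00\x00"))
# ...but so does this blob, whose element type byte 0xFF is not a valid BSON type.
print(plausibly_bson(b"\x06\x00\x00\x00\xff\x00"))
```

The check only works when the exact extent of the document is already known, and it still accepts data that is not BSON at all. A fixed prefix of a few bytes would avoid both problems.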

4. But what about Avro?

There is one exception aside from Smile, however. Avro seems to do the right thing (as far as I can read the specification) -- at least when explicitly storing Avro data in a file (I assume this includes map/reduce use cases, with data stored in HDFS): there is a simple prefix to use, as well as a requirement to store the schema used. This makes sense, since my biggest concern with formats like protobuf and Thrift is that, being "schema-ridden", their data is all but useless without the schema. Requiring that the two are bundled when stored makes sense; optimizations can be used for transfer.

So Avro definitely seems a better design than the 4 other binary data formats listed above in this respect.
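As a small sketch of what that prefix buys you: Avro object container files begin with the four bytes `O`, `b`, `j` and 0x01, so recognizing one is a single comparison. The function name is my own.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def is_avro_container(data: bytes) -> bool:
    """True if the data starts with the Avro object container file magic."""
    return data.startswith(AVRO_MAGIC)
```

Note that this applies only to the container file format; a bare Avro-encoded datum sent over the wire carries no such marker.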

5. Why do I care?

As part of my ongoing expansion of Jackson ("the universal data processor"), I am thinking of adding many more backends (to support reading and writing data in alternate data formats), to allow clean and efficient data binding to/from almost any commonly used data format. Ideally this would include binary data formats. Current plans are to include format detection functionality in such a way that new codecs can detect data they are capable of reading and writing; and this will work just fine for most existing formats that Jackson can handle (JSON, Smile, XML). I also assumed that, since it would be very easy to design data formats that can be reliably detected, existing formats would be a piece of cake to detect. It is only when I started digging into the details of binary data formats that the sad reality sank in...

On the plus side, this makes it easier to focus on adding first-rate support for data formats that are easy to detect. So I will probably prioritize Avro compatibility significantly higher than the others; and I will unfortunately have to lower the priority of my work on adding Thrift support, which would otherwise be the most important "alien" format to support (due to existing use by infrastructure I am working on).
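The kind of pluggable detection described above could be sketched roughly as follows. This is an illustration in Python rather than Jackson's actual design or API: the `Match` levels, the detector functions and `detect_format` are all hypothetical names, and the JSON and XML checks are simplified to ASCII-compatible encodings.

```python
from enum import IntEnum


class Match(IntEnum):
    NONE = 0  # definitely not this format
    WEAK = 1  # could be this format
    FULL = 2  # an unambiguous marker was found


def detect_smile(prefix: bytes) -> Match:
    return Match.FULL if prefix.startswith(b":)\n") else Match.NONE


def detect_xml(prefix: bytes) -> Match:
    if prefix.startswith(b"<?xml"):
        return Match.FULL
    return Match.WEAK if prefix.lstrip().startswith(b"<") else Match.NONE


def detect_json(prefix: bytes) -> Match:
    # Assumes a JSON text starts with an object or array, after optional whitespace.
    return Match.WEAK if prefix.lstrip()[:1] in (b"{", b"[") else Match.NONE


# Each codec registers a detector; new formats are added by appending here.
DETECTORS = [("smile", detect_smile), ("xml", detect_xml), ("json", detect_json)]


def detect_format(prefix: bytes):
    """Return the name of the best-matching format, or None if nothing matches."""
    best_name, best_strength = None, Match.NONE
    for name, detector in DETECTORS:
        strength = detector(prefix)
        if strength > best_strength:
            best_name, best_strength = name, strength
    return best_name
```

A format with a real prefix gets a trivial detector that returns a full match. A format without one has nothing to return except "none" or a guess, which is exactly the problem with most of the binary formats above.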

About me

  • I am known as Cowtowncoder