Tuesday, February 15, 2011

Basic flaw with most binary formats: missing identifiable prefix (protobuf, Thrift, BSON, Avro, MsgPack)

Ok: I admit that I have many reservations regarding many existing binary data formats; and this is a major reason why I worked on the Smile format specification -- to develop a format that tries to address various deficiencies I have observed.

But while the full list of grievances would be long, I realized today that there is one basic design problem that is common to pretty much all formats -- at least Thrift, protobuf, BSON and MsgPack -- that is: lack of any kind of reliable, identifiable prefix. Commonly used techniques like the "magic number", which allows reliable type detection for things like image formats, appear to be unknown to binary data format designers. This is a shame.

1. The Problem

Given a piece of data (file, web resource), one important piece of metadata is its structure. While this is often available explicitly from the context, this is not always the case; and even if it could be added, there are benefits to being able to automatically detect type: this can significantly simplify systems, or extend functionality by allowing multiple kinds of formats to be accepted. Various graphics programs, for example, can operate on different image storage formats, without necessarily having any metadata available beyond just the actual data.

So why does this matter? It helps in verifying basic correctness of interaction in many cases: if you can detect what is and what is not a valid piece of data in a format, life is much easier: you have a chance to know immediately when a piece of data is completely corrupt, or when you are being fed data in some format other than the one you expect. Or, if you support multiple formats, you can add automatic handling of differences.
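To make the "magic number" idea concrete, here is a minimal sketch of how an image loader might sniff the type of a blob from its first bytes. The signatures are the well-known leading bytes of PNG, GIF and JPEG; the names `IMAGE_SIGNATURES` and `sniff_image_type` are made up for this illustration and do not come from any particular library.

```python
# Leading byte signatures ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "jpeg": (b"\xff\xd8\xff",),
}


def sniff_image_type(data: bytes):
    """Return the name of the first format whose signature matches, else None."""
    for name, signatures in IMAGE_SIGNATURES.items():
        # bytes.startswith accepts a tuple of alternative prefixes
        if data.startswith(signatures):
            return name
    return None
```

The point is not the handful of lines of code, but that the check needs no context at all: no file name, no content type, no schema.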

2. Textual formats do it well

But let's go back to commonly used textual data formats: XML and JSON. Of these, XML specifies "xml declaration" which can be used to not only determine text encoding (UTF-8 etc) used but also the fact that data is XML. It is cleanly designed and is simple to implement. As if it was designed by people who knew what they were doing.

JSON does not define such a prefix, but specification does specify exact rules for detecting valid JSON, as well as encodings that can be used; so in practice JSON auto-detection is as easy to implement as that for XML.
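As an illustration of how little is needed, here is a sketch of the encoding detection rule from the JSON specification (RFC 4627, section 3): since a JSON text starts with two ASCII characters, the pattern of zero bytes in the first four octets reveals the Unicode encoding. The function name `detect_json_encoding` is hypothetical, and the sketch assumes there is no byte-order mark in front of the content.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets."""
    if len(data) < 4:
        # Anything this short cannot be UTF-16 or UTF-32 encoded "{}" or "[]"
        return "utf-8"
    null_pattern = tuple(b == 0 for b in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # No recognizable null pattern means UTF-8
    return patterns.get(null_pattern, "utf-8")
```

For example, `detect_json_encoding('{"a":1}'.encode("utf-16-le"))` sees the bytes `7b 00 22 00` and reports `"utf-16-le"`.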

3. But most new binary formats don't

Now; the task of defining a unique (enough) header for binary formats would be even easier than that for textual formats, because structurally there is less variance: no need to allow variable text encoding, arbitrary white space, or other lexical sugar. It took me very little time to figure out the simple scheme used by Smile to indicate its type (which in itself was inspired by the design of the PNG image format, an example of very good data format design).
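For the curious, the Smile header is four bytes: the three characters of a smiley followed by a linefeed (`:)\n`), then one byte that carries a version number in its upper four bits and feature flags in its lower four. A rough sketch of checking for it could look like the following; `parse_smile_header` and the shape of its return value are invented for this example, and the flags are left undecoded.

```python
SMILE_PREFIX = b":)\n"  # 0x3A 0x29 0x0A


def parse_smile_header(data: bytes):
    """Return version and raw flag bits if data starts with a Smile header, else None."""
    if len(data) < 4 or not data.startswith(SMILE_PREFIX):
        return None
    info = data[3]
    return {
        "version": info >> 4,   # upper 4 bits: format version
        "flags": info & 0x0F,   # lower 4 bits: feature flags, not decoded here
    }
```

Note how the linefeed plays the same role as the line-ending bytes in the PNG signature: it makes mangling by text-mode transfers easy to notice.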

So you might think that binary formats would excel in this area. Unfortunately, you would be wrong.

As far as I can see, the following binary data formats have little or no support for type detection:

  • Thrift does not seem to have a type identifier at its format layer. There is actually a small amount of metadata at RPC level (there is a message-start structure of some kind), but this only helps if you want/need to use Thrift's RPC layer. Another odd thing is that the internal API actually exposes hooks that could be used to handle type identifiers; it is as if the designers were at least aware of the possibility of using some markers to enclose main-level data entities.
  • protobuf does not seem to have anything to allow type detection of a given blob of protobuf data. I guess protobuf never claimed to be useful for anything beyond tightly coupled low-level system integration (although some clueless companies are apparently using it for data storage... which is just a plain old Bad Idea), so maybe I could buy the argument that this is just not needed, that there is never any "arbitrary protobuf data" around. Still... adding a tiny bit of redundancy would make sense for diagnostics purposes; and given that protobuf already has some redundancy (field ids, instead of using ordering) it would seem acceptable to use the first 2 or 4 bytes for this.
  • MsgPack and BSON both just define a "raw" encoding, without any format identifier that I can see. This is especially puzzling since unlike protobuf and Thrift, they do not require a schema to be used; that is, they have plenty of other metadata (types, names of struct members; even length prefixes). So why make these data formats completely unidentifiable?

4. But what about Avro?

There is one exception aside from Smile, however. Avro seems to do the right thing (as far as I can read the specification) -- at least when explicitly storing Avro data in a file (I assume including map/reduce use cases, stored in HDFS): there is a simple prefix to use, as well as a requirement to store the schema used. This makes sense, since my biggest concern with formats like protobuf and Thrift is that being "schema-ridden", data without schema is all but useless. Requiring that the two are bundled -- when stored -- makes sense; optimizations can be used for transfer.

So Avro definitely seems like a better design than the 4 other binary data formats listed above in this respect.
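The prefix in question is the 4-byte magic of the Avro object container file format: the ASCII characters `Obj` followed by the byte 1. A minimal sketch of checking a stream for it is below; `is_avro_container` is a name made up here, and the sketch only looks at the magic, not at the schema metadata that follows it.

```python
import io

AVRO_CONTAINER_MAGIC = b"Obj\x01"


def is_avro_container(stream) -> bool:
    """Check whether a binary stream starts with the Avro container file magic."""
    head = stream.read(len(AVRO_CONTAINER_MAGIC))
    return head == AVRO_CONTAINER_MAGIC
```

Usage is as simple as `is_avro_container(io.BytesIO(data))`, or passing a file opened in binary mode.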

5. Why do I care?

As part of my on-going expansion of Jackson ("the universal data processor"), I am thinking of adding many more backends (to support reading and writing data in alternate data formats), to allow clean and efficient data binding to/from most commonly used data formats. Ideally this would include binary data formats. Current plans are to include format detection functionality in such a way that new codecs can detect data they are capable of reading and writing; and this will work just fine for most existing formats that Jackson can handle (JSON, Smile, XML). I also assumed that since it would be very easy to design data formats that can be reliably detected, existing formats should be a piece of cake to detect. It is only when I started digging into details of binary data formats that the sad reality sank in...
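To give a rough idea of what such pluggable detection could look like, here is a sketch in which each codec contributes a detector that grades how well the first bytes match its format, and the strongest match wins. This is purely illustrative and is not Jackson's API: `Match`, the `detect_*` functions, `DETECTORS` and `detect_format` are all hypothetical names, and the JSON and XML checks are deliberately simplistic.

```python
from enum import IntEnum
from typing import Optional

WHITESPACE = b" \t\r\n"


class Match(IntEnum):
    NO_MATCH = 0
    SOLID = 1   # plausible start of the format, but not a unique signature
    FULL = 2    # explicit signature found


def detect_smile(head: bytes) -> Match:
    return Match.FULL if head.startswith(b":)\n") else Match.NO_MATCH


def detect_xml(head: bytes) -> Match:
    if head.startswith(b"<?xml"):
        return Match.FULL
    if head.lstrip(WHITESPACE).startswith(b"<"):
        return Match.SOLID
    return Match.NO_MATCH


def detect_json(head: bytes) -> Match:
    first = head.lstrip(WHITESPACE)[:1]
    return Match.SOLID if first in (b"{", b"[") else Match.NO_MATCH


# Each codec registers a (name, detector) pair
DETECTORS = [("smile", detect_smile), ("xml", detect_xml), ("json", detect_json)]


def detect_format(head: bytes) -> Optional[str]:
    """Return the name of the format with the strongest match, or None."""
    best_name, best = None, Match.NO_MATCH
    for name, detector in DETECTORS:
        strength = detector(head)
        if strength > best:
            best_name, best = name, strength
    return best_name
```

A blob like `b"\x08\x96\x01"` (a perfectly valid tiny protobuf message) gets `None` here, and there is nothing a protobuf detector could add that would reliably change that, which is exactly the problem.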

On plus side, this makes it easier to focus on adding first rate support for data formats that are easy to detect. So I will probably prioritize Avro compatibility significantly higher than others; and I will unfortunately have to downgrade my work on adding Thrift support which would otherwise be the most important "alien" format to support (due to existing use by infrastructure I am working on).

Wednesday, June 30, 2010

No Comments

Ok, another minor but important change that just occurred is that at present there is unfortunately no way to comment on my writing style, content, or anything else, on this blog.

Believe it or not, this is not due to my sensitive skin or low self-esteem. It is actually due to the surprising sticker shock I got when looking into the actual cost of continuing with my formerly-free comment-extension provider. As I mentioned earlier, I am not against paying for things I use; and I can also get over the immediate knee-jerk reaction to what may appear as bait-n-switch. So it is not about unwillingness to pay. But I am not willing to pay more for a simple comment extension than I pay for hosting the whole thing (including shell access, plenty of storage and transfer space) -- this just does not seem like a reasonable value proposition. As a rough estimate of what I would pay, a suitable price should be something similar to the lowest fee one has to pay for a custom social network at Ning (20$ / year). That I would be ok with.

So, until I figure out suitable replacement, you will just have to bite your tongue, or send comments directly to me (both of which you are welcome to do! :-) ).

ps. If your comment would be about a free commenting system I could use, please do send it to me -- contact info is available via link next to "About me".

Friday, March 26, 2010

Welcome HP, So Long Dell (and don't let the door hit you in the ass on your way out)

(warning: this is another rant. Sorry!)

Here's another improvement in my daily life: after more than a year of space-shuttle-lift-off noise, short-but-brutish uptimes, and countless curses, family's lean old Dell XP "work"station is out for good. Its only agreeable attribute was its slim neat looks (and sort of neat mechanism used for case, allowing its easy opening -- too bad there's not much to do even if you can open it easily). Good riddance, music to my ears.

The replacement, HP Pavilion Slimline, actually looks every bit as good as its predecessor. But otherwise the two are polar opposites: new box is quiet, as reliable as expected (i.e., "just works"), and its only design flaw is that it comes with an OS written by a company based in Redmond. But that I can live with, since it's not my work machine. :-)
And even Windows seems to have improved a bit between versions (new one has Windows 7, previous one whatever preceded Vista).

Anyway: I just thought I'll share my distaste with Dell products (maybe I should actually include "not-so-good" lists on my semi-professional home page?) now that I am getting rid of them.

It all started couple of years ago, when I decided to stop wasting my time on building my own PCs from components (which made sense after college, could save some money). I figured that with time I spent building PCs, and then troubleshooting problems with components, it just didn't make a whole lot of sense. And so I thought I'd go with something that other customers in general had found usable: back then Dell had highest customer ratings of all PC companies; and save for one friend of mine (who had already fought with Dell's phone "support" people, due to problems with memory chips that were failing; and that no amount of rebooting would ever fix), I wasn't aware of huge problems with the company or its products.

What I found out by experience makes me suspect that the company that had gotten good reviews had been abducted by aliens, and replaced by an ersatz replica or something. Correlation between happy customers and the company that produced the crap I bought just is not there. I am not talking about customer support (no point calling them wrt. a badly designed piece of hardware, IMO; it is not something a script-reading underpaid remote helper can help a lot with), but rather about quality of hardware. My experience beyond the PC fiasco was that their products are competitively priced, but have low quality. For example, the laser printer that I bought to replace a trusty old Apple writer (which, after having been bought second hand, served us for 8 years, for a total lifetime of probably 15 years; and would have worked well but I couldn't find new toner cartridges for reasonable prices any more!) was inexpensive, and worked fine for a while. Like, maybe a year. And then broke down. The only things left are LCD monitors, which I have to admit were reasonably priced, and still work. In fact both are still in active use. So I guess they do produce something other than lemons.

Thinking about that last sentence: I guess I could put my feelings into fitting slogan: Dell -- General Motors of Computers.
Feel free to quote.

ps. I am happy to admit that after kicking that incompetent CEO of theirs out, HP seems to have done nice comeback. Good for them, and us.

Monday, February 22, 2010

Fool's Gold, Standard(s)

Here's something new: some good reading ("Ron Paul's money plan is far from golden") at CNN (sic!): this time about the nostalgic folly of returning to the "gold standard". It is surprising that someone whose intellectual aspirations are a bit above those of his supporters (ok, granted, that's a low bar) would be so mistaken about the realities of tying a national currency to the amount of precious metal(s) the central bank physically has. Maybe this is why central banks are generally led by people with economic education and experience, and not physicians.

I mean, yes, from a layman's perspective, it would seem nice if that green paper that gets printed on would actually have collateral. But the impracticality of full collateralization should be obvious: you don't need much of a thought-exercise to see how and why it would fail; and from that point on, to backtrack and see why this realization (when shared by people who control the flow of money) means that the attempt would be a self-fulfilling failure. And if we were unlucky, a slowly cooking but colossal-cluster-magnitude failure.

In addition to the Great Depression that is obviously mentioned in the article, proponents of "strong currency" managed to starve millions of people to death during the late 1800s. I am most familiar with the famines in Finland (there were 2 instances): globally speaking these were just blips on the radar (since the whole country's population was barely in the millions), but the death rate from starvation actually exceeded that of the world wars... and all that so that the central bank could protect the value of the currency, by not loaning money (or subsidizing seeds), managing to keep the central bank in the black, and peasants hungry or dead. The famine was originally triggered by weather, of course, but the catastrophe could have been averted by government action. And in a similar vein, in more recent memory, the depression of the early 90s (in Finland) was also deepened by a later crop of strong currency proponents, who tried (ultimately in vain) to keep the currency strong by trying to avoid devaluation. In the end they had to let it float anyway (causing a run-off devaluation of something like 30% in a week), but so late that much of the damage was already done. Fortunately no one starved to death on account of this failure, although the unemployment rate tripled, to close to 20%.

I am sure there are many more examples; and some EU countries are currently experiencing related challenges (now that they are forced to exercise certain discipline after screwing up their finances before realizing it must be done).

These examples are closely related to "gold standard" part, in that there is simplistic view of nations having to balance their check books on very short term. This is neither practical nor beneficial. And trying to force it to be done does not make it any more practical, beneficial or wise.

And yet -- it seems that principled fools never let facts get in the way of intuitive theories. So I am just waiting for a grand unified theory that binds together the ideas of tax cuts for the rich, return to the gold standard, and the idea that poor people caused the depression (due to welfare costs allegedly being a major contributor to this whole meltdown -- don't ask me how the mechanism is supposed to play; apparently this claim is getting some consideration in tea bagger circles).

Saturday, December 19, 2009

Could you please tell me some more about athletes' marital problems, CNN?

It is an unfortunate fact of life that "news" services in the US are in a sorry, tepid state; and to get decent news coverage one has to use better international sources (BBC, or any European agency), or turn to non-daily/non-TV alternatives (magazines, which still offer reasonable in-depth coverage). But this on-going idiotic episode with a celebrity golf player's domestic issues takes the cake as the low point for this decade (maybe competing with the media's uncritical bashing of UN Iraq nuclear inspectors back in 2002 -- but I digress).

1. What could POSSIBLY be more important issue?

But hey, there have been recent orgies of less relevant news (did Michael Jackson's or Anna Nicole Smith's deaths really warrant being top news entries?). Why is this any different? Aside from being even less relevant than anything comparable in recent history -- honestly, gossip pages, or perhaps the sports section (... which is a ridiculously inflated part of local newspapers and TV programmes, anyway...) would have been better placements; and for respectable publications, possibly not even those -- there is the fact that there have actually been lots of newsworthy things to write about.

Like, say, that gathering of world leaders in Copenhagen; discussing urgent (and eventually life-and-death) matters of saving the world. And in the domestic section, well, there's plenty of economic stuff to write about, or the thing about the medical industry and insurance. Oh, and hey, wasn't there a war of sizable proportions also going on (actually, two, but who's counting).

In fact, I can't think of a reason for this even ranking on page 7 of the Thursday edition of the local newspaper. There are tabloids, after all, that could cover this stuff. Well, except that in the US, it's not "newspapers vs tabloids"; it's mainstream (tabloid level) and fringes ("news of the world"). Even mainstream sells manufactured controversies (trademark of tabloids in other countries) and social porn.

And yet, somehow what irritates me most is that I noticed that CNN followed up on this stupid episode like a hawk; as if it really was a major story.

2. What did that "N" originally mean?

So why pick on CNN? After all, CNN is to News what MTV is to Music -- a sad, irrelevant misnomer. Ted Turner would be rolling in his grave were he not alive. I guess it has more to do with the fact that CNN is ostensibly in the news business. Newspapers and most other networks are in the general "media" business; they are also News dilettantes, spewing some amateur-level newsy stuff. But clearly TV networks are more into general entertainment; and newspapers into advertising with some commentary columns (well, actually, they also do do some local news stuff -- useful and sometimes noteworthy -- maybe I am being too harsh -- but only local, seldom even reaching to regional level).

So it's that when even entities that claim to do News fail to do that, well, that's pathetic.

3. Message to mr. Woods

Ok; enough ranting about the sad state of US media. But here's a personal message for the nominal cause of this red herring of a news story: Tiger, go stuff that golf club up your ass. Sideways. I don't care about your business (personal or otherwise) -- but it appears that your messy business has suddenly become my business. Stop it. Go, disappear. And for crying out loud, don't cry out loud in public. It is so pathetically unmanly that I feel nauseous. So, grow a spine (a pair you apparently already have). Whatever else you do, do NOT cause more media events. You are rich enough to afford to do whatever that other stupid athlete did after murdering his wife (oh hey, yeah, come to think of that, do not do what that guy did in the end -- just the initial part of trying to keep a low profile).

Thursday, December 03, 2009

Milk of Human Madness, Jule-tide edition

Ok, in between technical time, it's time to review some goofy stuff while we wait for Santa. Here goes...

1. Can't manage to find time to do something useful...

yet have plenty of time for "time management"?

Sound silly? Have a look at Pomodoro Technique. Great for giggles, as a case study for human insanity.
But if it starts to make some sense at any point, do not hesitate to get some professional help. Immediately.

But then again, there are always some co-workers who might benefit others by such techniques: by not having time to do anything, they could not make mistakes. And that's worth something too (brakes for loose cannons).

update: the above comments relate only to application of said technique(s) to software development -- maybe other domains could benefit from intrusive regularly-scheduled interruptions (perhaps augmented by electrical shocks)

2. IRC? Yes, that thing hackers use when they don't want to be overheard!

Oh yes, you can always trust Numb3rs to get technical things FUBAR. Funny stuff.

Now, if you will excuse me, I will have to disconnect from my blog server before connection can be traced by FBI (it's that 30 second rule you may know from movies -- must triangulate fast -- gotta go!)

Thursday, July 09, 2009

Are GAE developers a bunch of

ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: the API is a mess, and the concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shaken my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right the second time around. No such luck: Google App Engine was a stillbirth, plagued by problems very similar to Android's. Most specifically, a significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by a different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to GAE team). But this is just very ridiculous (dare I say, retarded?) argument, because:

  1. Being but an API package, there is no functionality that could possibly have security implications (yes, I know exactly what is within those few classes -- the only actual code is for implementation discovery, which was copied from SAX), and
  2. If there are problems with implementations of the API (which should be irrelevant, but humor me here), same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about package in question. I could as well ask about regular expression or image processing packages it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is little bit more pressure in fixing these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both doing "their own things", although those are very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering a low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc), but not a managed runtime execution service).

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.