Wednesday, June 30, 2010

No Comments

Ok, another minor but important change that just occurred: at present there is, unfortunately, no way to comment on my writing style, content, or anything else on this blog.

Believe it or not, this is not due to my sensitive skin or low self-esteem. It is actually due to the surprising sticker shock I got when looking into the actual cost of continuing with my formerly-free comment-extension provider. As I mentioned earlier, I am not against paying for things I use; and I can also get over the immediate knee-jerk reaction to what may appear to be a bait-and-switch. So it is not about unwillingness to pay. But I am not willing to pay more for a simple comment extension than I pay for hosting the whole thing (including shell access, plenty of storage and transfer space) -- that just does not seem like a reasonable value proposition. As a rough estimate of what I would pay, a suitable price would be something similar to the lowest fee one has to pay for a custom social network at Ning ($20 / year). That I would be ok with.

So, until I figure out a suitable replacement, you will just have to bite your tongue, or send comments directly to me (both of which you are welcome to do! :-) ).

ps. If your comment would be about a free commenting system I could use, please do send it to me -- contact info is available via the link next to "About me".

Tuesday, June 29, 2010

Experiments in advertising, here goes nothing (aka Welcome, AdBrite!)

Ok, let's talk about something that is quite visible to you, dear readers, but that you have probably managed to ignore automatically. Yes, I am talking about those commercial decorations in the margins of these pages. But please, don't change the channel quite yet. :-)

1. Advertising Changes... yay!

So what's up there? After being a very small AdSense publisher for a few years, I figured that I might well retire before ever seeing another check for ads displayed on this blog; so it might be time to explore other options: if not to get higher yields, then at least maybe to get more interesting ads. I also generally root for underdogs, and at this point Google is the ultimate uber-dog if there ever was one. So why not partner up with some other advertising puppies?

Given these loose goals, about the only criterion for finding a replacement was that it not be Google. And, well, ideally it should not be Apple, and preferably not Microsoft. But the latter two are negotiable constraints (in fact, I am tempted to check out M$'s PubCenter, if for nothing else than its catchy name!).

2. So... ?

But enough background discussion: in the end, I decided to change my ad provider from the big G to a company I had not heard of until about a week ago, called AdBrite -- mostly because they topped this Handy List of Google Adsense Alternatives. And finally, as of today, I bothered to change the blog templates for the change to take effect.

3. Can hardly contain my excitement <yawn>

At this point I am curious to see what kind of ads they might be pushing to my blog. I sort of wish it were something that lots of people found totally repugnant yet completely fascinating... but the chances of that are probably low. We'll see -- maybe I need to cycle through a variety of ad sales networks before choosing my poison.

4. Commercial Proposal by Author

By the way, if anyone actually wants to advertise here -- buy a section for month-by-month advertising, selling something that actually relates to something I have written about -- let me know. I am open to bids and can show Google Analytics statistics for pricing, so you have a fair idea of what you'd get.
The only limit I will put on it is that the monthly ad space rental fee has to be a positive number of full US dollars. :-)
(you can consider it as the auction starting price)

Monday, June 28, 2010

Async HTTP Client 1.0 released

Ok, this announcement went out last week, but it is important enough to reiterate: version 1.0 of the Ning Async HTTP Client is now out!

This is good news for multiple reasons: obviously the ability to do asynchronous (read: non-blocking, i.e. no need to use one thread for each and every open connection) HTTP communication is valuable in itself. But I also hope that this spurs some friendly competition in improving all HTTP-based communication on the Java platform -- there are still features that are not implemented, and due to the complexity of efficient connection handling, there should still be room for improvement with performance as well. And finally, this should lead to wider adoption of this relatively new library, so that it gets properly battle-tested and proven.

I am actually planning on using this client even for cases where a regular blocking client would work, to see how well it performs and exactly how easy the non-blocking API is to use, compared to existing blocking alternatives.
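To give a flavor of what "non-blocking" means in practice, here is a rough sketch of basic usage with the 1.0 client: execute() returns a Future immediately instead of parking a thread on the connection. This is a hedged sketch based on the library's published basic API (AsyncHttpClient, prepareGet, Response); the URL is just a placeholder, and you would need the client jar on the classpath.

```java
import java.util.concurrent.Future;
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.Response;

public class AsyncGetSketch {
    public static void main(String[] args) throws Exception {
        AsyncHttpClient client = new AsyncHttpClient();
        // execute() returns right away; no thread is tied up waiting on the socket
        Future<Response> future = client.prepareGet("http://www.ning.com/").execute();
        // ... the calling thread is free to do other work here ...
        Response response = future.get(); // block only when the result is actually needed
        System.out.println(response.getStatusCode());
        client.close();
    }
}
```

The same call can also take a completion handler for a fully callback-driven style, which is where the thread savings really show up with many concurrent connections.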

Tuesday, June 08, 2010

Introducing My Newest Open Source project: Tr13 (that is so 1337!)

I have briefly mentioned that I have been working on a new little low-level data structure package called (Ning) Tr13, hosted at GitHub. Now that the code is getting ready and tested, it is time to "formally" introduce it.

Tr13 is an interesting project from multiple angles: for one, it is my first "corporate-sponsored" Open Source project (i.e. related to my day-to-day work -- although not the focus of it, just a useful thing for me and hopefully my colleagues at Ning). This alone is worth celebrating; the whole process has been remarkably smooth, from the 15-minute end-to-end approval process (take note, Amazon developer community...) to the steady if somewhat slow development itself (there is nothing groundbreaking here from a CS perspective -- more on this below).
This is also a rare case of my having a reasonably clear idea, up front, of where I would be using this package. It may sound surprising, but truthfully I have not strictly needed most of the open source libraries I have written before actually writing them. :-)
(there are exceptions: Java UUID Generator is something I wrote because I needed it -- but not so for Woodstox, or even, to some degree, Jackson, both of which I have used extensively after writing them).
And finally, well, it is a change from my more common data format parsing / data binding work. Although I was thinking of "climbing up the stack" to write something higher-level, I ended up going the other way -- a slight detour.

Anyway, I digress. So...

What exactly is tr13?

It is a space-efficient ("compact") read-only implementation of the trie data structure, with a slight modification to use a Patricia/radix structure for leaves (but not for branches, since that did not seem useful for my use cases).

Current implementation supports two kinds of Tries:

  1. byte[] key, variable-length integer (~= long) value
  2. byte[] key, byte[] value

In both cases, the byte[] key is typically a String, but it can be any other serializable thing; for example, ints and longs could be trivially converted to 4/8-byte keys. The separate variant for variable-length integer values is a further optimization for my initial use case, where values are generally small integers that can fit in a single byte, but where it is convenient not to be bound by arbitrary limits (like 256).
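To show why small values fit in a single byte, here is a sketch of a generic variable-length integer ("vint") encoding -- 7 payload bits per byte, with the high bit marking "more bytes follow". This is the general technique, not necessarily tr13's exact wire format (which, as noted below, is not yet documented):

```java
import java.io.ByteArrayOutputStream;

public class VarintSketch {
    // Encode a non-negative long, 7 bits per byte, least-significant group first;
    // a set high bit means another byte follows.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value); // final byte has high bit clear
        return out.toByteArray();
    }

    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(5).length);    // 1: values under 128 need one byte
        System.out.println(encode(300).length);  // 2: larger values grow as needed
        System.out.println(decode(encode(300))); // round-trips back to 300
    }
}
```

So a count of, say, 7 costs one byte, while nothing stops a value from growing to the full long range -- exactly the "small but unbounded" property mentioned above.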

Internally, the structure is just a regular byte array (or, alternatively, a ByteBuffer, to allow for "GC-free" tries!). The data format still needs to be documented, but it is a rather simple implementation of basic trie ideas, nothing to write a dissertation about.

Why does it Rock?!

Given how good the plain old JDK HashMap can be -- you could just use HashMap<String,Integer> for the first case -- what is so special about tr13?

Simply put: memory usage, or lack thereof (ditto for GC activity). This is because:

  1. Overall memory usage is pretty tight -- for the test data I have (30 million entries, typically a 13-byte key with a small integer value, as a 600 meg input file), the trie structure actually uses less memory than its file representation (about 35% less). And,
  2. Since it all lives in just TWO objects (the raw byte array, plus the wrapper Trie object), it should keep GC rather happy.

This is not coincidental: the use case we have is a sizable routing table that should ideally be kept fully in memory. Hence compactness of the data structure, with reasonable lookup times (i.e. no linear search), was a key design criterion. Even other somewhat frugal data structures (like B-trees) are not anywhere near as space-efficient.


So what are the downsides? The first obvious limitation is that it is a read-only data structure: you must have all the data at hand when building it. The data also has to be ordered -- not necessarily alphabetically (although that is the usual way), but in a stable lexicographic ordering, so that the build process has the invariants it needs. This is not as big a limitation as it might seem, at least once merging functionality (more on this later on) is added. For now it may be necessary to rebuild the data structure regularly, and keep an additional "short-term" lookup structure at hand for additions needed between rebuilds.
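The "stable lexicographic ordering" requirement is easy to satisfy with the JDK: sort byte[] keys as unsigned byte sequences, for example via a TreeMap with a custom comparator, so that keys sharing a prefix arrive together during the build. A sketch (the comparator is mine for illustration, not part of tr13's API):

```java
import java.util.Comparator;
import java.util.TreeMap;

public class ByteKeyOrder {
    // Unsigned lexicographic ordering over byte[] keys: the stable ordering
    // a trie build pass needs, so entries with shared prefixes are adjacent.
    static final Comparator<byte[]> LEXICOGRAPHIC = new Comparator<byte[]>() {
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int cmp = (a[i] & 0xFF) - (b[i] & 0xFF); // compare bytes as unsigned
                if (cmp != 0) return cmp;
            }
            return a.length - b.length; // a key sorts before its extensions
        }
    };

    public static void main(String[] args) {
        TreeMap<byte[], Integer> sorted = new TreeMap<byte[], Integer>(LEXICOGRAPHIC);
        sorted.put("foobar".getBytes(), 1);
        sorted.put("foo".getBytes(), 2);
        sorted.put("bar".getBytes(), 3);
        // Iteration order: "bar", "foo", "foobar" -- ready to stream into a builder
        for (byte[] key : sorted.keySet()) {
            System.out.println(new String(key));
        }
    }
}
```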

As to performance, the resulting data structure is not the fastest way to find things: from a complexity standpoint, lookup is linear in key length, compared to theoretically constant time for hash-based alternatives. This may or may not matter; it should not matter much with respect to key length (a HashMap has to calculate the hash for the key and compare key bytes anyway, unless keys can be canonicalized). Another potential performance concern is that the trie data structure does not have much locality; but this too is unlikely to be much worse than for typical alternatives. In fact, due to the high memory efficiency, and the use of a plain byte array, it is not guaranteed that HashMap has any better locality, even if a single lookup results in a match (this is due to the Java GC's habit of "shuffling" object data around).

All in all, Tr13 does much more work when finding values, since it traverses the data structure once per key byte (except for the Patricia-optimized tail, which may be a multi-byte linear match). So I would not expect it to be as fast as direct hash lookups.
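The "once per key byte" cost is easiest to see with a toy pointer-based trie -- tr13 walks a packed byte array rather than heap objects, but the traversal shape is the same. A minimal sketch of mine, purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class TrieLookupSketch {
    // Toy trie node; tr13 packs all of this into one byte array instead.
    static class Node {
        Integer value; // null if no entry ends at this node
        Map<Byte, Node> children = new HashMap<Byte, Node>();
    }

    static void put(Node root, byte[] key, int value) {
        Node node = root;
        for (byte b : key) {
            Node child = node.children.get(b);
            if (child == null) {
                child = new Node();
                node.children.put(b, child);
            }
            node = child;
        }
        node.value = value;
    }

    static Integer get(Node root, byte[] key) {
        Node node = root;
        for (byte b : key) { // one traversal step per key byte: O(key length)
            node = node.children.get(b);
            if (node == null) return null;
        }
        return node.value;
    }

    public static void main(String[] args) {
        Node root = new Node();
        put(root, "foo".getBytes(), 42);
        System.out.println(get(root, "foo".getBytes())); // 42
        System.out.println(get(root, "bar".getBytes())); // null
    }
}
```

A HashMap does one hash computation plus (usually) one bucket probe regardless of how many entries share a prefix, which is why a constant-factor gap in lookup speed is to be expected.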

But in the end, these are speculations, and the real test would be use.

... or, how about a performance test?

Performance Testing

Ok: the question of efficiency is worth looking deeper into. And so I wrote a benchmark that compares VIntTrieLookup (the byte[]-to-vint trie) against HashMap<String,Integer> (for the curious, it's under 'src/java/test/manual/' in the source tree). The test loads all the entries in memory, and then performs randomized lookups (one lookup for every key, but with the order shuffled so as to differ from the 'natural' ordering of the trie).

When given a subset of the whole data file (it turns out HashMap hogs so much memory that I just can't test the whole data set!) -- a 32 meg, 2 million entry chunk -- here are the results:

Memory usage:

  1. Tr13: 20.6 megs (reduction of about one third)
  2. HashMap<String,Integer>: 314 megs

So in this case, HashMap<String,Integer> uses about 15 times more memory than Tr13 (note: Integer values are canonicalized using Integer.valueOf -- without this, usage would be 360 megs).
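The Integer.valueOf note matters because valueOf reuses cached instances for small values (at minimum -128 through 127, per the language specification), so a map holding millions of small counts can share a handful of Integer objects, while explicit construction allocates a fresh object every time:

```java
public class IntegerCanonicalization {
    public static void main(String[] args) {
        // Integer.valueOf returns a cached instance for small values...
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b); // true: same object, no extra allocation

        // ...whereas new Integer always allocates (hence the 360 vs 314 meg gap
        // for a data set where most values are small).
        Integer c = new Integer(100);
        Integer d = new Integer(100);
        System.out.println(c == d); // false: two distinct objects
    }
}
```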

And what about performance? What is the cost of compactness?

For 2 million lookups (one per entry), time taken was:

  1. Tr13: about 5000 milliseconds (about 400,000 lookups per second)
  2. HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000 lookups per second)

so HashMap is about 4 times faster than Tr13 for lookups, on my system (2.33 GHz Mac mini, JDK 1.6), for the given data set.

Since my first need is just to be able to do lookups in memory, the performance difference is not of particular importance (there might be one or two lookups per request, so anything in the tens of thousands per second would work), but it is good to know the rough order-of-magnitude difference.

What next?

At this point Tr13 should be ready for use, and I plan to release version 0.5 soon. The only thing I am waiting for is to get a Maven repository set up at the Sonatype OSS repository (kudos to Sonatype for this free hosting!), so that Maven-based (and, I assume, Ivy-based) projects can use it. This is actually what I need for my daytime use.

But beyond Maven deployment, there are some additional features I want to implement in near future, such as:

  • Ability to traverse through entries (Map.Entry<> style)
  • Ability to merge two tries efficiently -- could be built quite easily given iterator functionality
  • ... documenting the data structure (ok ok, not a development task, but "hard" (read: tedious) yet necessary work) -- it is very simple, but not so simple that reading the code would make it obvious.
  • Current implementations are limited to 2 gigs, due to Java byte[] and ByteBuffer limitations. But fundamentally there is nothing particularly hard about writing segmented versions where only individual segments would have this limitation (well, probably values too, but I don't see any need for 2GB+ values in my tries...)

Merging functionality specifically would be useful for "semi-mutable" lookup data structures. For data that evolves slowly (which is the case for the routing table use case I have), it is possible to keep track of needed modifications (in, say, a HashMap), do the primary lookup against that data structure, and only if no match is found, fall back to the immutable trie. And when this short-term data structure grows enough, or enough time passes, the trie can be re-built with the existing and new data. Rinse and repeat.
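The overlay-plus-rebuild pattern just described can be sketched with JDK types alone; here a plain Map stands in for the read-only trie (class and method names are mine, not part of Tr13):

```java
import java.util.HashMap;
import java.util.Map;

public class SemiMutableLookup {
    // Recent changes live in a small mutable overlay; the bulk of the data
    // lives in an immutable structure that is only replaced wholesale.
    private Map<String, Integer> immutable;
    private final Map<String, Integer> overlay = new HashMap<String, Integer>();

    SemiMutableLookup(Map<String, Integer> initial) {
        this.immutable = initial;
    }

    void put(String key, int value) {
        overlay.put(key, value); // writes only ever touch the overlay
    }

    Integer get(String key) {
        Integer v = overlay.get(key);            // primary lookup: recent changes
        return (v != null) ? v : immutable.get(key); // fallback: immutable bulk
    }

    // When the overlay grows enough (or enough time passes), fold it in.
    void rebuild() {
        Map<String, Integer> merged = new HashMap<String, Integer>(immutable);
        merged.putAll(overlay);
        immutable = merged; // with Tr13: rebuild the trie from sorted entries
        overlay.clear();
    }
}
```

With an actual trie underneath, rebuild() would stream the merged, lexicographically sorted entries through the trie builder; the lookup path stays the same.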

And perhaps it might even be possible to do some more work on the fundamental data structure itself, and find ways to make it mutable. Insertion (and deletion) operations would be rather costly, due to the global changes needed (another downside of compactness is that changes tend to spill far). But even slow mutability would be better than no mutability; and perhaps there could be ways to trade some space (intentional unused regions within the trie, to be used for insertions?) for a lower likelihood of having to move all the data around.

About me

  • I am known as Cowtowncoder
  • Contact me
Check my profile to learn more.