Introducing My Newest Open Source project: Tr13 (that is so 1337!)
I have briefly mentioned that I have been working on a new little low-level data structure package, called (Ning) Tr13, hosted at GitHub. Now that the code is getting ready and tested, it is time to "formally" introduce it.
Tr13 is an interesting project from multiple angles: for one, it is my first "corporate-sponsored" Open Source project (i.e. related to my day-to-day work -- although not the focus of it, just a useful thing for me and hopefully my colleagues at Ning). This alone is worth celebrating; the whole process has been remarkably smooth, from the 15-minute end-to-end approval process (take note, Amazon developer community...), to steady if somewhat slow development itself (there is nothing groundbreaking from a CS perspective -- more on this below).
This is also a rare case of my already having a reasonably clear idea of where I will be using this package. It may sound surprising, but truthfully, I have not absolutely needed most of the open source libraries I have written before actually writing them. :-)
(there are exceptions: Java UUID Generator is something I wrote because I needed it -- but not so for Woodstox, or even Jackson to some degree, both of which I have used extensively only after writing them).
And finally, well, it is a change from my more common data format parsing / data binding work. Although I was thinking of "climbing up the stack" to write something higher level, I ended up going the other way, a slight detour.
Anyway, I digress. So...
What exactly is tr13?
It is a space-efficient ("compact") read-only implementation of the trie data structure, with a slight modification to use a Patricia/radix structure for leaves (but not for branches, since that did not seem useful for my use cases).
Current implementation supports two kinds of Tries:
- Byte key, variable integer (~= long) value
- Byte key, byte value
In both cases, the key is typically a String, but it can be any other serializable thing; for example, ints and longs could be trivially converted to 4/8-byte keys. The separate variant for variable-length integers is a further optimization for my initial use case, where values are generally small ints that can fit in a single byte, but where it is convenient not to be bound by arbitrary limits (like 256).
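As a quick illustration of turning numbers into fixed-size byte keys (plain JDK code, not part of Tr13 itself), big-endian encoding has the nice property that byte-lexicographic order matches numeric order for non-negative values, which matters when build input must be sorted by key bytes:

```java
import java.nio.ByteBuffer;

public class KeyBytes {
    // Convert an int to a 4-byte big-endian key. Big-endian ensures that
    // sorting keys byte-by-byte agrees with numeric order for values >= 0.
    public static byte[] intKey(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Same idea for long: an 8-byte big-endian key.
    public static byte[] longKey(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }
}
```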
Internally the structure is just a regular byte array (or, alternatively, a ByteBuffer, to allow for "GC-free" tries!). The data format still needs to be documented, but it is a rather simple implementation of basic trie ideas, nothing to write a dissertation about.
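To give an idea of the basic per-byte lookup, here is a naive pointer-based trie sketch -- for illustration only; the actual tr13 encoding packs the whole structure into a single byte array instead of using node objects:

```java
import java.util.HashMap;
import java.util.Map;

// Naive pointer-based byte trie: illustrates the lookup idea only.
// Tr13 instead encodes the whole structure into one byte array.
public class NaiveTrie {
    static class Node {
        Map<Byte, Node> children = new HashMap<>();
        Long value; // null means this node does not terminate a key
    }

    private final Node root = new Node();

    public void put(byte[] key, long value) {
        Node n = root;
        for (byte b : key) {
            n = n.children.computeIfAbsent(b, k -> new Node());
        }
        n.value = value;
    }

    public Long get(byte[] key) {
        Node n = root;
        for (byte b : key) { // one hop per key byte: O(key length)
            n = n.children.get(b);
            if (n == null) {
                return null; // no entry with this prefix
            }
        }
        return n.value;
    }
}
```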
Why does it Rock?!
Given how good plain old JDK HashMap can be -- you could just use HashMap<String,Integer> for the first case -- what is so special about tr13?
Simply put: memory usage, or lack thereof (ditto for GC activity). This is because:
- Overall memory usage is pretty tight -- for the test data I have (30 million entries of typically 13-byte key plus a small integer value == 600 meg input file), the trie structure actually uses about 35% less memory than its file representation. And,
- Since it is all in just TWO objects (raw byte array, wrapper Trie object), it should keep GC rather happy.
This is not coincidental: the use case we have is that of sizable routing tables that should ideally be kept fully in-memory. Hence compactness of the data structure, with reasonable lookup times (i.e. no linear search), was a key design criterion. Even other somewhat frugal data structures (like b-trees) are nowhere near as space-efficient.
The first obvious limitation is that it is a read-only data structure: you must have all the data at hand when building it. The data also has to be ordered -- not necessarily alphabetically (although that is the usual way), but in a stable lexicographic ordering, so that the build process has the invariants it needs. This is not as big a limitation as it might seem, at least once merging functionality (more on this later on) is added. For now it may be necessary to re-build the data structure regularly, and keep an additional "short-term" lookup structure at hand for additions needed between re-builds.
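One way to get a stable byte-lexicographic ordering in plain Java is a TreeMap with an unsigned byte-array comparator; this is illustrative only, and not tr13's actual build API:

```java
import java.util.TreeMap;

// Build input for a read-only trie must arrive in a stable byte order;
// a TreeMap with an unsigned lexicographic comparator gives exactly that.
public class SortedInput {
    public static TreeMap<byte[], Long> sortedEntries() {
        return new TreeMap<>((a, b) -> {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                // mask to 0xFF so bytes compare as unsigned values
                int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return a.length - b.length; // shorter key (prefix) sorts first
        });
    }
}
```

Iterating such a map yields entries in the order a builder would want them.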
As to performance, the resulting data structure is not the fastest way to find things: from a complexity standpoint, lookup is linear in key length, compared to theoretically constant time for hash-based alternatives. This may or may not matter; it should not matter much with respect to key length (a HashMap has to calculate a hash for the key and compare key bytes anyway, unless keys can be canonicalized). Another potential performance concern is that the trie data structure does not have much locality; but this too is unlikely to be much worse than for typical alternatives. In fact, due to high memory efficiency, and use of a plain byte array, it is not guaranteed that HashMap has any better locality, even if a single lookup results in a match (this due to the Java GC's habit of "shuffling" object data around).
All in all, Tr13 does much more work when finding values, since it traverses the data structure once per key byte (except for the Patricia-optimized tail, which may be a multi-byte linear match). So I would not expect it to be as fast as direct hash lookups.
But in the end, these are speculations, and the real test would be use.
... or, how about a performance test?
Ok: the question of efficiency is something worth looking deeper into. And so I wrote a benchmark that compares VIntTrieLookup (byte-to-vint trie) against HashMap<String,Integer> (for the curious, it's under 'src/java/test/manual/' in the source tree). The test loads all the entries into memory, and then performs randomized lookups (one lookup for every key, but with the order shuffled so as to differ from the 'natural' ordering of the trie).
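The general shape of such a lookup benchmark can be sketched as follows -- this is an illustrative re-creation, not the actual test code; the Map here stands in for either contender:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Benchmark shape: load entries, shuffle the keys away from their
// natural order, then time one lookup per key.
public class LookupBench {
    public static long timeLookupsMs(Map<String, Integer> map) {
        List<String> keys = new ArrayList<>(map.keySet());
        Collections.shuffle(keys); // avoid iteration-order artifacts
        long start = System.nanoTime();
        long sum = 0;
        for (String key : keys) {
            sum += map.get(key); // one lookup per key
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (sum == Long.MIN_VALUE) { // keep the JIT from eliding the loop
            System.out.println(sum);
        }
        return elapsedMs;
    }
}
```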
When given a subset of the whole data file (turns out HashMap hogs so much memory that I just can't test the whole data set!) -- a 32 meg, 2 million entry chunk -- here are the results:
- Tr13: 20.6 megs (reduction of about one third)
- HashMap<String,Integer>: 314 megs
So in this case, HashMap<String,Integer> uses about 15 times as much memory as Tr13 (note: Integer values are canonicalized using Integer.valueOf -- without this, usage would be 360 megs).
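For reference, one common way to approximate retained heap for a structure in plain Java is to diff used heap around its construction, with GC nudges -- this is an assumption about methodology, not necessarily how the numbers above were measured:

```java
// Approximate used heap; call before and after building the structure
// and diff the results. GC-based measurement is inherently fuzzy.
public class MemUse {
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < 3; i++) {
            rt.gc(); // nudge (not force) a collection to settle the heap
        }
        return rt.totalMemory() - rt.freeMemory();
    }
}
```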
And what about performance? What is the cost of compactness?
For 2 million lookups (one per key), time taken was:
- Tr13: about 5000 milliseconds (about 400,000 lookups per second)
- HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000 lookups per second)
so HashMap is about 4 times faster than Tr13 for lookups, on my system (2.33 GHz Mac Mini, JDK 1.6), for the given data set.
Since my first need is simply to be able to do lookups in memory, the performance difference is not of particular importance (there might be one or two lookups per request, so anything in the tens of thousands per second would work), but it is good to know the rough order-of-magnitude difference.
At this point Tr13 should be ready for use, and I plan to release version 0.5 soon. The only thing I am waiting for is to get a Maven repository set up at Sonatype OSS repository (kudos to Sonatype for this free hosting!), so that Maven-based (and I assume, Ivy-based) projects can use it. This is actually what I need for my daytime use.
But beyond Maven deployment, there are some additional features I want to implement in near future, such as:
- Ability to traverse through entries (Map.Entry<> style)
- Ability to merge two tries efficiently -- could be built quite easily given iterator functionality
- ... document the data structure (ok ok, not a development task, but "hard" (read: tedious) yet necessary work) -- it is very simple, but not so simple that reading code would make it obvious.
- Current implementations are limited to 2 gigs, due to Java byte array and ByteBuffer limitations. But fundamentally there is nothing particularly hard about writing segmented versions, where only individual segments would have this limitation (well, probably values too, but I don't see any need for 2GB+ values in my tries...)
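To sketch why merging follows easily from iteration: given two entry iterators in key order, a classic two-way merge does the job. This is illustrative code with an assumed String-keyed entry type (newer entries win on key collisions):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Two-way merge of sorted entry streams, the core of merging two tries:
// advance whichever iterator holds the smaller key; newer wins on ties.
public class SortedMerge {
    public static List<Map.Entry<String, Long>> merge(
            Iterator<Map.Entry<String, Long>> older,
            Iterator<Map.Entry<String, Long>> newer) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Map.Entry<String, Long> a = older.hasNext() ? older.next() : null;
        Map.Entry<String, Long> b = newer.hasNext() ? newer.next() : null;
        while (a != null || b != null) {
            int cmp = (a == null) ? 1
                    : (b == null) ? -1
                    : a.getKey().compareTo(b.getKey());
            if (cmp < 0) {
                out.add(a);
                a = older.hasNext() ? older.next() : null;
            } else if (cmp > 0) {
                out.add(b);
                b = newer.hasNext() ? newer.next() : null;
            } else { // same key: newer entry replaces older
                out.add(b);
                a = older.hasNext() ? older.next() : null;
                b = newer.hasNext() ? newer.next() : null;
            }
        }
        return out;
    }
}
```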
Merging functionality specifically would be useful for "semi-mutable" lookup data structures. For data that evolves slowly (which is the case for the routing table use case I have), it is possible to keep track of needed modifications (in, say, a HashMap), do the primary lookup against that data structure, and only if no match is found, go against the immutable trie. And when this short-term data structure grows enough, or enough time passes, the trie can be re-built with existing and new data. Repeat and rinse.
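The overlay idea above can be sketched like so -- hypothetical names throughout, with a Function standing in for the trie's lookup since the actual tr13 API is not shown here:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// "Semi-mutable" overlay: recent changes live in a small HashMap that is
// consulted first; misses fall through to the immutable trie's lookup.
public class OverlayLookup {
    private final Map<String, Long> recent = new HashMap<>();
    private final Function<String, Long> baseLookup; // stands in for the trie

    public OverlayLookup(Function<String, Long> baseLookup) {
        this.baseLookup = baseLookup;
    }

    public void put(String key, long value) {
        recent.put(key, value); // folded into the trie on the next re-build
    }

    public Long get(String key) {
        Long v = recent.get(key);
        return (v != null) ? v : baseLookup.apply(key);
    }
}
```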
And perhaps it might even be possible to work some more on the fundamental data structure itself, and find ways to make it mutable. Insertion (and deletion) operations would be rather costly, due to the global changes needed (another downside of compactness is that changes tend to spill far). But even slow mutability would be better than none; and perhaps there could be ways to trade some space (intentionally unused regions within the trie, to be used for insertions?) for a lower likelihood of having to move all the data around.