I have briefly mentioned that I have been working on a new little
low-level data structure package, called (Ning) Tr13, hosted at GitHub.
Now that the code is getting ready and tested, it is time to "formally"
introduce it.
Tr13 is an interesting project from multiple angles: for one, it is my
first "corporate-sponsored" Open Source project (i.e. related to my
day-to-day work -- although not the focus of it, just a useful thing for me
and hopefully my colleagues at Ning). This alone is worth celebrating;
the whole process has been remarkably smooth, from the 15-minute end-to-end
approval process (take note, Amazon developer community...), to steady if
somewhat slow development itself (there is nothing groundbreaking from
a CS perspective -- more on this below).
This is also a rare case of my
already having a reasonably clear idea as to where I will be using this
package. It may sound surprising, but truthfully, most open source
libraries I have written I did not absolutely need
before writing them. :-)
(There are exceptions: Java UUID Generator is
something I wrote because I needed it -- but not so for Woodstox, or even
Jackson, to some degree, both of which I have come to use extensively
only after writing them.)
And finally, well, it is a change from my more common
data format parsing / data binding work. Although I was thinking of
"climbing up the stack" to write something higher level, I ended up
going the other way, a slight detour.
Anyway, I digress. So...
What exactly is tr13?
It is a space-efficient ("compact") read-only implementation of the Trie
data structure, with a slight modification to use a Patricia/Radix structure
for leaves (but not for branches, since that did not seem useful for my
use cases).
The current implementation supports two kinds of tries:
Byte key, variable integer (~= long) value
Byte key, byte value
In both cases, the byte key is typically a String, but can be any other
serializable thing; for example, ints and longs could be trivially
converted to 4/8 byte keys. The separate variant for variable-length
integers is a further optimization for my initial use case, where values
are generally small ints which can fit in a single byte, but where it
is convenient not to be bound by arbitrary limits (like 256).
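To make the variable-length integer idea concrete, here is a minimal sketch of one common encoding scheme (7 bits of payload per byte, high bit as a continuation flag). Tr13's actual wire format is undocumented as of this writing, so this is purely illustrative, not the format Tr13 uses:

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Encode a non-negative long as 7-bit groups, least significant first;
    // the high bit of each byte is set when more bytes follow.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Decode the byte sequence back into a long.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        // Small values fit in a single byte; larger ones grow as needed.
        System.out.println(encode(5).length);
        System.out.println(encode(300).length);
        System.out.println(decode(encode(123456789L)));
    }
}
```

This gives exactly the property described above: values under 128 take one byte, but there is no hard upper bound like 256.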
Internally, the structure is actually just a regular byte array (or,
alternatively, a ByteBuffer, to allow for "GC-free" tries!). The data format
needs to be documented, but it is a rather simple implementation of basic
trie ideas, nothing to write a dissertation about.
Why does it Rock?!
Given how good the plain old JDK HashMap can be -- you could just use
HashMap<String,Integer> for the first case -- what is so special about
Tr13?
Simply put: memory usage, or lack thereof (ditto for GC activity).
Overall memory usage is pretty tight -- for the test data I have (30
million entries of typically 13-byte key plus small integer value, a 600
meg input file), the trie structure actually uses less memory than its
file representation (about 35% less). And,
since it is all in just TWO objects (the raw byte array, plus a wrapper Trie
object), it should keep GC rather happy.
This is not coincidental, since the use case we have is that of sizable
routing tables that should ideally be kept fully in-memory. Hence
compactness of the data structure, with reasonable lookup times (i.e. no
linear search), was a key design criterion. Even other somewhat frugal data
structures (like B-trees) are nowhere near as space efficient.
The first obvious limitation is that it is a read-only data structure: you
must have all the data at hand when building it. The data also has to be
ordered -- not necessarily alphabetically (although that is the usual
way), but in a stable lexicographic ordering, so that the build process
has the invariants it needs. This is not as big a limitation as it might
seem, at least once merging functionality (more on this later on) is
added. For now it may be necessary to re-build the data structure regularly,
and keep an additional "short-term" lookup data structure at hand for
additions that are needed between re-builds.
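One subtlety worth noting about the stable ordering requirement: since Java bytes are signed, a naive byte comparison would mis-order keys with high-bit bytes. A sketch of an unsigned lexicographic comparator (my illustration, not part of Tr13's API) that could be used to pre-sort byte keys before a build:

```java
import java.util.Arrays;
import java.util.Comparator;

public class ByteKeyOrder {
    // Unsigned lexicographic comparison of byte arrays; a shorter key that
    // is a prefix of a longer one sorts first. Masking with 0xFF is what
    // makes the comparison unsigned (Java bytes are signed by default).
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    };

    public static void main(String[] args) {
        byte[][] keys = { "foo".getBytes(), "bar".getBytes(), "foobar".getBytes() };
        Arrays.sort(keys, UNSIGNED_LEX);
        for (byte[] k : keys) {
            System.out.println(new String(k)); // bar, foo, foobar
        }
    }
}
```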
As to performance, the resulting data structure is not the fastest way to
find things: from a complexity standpoint, lookup is linear in key length,
compared to theoretically constant time for hash-based alternatives.
This may or may not matter; it should not matter a lot with respect to
key length (a HashMap also has to calculate the hash for a key and compare
key bytes, unless keys can be canonicalized). Another potential performance
concern is that the trie data structure does not have much locality; but
this too is unlikely to be much worse than for typical alternatives. In fact,
due to high memory efficiency, and use of a plain byte array, it is not
guaranteed that HashMap has any better locality, even if a single lookup
results in a match (this due to the Java GC's habit of "shuffling" object
data around the heap).
All in all, Tr13 does much more work when finding values, since it
traverses the data structure once per key byte (except for the
Patricia-optimized tail, which may be a multi-byte linear match). So I
would not expect it to be as fast as direct hash lookups.
But in the end, these are speculations, and the real test would be use.
... or, how about a performance test?
Ok: the question of efficiency is something worth looking deeper into.
And so I wrote a benchmark that compares VIntTrieLookup (byte-to-vint
trie) against HashMap<String,Integer> (for the curious, it's under
'src/java/test/manual/' in the source tree). The test loads all the entries
into memory, and then performs randomized lookups (one lookup for every key,
but with order shuffled so as to differ from the 'natural' ordering of
the data).
When given a subset of the whole data file (it turns out HashMap hogs so
much memory that I just can't test the whole data set!) -- a 32 meg, 2
million entry chunk -- here are the results:
Tr13: 20.6 megs (about a third less than the 32 meg file representation)
HashMap<String,Integer>: 314 megs
So in this case, HashMap<String,Integer> uses about 15 times more
memory than Tr13 (note: Integer values are canonicalized using
Integer.valueOf -- without this, usage would be 360 megs).
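The canonicalization note above rests on a standard Java behavior worth spelling out: Integer.valueOf caches boxed instances for values in -128..127 (the upper bound can be raised via a JVM flag), so repeated small values share one object instead of each allocating a new one. A quick illustration:

```java
public class BoxingDemo {
    public static void main(String[] args) {
        // Integer.valueOf returns a cached instance for -128..127,
        // so two boxes of the same small value are the SAME object.
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b);   // true: identical cached instance

        // The constructor always allocates (and is deprecated since Java 9),
        // which is where the extra ~46 megs in the benchmark came from.
        Integer c = new Integer(100);
        Integer d = new Integer(100);
        System.out.println(c == d);   // false: two distinct objects
    }
}
```

With 2 million entries whose values are mostly small ints, canonicalization collapses millions of boxed objects into at most 256 shared ones.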
And what about performance? What is the cost of compactness?
For the 2 million lookups, time taken was:
Tr13: about 5000 milliseconds (about 400,000 lookups per second)
HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000
lookups per second)
so HashMap is about 4 times faster than Tr13 for lookups,
on my system (2.33 GHz Mac mini, JDK 1.6), for the given data set.
Since my first need is simply to be able to do lookups in memory, the
performance difference is not of particular importance (there might be one
or two lookups per request, so anything in the tens of thousands per second
would work), but it is good to know the rough order-of-magnitude difference.
At this point Tr13 should be ready for use, and I plan to release
version 0.5 soon. The only thing I am waiting for is to get a Maven
repository set up at Sonatype OSS repository (kudos to Sonatype for this
free hosting!), so that Maven-based (and I assume, Ivy-based) projects
can use it. This is actually what I need for my daytime use.
But beyond Maven deployment, there are some additional features I want
to implement in the near future, such as:
Ability to traverse through entries (Map.Entry<> style)
Ability to merge two tries efficiently -- this could be built quite easily
given iterator functionality
... and documenting the data structure (ok ok, not a development task, but
"hard" (read: tedious) yet necessary work) -- it is very simple, but
not so simple that reading the code would make it obvious.
Current implementations are limited to 2 gigs, due to Java byte array and
ByteBuffer limitations (both are indexed by a signed int). But fundamentally
there is nothing particularly hard about writing segmented versions where
only individual segments would have this limitation (well, probably values
too, but I don't see any need for 2GB+ values in my tries...)
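The merge-via-iteration idea mentioned in the list above is just the classic merge of two sorted streams. Here is a sketch over Map.Entry iterators (plain Java collections standing in for the tries, with a made-up "newer source wins" policy for duplicate keys):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedMerge {
    // Merge two key-sorted entry iterators into one sorted list.
    // On duplicate keys, the entry from the second ("newer") source wins.
    static List<Map.Entry<String, Integer>> merge(
            Iterator<Map.Entry<String, Integer>> oldSrc,
            Iterator<Map.Entry<String, Integer>> newSrc) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Map.Entry<String, Integer> a = oldSrc.hasNext() ? oldSrc.next() : null;
        Map.Entry<String, Integer> b = newSrc.hasNext() ? newSrc.next() : null;
        while (a != null || b != null) {
            int cmp = (a == null) ? 1
                    : (b == null) ? -1
                    : a.getKey().compareTo(b.getKey());
            if (cmp < 0) {
                out.add(a);
                a = oldSrc.hasNext() ? oldSrc.next() : null;
            } else if (cmp > 0) {
                out.add(b);
                b = newSrc.hasNext() ? newSrc.next() : null;
            } else { // same key: take the newer value, advance both
                out.add(b);
                a = oldSrc.hasNext() ? oldSrc.next() : null;
                b = newSrc.hasNext() ? newSrc.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> oldData = new TreeMap<>();
        oldData.put("a", 1); oldData.put("b", 2); oldData.put("c", 3);
        TreeMap<String, Integer> newData = new TreeMap<>();
        newData.put("b", 20); newData.put("d", 4);
        System.out.println(merge(oldData.entrySet().iterator(),
                                 newData.entrySet().iterator()));
    }
}
```

Since the output is produced in sorted order, it satisfies exactly the stable-ordering invariant the trie build process needs, which is why merging falls out almost for free once entry iteration exists.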
Merging functionality specifically would be useful for "semi-mutable"
lookup data structures. For data that evolves slowly (which is the case
for the routing table use case I have), it is possible to keep track of
needed modifications (in, say, a HashMap), do the primary lookup against
that data structure, and only if no match is found, go against the immutable
trie. And when this short-term data structure grows big enough, or enough
time passes, the trie can be re-built with existing and new data. Repeat
as necessary.
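That overlay pattern is simple enough to sketch directly. Here a plain Map stands in for the immutable trie, and the class name and rebuild threshold are my inventions, not anything in Tr13:

```java
import java.util.HashMap;
import java.util.Map;

public class OverlayLookup {
    // Immutable base (the trie, in the real thing), rebuilt only rarely.
    private final Map<String, Integer> base;
    // Small mutable overlay holding changes made since the last rebuild.
    private final Map<String, Integer> overlay = new HashMap<>();

    OverlayLookup(Map<String, Integer> base) {
        this.base = base;
    }

    // All writes go to the overlay only; the base is never touched.
    void put(String key, Integer value) {
        overlay.put(key, value);
    }

    // Check recent changes first; fall back to the immutable base.
    Integer get(String key) {
        Integer v = overlay.get(key);
        return (v != null) ? v : base.get(key);
    }

    // Signal that the base should be re-built from base + overlay
    // (e.g. via a sorted merge into a fresh trie).
    boolean needsRebuild(int threshold) {
        return overlay.size() >= threshold;
    }
}
```

Note that this simple sketch cannot express deletions; supporting those would need tombstone markers in the overlay, which is one more reason a real implementation is less trivial than it looks.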
And perhaps it might even be possible to work some more on the fundamental
data structure itself, and find ways to make it mutable. Insertion (and
deletion) operations would be rather costly, due to the global changes
needed (another downside of compactness is that changes tend to spill
far). But even slow mutability would be better than no mutability; and
perhaps there could be ways to trade some space (intentionally unused
regions within the trie, to be used for insertions?) for a lesser likelihood
of having to move all the data around.