Friday, October 29, 2010

Faster compression on Java platform with LZF

One open source project that I have worked on a bit lately is compress-lzf, a simple Java library that implements block- and stream-accessible version of LZF compression algorithm. LZF algorithm itself is very simple, and mostly (only?) documented here, in form of working reference implementation that doubles as documentation.

Our implementation is a simple rewrite of C version, and fully compatible with the reference implementation including file header and block signature (this is different from the other existing Java version that projects like H2 include). While simple implementation, it is not naive translation; and some effort has been spent on optimizing its performance.

And as should be the case for such low-level libraries, there are no external dependencies beyond JDK 1.5; meaning it should work on variety of platforms, including Android (in fact it could probably be compiled on any 1.x JDK)

1. What is LZF?

LZF compression algorithm is basic Lempel-Ziv, with little additional processing; so one way to think of it is that it is "gzip without Huffman encoding" (gzip uses Deflate, which is Lempel-Ziv followed by Huffman encoding, with variable bit-length).
And Lempel-Ziv is just method whereby one finds repeating sub-sequences in output, replacing sequences of three or more "already seen bytes" with a back-reference; matches are typically stored in a hash table indexed by hash of first 3 bytes, value being offset of last instance of potentially matching sequence.

It is worth noting that many other similar (and sometimes similarly named) algoritms likewise use basic LZ; "Fastlz" for example is very similar.

2. Gzip exists, why bother with LZF?

But why does this matter? If gzip is this and more, why would you want to use a "lesser" algorithm?

Key benefit is that of speed: LZF can be up to twice as fast to uncompress; but where it REALLY shines is compression speed -- here it can be 2x or 3x as fast (writing a stand-alone benchmark is on my todo list, so hopefully I will get to talk a bit more about this in future). Performance difference is mostly due to relatively high overhead that deflate uses for non-byte-aligned output (which is needed for optimal Huffman-encoding of codewords) -- while it definitely helps compression, it is expensive in terms of CPU usage.

LZF compression rate is lower than with gzip, but compression profile is similar: things that compress well with gzip also compress well with LZF; and those that compress less do so with LZF as well. This makes sense given that most compression often comes from LZ algorithm, which is implemented similarly.

So it all comes down to this: gzip compression is very CPU-intensive (read: slow); so much so that you typically use 5x as long to compress things like JSON data than writing content itself. With fast networks and I/O this often becomes bottleneck.
LZF can be a good alternative since its CPU overhead is much more modest; and it can still compress "fluffy" content pretty well -- where gzip can compress something by, say, 90%, LZF usually gets to 75-80% ratio (this obviously varies by types of content; basically depends on which part of deflate algorithm is more effective)

3. Usage

Assuming that above sounds interesting, you can get compress-lzf library either from GitHub; or via Maven. In latter case, information is:

  • <groupId>com.ning</groupId>
  • <artifactId>compress-lzf</artifactId>
  • current version at time of this writing is "0.6.0"

And from there you can use class from com.ning.compress.lzf package:

  • LZF is a simple command-line utility you can use for manual compression/uncompression (jar defines this class as the Main-Class so you can just invoke it with "java -jar compress-lzf.jar" to get usage information)
  • LZFInputStream and LZFOutputStream are compressing input (read LZF compressed data) and output (write LZF compressed data) streams, similar to JDK's GZIPInputStream and GZIPOutputStream
  • LZFEncoder and LZFDecoder are block-based codecs that can be used to encode (compress) and decode (uncompress) chunks (byte[]) of data

Both streaming and block-based interfaces have their uses; for example, when storing BLOBs in database, block API works better; and when reading data over network or from filesystem, streams make more sense.

3.1 Note: avoid LZFOutputStream.flush() ("just let it float...")

One note on LZFOutputStream: be careful with flush()ing -- just like other block compressors (including gzip), LZF can only do back-references within single block; and since LZFOutputStream.flush() forces closing of an open block, it can dramatically reduce compression rate. As such it is best to avoid flush()ing if at all possible, unless you actually want to create a logical boundary at some point.

4. Future plans for the library

While there are always improvements for any Open Source projects, I think the natural scope for LZF library is quite small. Similar to Java Uuid Generator, it is quite possible that there won't be need for major changes to implementation or API; which means that future improvements are most likely in areas of integration; helping other projects use LZF.

For example, given unique CPU/io/bandwidth trade-off LZF can offer, it could make a really nice component for large-scale data transfer and processing, for example:

  • Data processing with map/reduce (Hadoop)
  • Data storage (with frameworks like jDBI, Hibernate; or with NoSQL projects like Voldemort, Cassandra)
  • Data aggregation (logging, log processing, Scribe)
  • Caching: use LFZ with memcached?
  • Combination of high-performance data formats (Jackson Smile, binary JSON serialization) and compression -- can be combined with other use cases too.

So it will be interesting to see where things go in near future.

5. On benefits of working for a company that is comfortable with Open Source

Unlike with many earlier Open Source projects that I have worked on, much of work that went into this open source library was done on company time. It also was a team effort (Jon wrote the streaming handling, I wrote block codec); compared to most other open source projects where I have been the main author. So I have come a long way, having worked for a semi-OS-hostile company, then a bit for a neutral (agnostic?) company; and now working for one hat is actually engaged with community (Ning). And I must admit that I really like that Ning is openly committed to working with open source projects, not just using open source artifacts others make. This is not solely meant as an ad for Ning (although we are indeed hiring...), I just think it is important to give credit where it is due.

Beyond this, Ning section at Github already has some public projects, and there are literally a dozen more being cleaned up to be ready for publishing. Within just a few more months I think Ning will be more recognized for its contributions in Open Source space; and eventually become a signifant player. I see this as a major benefit, making it bit less difficult to hire the best developers; and especially so for a company that is still relatively small, and thus having less name recognition than the big boys.

Saturday, October 23, 2010

Java Pearls: case-insensitive Map?

Here's one more thing I did not know until very recently: how to get a Map with String keys that allows case-insensitive access? Or, case-insensitive membership checks. One typical use case is to handle case-insensitive entities such as names of HTML tags.

One obvious way to do this is to canonicalize keys by lower-casing (or upper-casing) keys both when inserted and when doing lookups. But this is actually not the only way to do this: there is a very simple JDK-provided alternative:

  Map<String,Object> tags = new TreeMap<String,Object>

and that's it; you can access values with any combination of casing. Similarly you can also create TreeSets for basic inclusion checks.

I hope this is something that most Java developers know and I had just somehow missed. But if not, here it is in case you happen to need such access.

ps. For what it's worth, the main reason I had missed this being a possibility was that I always thought comparator given to TreeMap and TreeSet would only be used for ordering. But thinking about it a bit more, this makes more sense; since comparisons must be done anyway (to find location of an entry to find or insert), why not use "equals" return value of comparator for something.

Saturday, October 16, 2010

From Canada, with Better Windmills

This article ("Canadian’s spin on windmill design touted as green-energy breakthrough") fascinated me quite a bit. Although I am not expert in the field of energy technology in general (or wind-based energy in particular), this seems credible enough, assuming article itself is accurate in its description of the progress within field. We'll see what comes of it in future.

Thursday, October 14, 2010

More on Java UUID Generator (JUG), a word on performance

Ok, so Java UUID Generator version 3 is out. Beyond question of what is new with this version (which I outlined in the previous entry), the next asked was "how is performance?".
My first thought was that it should be about the same as what version 2.0 has; but after some more thinking I realized that this is not a very useful answer to give.

So I wrote a simple micro-benchmark to find out (it's in 'src/main/java/perf/') more about performance aspects. Test uses three main generation methods (random-, time- and name-based) for JUG, and matching alternatives that java.util.UUID has (random- and limited named-based variant).

1. Performance of UUID generation, general observations

High-level summary would be:

  1. Performance between equivalent methods between JDK and JUG have similar performance (no surprise) -- JDK just does not provide time-based variant (one most commonly used by non-Java systems); nor ability to configure Random-based variant
  2. Performance between three main generation types varies a lot.

2. Actual numbers

So how big are differences? Here is actual snapshot from running the test on my MacBook: each run generates 250,000 UUIDs, and time taken (after test ran for about a minute) is reported as milliseconds:

  Test 'JDK, random' -> 974 msecs
  Test 'JDK, name' -> 230 msecs
  Test 'Jug, SecureRandom' -> 971 msecs
  Test 'Jug, java.util.Random' -> 33 msecs
  Test 'Jug, time-based' -> 39 msecs 
  Test 'Jug, name-based' -> 256 msecs

Since these values fluctuate a bit, exact values are not as important as ratios. Random-based variants are slowest for both implementations (except for case of using 'java.util.Random' -- more on this in a bit); name-hash (MD5 by default; SHA-1 available with Jug too) methods are bit faster; and time/location-based variant is rather fast, close to 50% of maximum theoretical per-node generation rate.

So generation rates for this system would be:

  • Random-based method (using SecureRandom) generates about 250k UUIDs per second
  • Name-based variant (MD5) without namespace is almost 4 times as fast, generating about 1 million UUIDs per second
  • Time-based variant (Ethernet address plus timestamp) is much faster -- almost 20 times as fast as Random-based default variant -- generating about 5 million UUIDs per second. This is also close to theoretical maximum rate for the method, which is 10 million UUIDs per second (sustained), because timestamp granularity (100 nanoseconds) prevents higher throughput (to work around this, one can generate parallel generators using different Ethernet address)

Quite a difference! It is worth noting, too, that speed of name-based method is related to both existence of namespace (this test did not use one; adding one slows things down slightly) and length of name used; latter because hash is calculated on name used. And finally, hash method used also matters (SHA-1 is slower than MD5).

But what is up with the random variant? And didn't I forget one impressive number from my analysis above?

3. Why is Random-based generation so slow?

Constructing "secure" pseudo-random numbers is not easy, and this is the main reason for slowness. Looking at stack traces, it looks like JDK actually uses random device provided by the underlying operating system; and this is part of the slowness (due to use of JNI). But generation itself is fundamentally slow, even if no system calls were needed. It is quite possible that faster machines might not even speed up performance significantly.

4. Couldn't we just use 'java.util.Random' to speed things up?

But if the "secure" (i.e. "good", meaning one providing good level of randomness) variant is slow, couldn't we use a faster one? Like "java.util.Random"? In fact, didn't numbers above show using it resulting in version as fast as using time-based variant?

Yes, this is possible: and JUG allows you to pass any java.util.Random instance you want (it just defaults to SecureRandom).
But speed is not everything: the main point of UUIDs is the uniqueness, and this is why SecureRandom is used as the default by both JUG and JDK. And while the default pseudo-random number generator that java.util.Random uses is fine for many uses, I am not sure I would trust its simple algorithm for generating UUIDs; even if initialized using system clock (System.currentTimeMillis()). The underlying algorithm just does not have any actual source of randomness beyond initial seed value and I do not trust that its random number distribution, then, is good enough to give strong guarantees of uniqueness (I would love pointers to someone discussing actual measurements or research on this!).

So to me downgrading to using simple text-book pseudo-number algorithms does not seem tempting, even if it speeds things up by an order of magnitude.

5. So what would make more sense?

Well, looking at numbers, how about consider time-based variant? Name-based variant is not quite as useful; not so much due to performance aspects, but since it actually requires you to hand source of uniqueness (name to hash).

6. Thoughts on relevance of above

I was bit surprised by how big the difference actually is. But whether it really matters depends on use case -- oftentimes you don't need even thousands of UUIDs per second (most services can not really serve even one thousands requests per second), so the rate of a quarter million per second is just fine. After all, it only takes about four microseconds (on average) to generate one UUID.

But there are use cases where this matters more; especially if UUIDs are used liberally, for example using UUIDs as identifiers for all business objects generated during processing.
So if your use case truly depends on (or at least benefits from) fast generation of UUIDs, you probably want to use time-based generation method from Jug.

7. How did I do that again?

Ok; since I haven't yet gotten Javadocs set up, here is the way to generate time-based UUIDs.

  // need to pass Ethernet address; can either use real one (shown here)
  EthernetAddress nic = EthernetAddress.fromInterface();
  // or bogus which would be gotten with: EthernetAddress.constructMulticastAddress()
  TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(nic);
  // also: we don't specify synchronizer, getting an intra-JVM syncer; there is
  // also external file-locking-based synchronizer if multiple JVMs run JUG
  UUID uuid = uuidGenerator.generate();

And that's it. Other variants are generated similarly, by creating generator instance with Generators factory.

Wednesday, October 13, 2010

Something new, something rather old: Java Uuid Generator, version 3.0

As briefly mention in the previous blog entry, Java Uuid Generator version 3.0 was just released. For those who do not know it, JUG is my first widely used Open Source library, dating to something like 2002; written when JDK 1.1 was still the bee's knees.

Version 3.0 is just a bit of refurbishing (2.0 was released almost 5 years ago), and no new functionality was added.
But it's good cleanup still, achieving following:

  • Move to using standard java.util.UUID (which came with JDK 1.4) for better interoperability
  • Build jar as an OSGi bundle (it has no dependencies, no big deal)
  • Deploy to Maven repos (using Sonatype OSS repository), use Maven for build as well
  • Better API: instead of a global static singleton UUIDGenerator, construct generator instances (for different methods)
  • Simplified synchronization (generators are fully synchronized)
  • Offer access to Ethernet address (MAC address) without native code, using JDK 1.6 method 'NetworkInterface.getHardwareAddress()'

Obvious downside is that API did change, so existing code that uses JUG 2.0 will not work as is. But since classes are now in package "com.fasterxml.uuid"), it is possible for both versions to co-exist. 3.0 will also require JDK 1.4 (2.0 could run on 1.1); or 1.6 if Ethernet address access functionality is needed.

Anyway, if you are using JUG 2.0, update probably makes sense and should be easy to do.

ps. Why use JUG over java.util.UUID? JDK version can only generate random-number based UUIDs and convert from bytes to UUIDs (or vice versa) -- JUG can generate other variants (name-based, time/location based).

Tuesday, October 12, 2010

Look back on "prioritizing Open Source projects" (from May 2010)

It has been more than 4 months since I wrote about my experiences with priorization for Open Source projects, it seems like good time to see how things have been moving.

Looks like there are two ways to look at things -- whether glass is half full or half empty -- as I have pretty much completed 50% of tasks; but not necessarily in order of priority. And this even thought I publicly outlined priorities.

One positive thing is that the top entry (Java UUID Generator 3.) was just completed; and the second entry (Woodstox 4.1) is nicely in progress, to be completed within a month or two. On the other hand, other two completed tasks (both related to Jackson 1.6 that was completed a month ago) were entries listed as having the lowest priority. Some entries not on the list were also completed; specifically work with Async HTTP Client and OAuth signature calculation.

I guess I think this is reasonable outcome, as priority lists for my "hobby" development are there to help and assist, not to drive to specific business goals or to rein in my creativity. So even more important than getting things done in "right order" is that things do get done. So as long as more important or more urgent things are more likely to get worked on than less important or urgent things, overall efficiency remains brutally high, which is the way I like it. Finally, part of the reason for fluctuating order of execution is due to some tasks being more interesting than others; and working on "most interesting" things tends to maximize amount of progress (in contrasts to working on less interesting but more highly prioritized things).

But to get some closure on this entry, let's consider this a completed 4-month Scrum and create an updated priority list. Here's what it might look like:

  1. Complete Woodstox 4.1 (XML Schema, other user requested features) -- carry-over from the original list
  2. Aalto 1.0: finalize async API, implementation
  3. Jackson 1.7: focus on extensibility (module registration, contextual serializers)
  4. ClassMate (1.0?) -- library for fully resolving generic types; based on Jackson code
  5. External version of Mr Bean (from Jackson 1.6)
  6. StaxMate 2.1? (from the original list)
  7. Tr13 1.0? (from the original list)
  8. ... and then re-consider

This is an incomplete list and I expect roughly similar completion rate if I was to look back again in 4 months. Maybe I should start doing quarterly project reviews just for fun. :-)

Monday, October 11, 2010

Jackson tips: overriding serializer or type for Map keys, values, Collection/List values

I have only recently realized that much of Jackson functionality is a unknown to most users. Part of unawareness is due to lack of documentation; but even though documentation has arguably been improving over time (at least Javadocs and rest of FasterXML wiki), there are limits to what can be achieved by feature-listing style of documentation. To add supporting documentation, then, I am trying to write bit more about both specific features that could use more spotlight, as well as more how-to style explanations.

So let's start with something reasonably simple that is not universally known by Jackson users (this is actually based on my answer to one expert user):

Jackson Tip of 21-Oct-2010: annotations for overriding deserialization settings of Collection and Map properties

Let's say that you have bean like this:

class Bean {
  public List<Object> values;

where you want to deserialize List entries as type 'ValueBean'. To specify type of 'values' property itself you could use annotation; but this can not be used to define content type (since annotation values can not be generic types, only classes). So how do you do specify deserialization type of List values (or Map values)? One possibility would be specifying a custom deserializer, but that would be an overkill.

It is actually simple enough: there is another @JsonDeserialize property to use for this use case:

class Bean {
  public List<Object> values;

which also works for Map values (for Map keys you would use @JsonDeserialize.keyAs).

Similarly, you can explicitly specify deserializer to use for keys and values using @JsonDeserialize.contentUsing and @JsonDeserialize.keyUsing instead of basic @JsonDeserialize.using which is used for the property itself:

class Bean {
  public List<Object> values;

Tuesday, October 05, 2010

Lazy Coder Cuisine: Balsamic Sausage Chunks (aka Venetian Sausage Bites)

(note: I decided to use title prefix "Lazy Coder Cuisine" to make it easier to find food/drink recipes from the Blog -- who knows, maybe it could even lead to a virtual cook book over time!)

On my quest to find recipes with exceptionally high value/effort ratio, I came across this gem. It is basically a very simple snack, offering good taste (if non-stellar looks) with modest effort. To give credit where it is due I found the source recipe (Venetian Sausage Bites) at Food Network web site.

Basic idea is very simple:

  1. Take sausage -- uncooked sweet italian sasusage suggested (whatever is convenient package; 6 sausages recommended)
  2. Poach (cook in boiling water) for a bit (6 - 7 minutes); slice into bite-sized pieces
  3. Fry slightly on skillet, using as much olive oil as necessary (like 2 table spoons) to get bit of color, firm texture
  4. Add balsamic vinegar (1 cup; more if you want more sauce), simmer to reduce into syrupy sauce (15 - 20 minutes or s)

And the result is a very tasty cocktail snack, in about half an hour (and probably less if you are in a hurry). Results have interesting looks, charcoal black, which is bit acquired taste; but taste itself has broader appeal I think.

But wait! There is room for further simplification (as well as for variation) here. For one, you can just use cooked sausage (and obviously vary different kinds), which gets rid of the second step. And if you like bit of additional fat and salt, feel free not to discard melted fat after step 3.

It is actually kind of cool how flexible balsamic vinegar is; considering how different resulting reduction tastes compared to source material. Plus it is probably easy to further pimp the recipe -- I bet adding bit of cream could not hurt the sauce... maybe even add something like apple sauce for added hint of sweetness (best sauce for ham, for example, adds mustard, cream and apple sauce on fat melted from ham... and maybe dash of pepper).

The main consideration here is just what to serve these with, if anything. As cocktail snacks, can definitely serve them stand-alone. But other typical companions of sausages should work well too (serve with scrambled eggs?).

Related Blogs

(by Author (topics))

Powered By

About me

  • I am known as Cowtowncoder
  • Contact me
Check my profile to learn more.