Faster compression on Java platform with LZF
One open source project that I have worked on a bit lately is compress-lzf, a simple Java library that implements block- and stream-accessible versions of the LZF compression algorithm. The LZF algorithm itself is very simple, and mostly (only?) documented here, in the form of a working reference implementation that doubles as documentation.
Our implementation is a straightforward rewrite of the C version, and fully compatible with the reference implementation, including the file header and block signatures (this is different from the other existing Java version that projects like H2 include). While simple, it is not a naive translation; some effort has been spent on optimizing its performance.
And as should be the case for such a low-level library, there are no external dependencies beyond JDK 1.5, meaning it should work on a variety of platforms, including Android (in fact, it could probably be compiled on any 1.x JDK).
1. What is LZF?
The LZF compression algorithm is basic Lempel-Ziv with little additional processing; so one way to think of it is as "gzip without Huffman encoding" (gzip uses Deflate, which is Lempel-Ziv followed by variable-bit-length Huffman encoding).
And Lempel-Ziv is just a method whereby one finds repeating sub-sequences in the data, replacing sequences of three or more "already seen" bytes with a back-reference; matches are typically found with a hash table indexed by the hash of the first 3 bytes, each value being the offset of the last instance of a potentially matching sequence.
It is worth noting that many other similar (and sometimes similarly named) algorithms likewise use basic LZ; "FastLZ", for example, is very similar.
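To make the hash-table-based matching concrete, here is a toy sketch of the idea described above. This is NOT the actual LZF format or implementation, just an illustration: matches of 3+ bytes are found via a hash table keyed on the first 3 bytes, with each slot remembering the offset of the last sequence that hashed there, and back-references are printed in a human-readable form.

```java
public class LzSketch {
    // Toy illustration of basic Lempel-Ziv matching -- not the LZF format.
    static String compressSketch(byte[] input) {
        int[] hashTable = new int[1 << 12]; // 3-byte hash -> last offset seen
        java.util.Arrays.fill(hashTable, -1);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos + 2 < input.length) {
            int hash = (((input[pos] & 0xFF) << 16)
                    | ((input[pos + 1] & 0xFF) << 8)
                    | (input[pos + 2] & 0xFF)) % hashTable.length;
            int candidate = hashTable[hash];
            hashTable[hash] = pos;
            if (candidate >= 0
                    && input[candidate] == input[pos]
                    && input[candidate + 1] == input[pos + 1]
                    && input[candidate + 2] == input[pos + 2]) {
                int len = 3; // found 3 matching bytes; extend as far as possible
                while (pos + len < input.length
                        && input[candidate + len] == input[pos + len]) {
                    len++;
                }
                out.append("[back ").append(pos - candidate)
                   .append(", len ").append(len).append(']');
                pos += len;
            } else {
                out.append((char) input[pos++]); // no match: emit literal byte
            }
        }
        while (pos < input.length) { // trailing bytes too short to match
            out.append((char) input[pos++]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compressSketch("abcabcabcabc".getBytes()));
        // prints: abc[back 3, len 9]
    }
}
```

Note that the match in the example overlaps itself (distance 3, length 9); such overlapping copies are valid and common in LZ-family codecs.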
2. Gzip exists, why bother with LZF?
But why does this matter? If gzip does all this and more, why would you want to use a "lesser" algorithm?
The key benefit is speed: LZF can be up to twice as fast to uncompress; but where it REALLY shines is compression speed -- here it can be 2x or 3x as fast (writing a stand-alone benchmark is on my to-do list, so hopefully I will get to say more about this in the future). The performance difference is mostly due to the relatively high overhead of Deflate's non-byte-aligned output (which is needed for optimal Huffman encoding of codewords) -- while it definitely helps compression, it is expensive in terms of CPU usage.
The LZF compression rate is lower than gzip's, but the compression profile is similar: things that compress well with gzip also compress well with LZF, and things that compress poorly do so with both. This makes sense, given that most of the compression typically comes from the LZ algorithm, which both implement similarly.
So it all comes down to this: gzip compression is very CPU-intensive (read: slow); so much so that compressing something like JSON data can take 5x as long as writing the content itself. With fast networks and I/O, compression often becomes the bottleneck.
LZF can be a good alternative since its CPU overhead is much more modest; and it can still compress "fluffy" content pretty well -- where gzip might compress something by, say, 90%, LZF usually reaches a 75-80% ratio (this obviously varies by type of content, depending basically on which part of the Deflate algorithm is more effective).
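To get a rough feel for the gzip side of this trade-off using nothing but the JDK (measuring LZF itself would need the library), here is a minimal sketch that compresses a repetitive, JSON-like payload with `java.util.zip.Deflater`. The payload and sizes here are illustrative, not benchmark results:

```java
import java.util.zip.Deflater;

public class DeflateCost {
    // Compress with Deflate (the algorithm gzip uses) and return the
    // compressed size. For eyeballing cost vs. ratio only -- not a
    // rigorous benchmark.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // output buffer sized with headroom for incompressible input
        byte[] output = new byte[input.length + input.length / 10 + 64];
        int len = deflater.deflate(output);
        deflater.end();
        return len;
    }

    public static void main(String[] args) {
        // Build a repetitive, JSON-like payload
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("{\"id\":").append(i)
              .append(",\"name\":\"user-").append(i).append("\"},");
        }
        byte[] input = sb.toString().getBytes();

        long start = System.nanoTime();
        int size = compressedSize(input);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(input.length + " bytes -> " + size + " bytes in " + ms + " ms");
    }
}
```

A proper comparison would run both codecs over identical, realistic corpora with warm-up iterations; the point here is only that Deflate's cost is easy to observe even on friendly input.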
3. Getting and using compress-lzf

Assuming the above sounds interesting, you can get the compress-lzf library either from GitHub or via Maven. In the latter case, the information is:
- current version at time of this writing is "0.6.0"
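For Maven builds, the dependency declaration would look something like the following (group and artifact ids here are my best recollection -- double-check the coordinates against the project page):

```xml
<dependency>
  <groupId>com.ning</groupId>
  <artifactId>compress-lzf</artifactId>
  <version>0.6.0</version>
</dependency>
```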
And from there, you can use the classes from the com.ning.compress.lzf package:
- LZF is a simple command-line utility you can use for manual compression/uncompression (the jar defines this class as its Main-Class, so you can just invoke it with "java -jar compress-lzf.jar" to get usage information)
- LZFInputStream and LZFOutputStream are compressing streams -- input (reads LZF-compressed data) and output (writes LZF-compressed data) -- similar to the JDK's GZIPInputStream and GZIPOutputStream
- LZFEncoder and LZFDecoder are block-based codecs that can be used to encode (compress) and decode (uncompress) individual chunks (byte arrays) of data
Both streaming and block-based interfaces have their uses: for example, when storing BLOBs in a database, the block API works better; and when reading data over the network or from the filesystem, streams make more sense.
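As a sketch of what usage of the two APIs might look like (method names are from the 0.6 API as I recall it -- consult the javadocs for the exact signatures):

```java
import java.io.*;
import com.ning.compress.lzf.*;

public class LzfUsage {
    public static void main(String[] args) throws IOException {
        byte[] data = "some reasonably compressible data...".getBytes();

        // Block API: compress and uncompress a single chunk in memory
        byte[] compressed = LZFEncoder.encode(data);
        byte[] restored = LZFDecoder.decode(compressed);

        // Stream API: wrap any OutputStream/InputStream, gzip-style
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream out = new LZFOutputStream(bytes);
        out.write(data);
        out.close();

        InputStream in = new LZFInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        // ... read uncompressed data from 'in' as from any InputStream
    }
}
```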
3.1 Note: avoid LZFOutputStream.flush() ("just let it float...")
One note on LZFOutputStream: be careful with calling flush() -- just like other block compressors (including gzip), LZF can only make back-references within a single block; and since LZFOutputStream.flush() forces the current block to be closed, frequent flushing can dramatically reduce the compression rate. It is best to avoid flush() if at all possible, unless you actually want to create a logical boundary at some point.
4. Future plans for the library
While there are always improvements to make in any Open Source project, I think the natural scope for an LZF library is quite small. As with Java Uuid Generator, it is quite possible that there won't be a need for major changes to the implementation or API; which means that future improvements are most likely in the area of integration: helping other projects use LZF.
For example, given the unique CPU/IO/bandwidth trade-off LZF offers, it could make a really nice component for large-scale data transfer and processing, for example:
- Data processing with map/reduce (Hadoop)
- Data storage (with frameworks like jDBI, Hibernate; or with NoSQL projects like Voldemort, Cassandra)
- Data aggregation (logging, log processing, Scribe)
- Caching: use LZF with memcached?
- Combination of high-performance data formats (Jackson Smile, binary JSON serialization) and compression -- which can be combined with the other use cases above
So it will be interesting to see where things go in the near future.
5. On benefits of working for a company that is comfortable with Open Source
Unlike with many earlier Open Source projects that I have worked on, much of the work that went into this library was done on company time. It was also a team effort (Jon wrote the stream handling, I wrote the block codec), compared to most other open source projects where I have been the main author. So I have come a long way: having worked for a semi-OS-hostile company, then a bit for a neutral (agnostic?) one, and now working for one that is actually engaged with the community (Ning).
And I must admit that I really like that Ning is openly committed to working with open source projects, not just using open source artifacts that others make. This is not solely meant as an ad for Ning (although we are indeed hiring...); I just think it is important to give credit where it is due.
Beyond this, the Ning section at GitHub already has some public projects, and there are literally a dozen more being cleaned up for publishing. Within just a few more months, I think Ning will be better recognized for its contributions in the Open Source space, and eventually become a significant player. I see this as a major benefit, making it a bit less difficult to hire the best developers -- especially for a company that is still relatively small, and thus has less name recognition than the big boys.