Saturday, December 10, 2011

Sorting large data sets in Java using Java-merge-sort

When sorting data sets in Java, life is easy if the amount of data to process is not huge: the JDK has basic sorting covered well. But if your data is too big to fit in memory, you are on your own.

This often means that developers fall back on the basic Unix 'sort' command-line tool. While it handles basic textual sorting well -- and, combined with other Unix pipeline tools, a whole range of column-based variations -- it is limited in two sometimes crucial aspects:

  1. Defining custom sorting (collation) order is difficult
  2. Interacting with external tools (including 'sort') from within the JVM is inherently difficult

But there is a less well-known alternative: a relatively new open source Java library hosted on Github, java-merge-sort.

1. What is java-merge-sort

The java-merge-sort library implements a basic external merge sort, the sorting algorithm typically used for disk-backed sorting. Input and output are not limited to files; any java.io.InputStream / java.io.OutputStream implementation will work just fine.

The library is designed to work both as an ad-hoc tool (in fact, the jar itself can be used as a 'sort' tool) and as a component of bigger data processing systems.

Notable features include:

  • Fully customizable input and output handlers, used for reading external data into objects to be sorted and writing them back out (handlers defined by providing factories that create instances)
  • Optional custom comparators (if items read do not implement Comparable)
  • Configurable merge factor (number of inputs merged in each pass); max memory usage (which limits length of pre-sort segments -- more memory used, fewer rounds needed)
  • Configurable temporary file handling (defaults to JDK temporary files, deleted after use)
  • Ability to cancel sorting jobs asynchronously

2. Using as command-line tool

A simple way to use the library is as a stand-alone command-line tool; while there are no specific benefits over the standard 'sort' command (assuming one is available), it can be used to test functionality. Usage is as simple as:

  java -jar java-merge-sort-[VERSION].jar [input-file]

where 'input-file' is optional (if it is missing, input is read from standard input), and sorted output is written to standard output.
Commonly one will then redirect output to a file:

  java -jar java-merge-sort-[VERSION].jar unsorted.txt > sorted.txt

Under the hood, this will run code from class com.fasterxml.sort.std.TextFileSorter

which is both a concrete sorter implementation and defines a main() method so it can act as a command-line tool.
Sorting is done line by line, using a basic lexicographic (roughly alphabetic) sort, which works for common encodings like ASCII, Latin-1 and UTF-8.
The command limits memory usage to 50% of the maximum heap.
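
Since the pre-sort buffer is sized relative to the JVM heap, the simplest way to give the tool more (or less) working memory is to adjust the heap size, for example:

  java -Xmx256m -jar java-merge-sort-[VERSION].jar unsorted.txt > sorted.txt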

3. Simple programmatic usage: textual file sort

More commonly, java-merge-sort is used as a component of a bigger processing system. So let's have a look at basic usage as a 'sort' replacement, i.e. sorting text files.

Code to sort an input file into an output file is:

  public void sort(File in, File out) throws IOException
  {
    TextFileSorter sorter = new TextFileSorter(new SortConfig()
        .withMaxMemoryUsage(20 * 1000 * 1000)); // use up to 20 megs
    sorter.sort(new FileInputStream(in), new FileOutputStream(out));
    // note: sort() will close the InputStream and OutputStream after sorting
  }

which uses the default configuration except for maximum memory usage (the default of 40 megs often works just fine).
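
And since sort() accepts arbitrary streams, the same sorter can just as well read from standard input and write to standard output; a minimal sketch (the memory limit is again just an example):

  TextFileSorter sorter = new TextFileSorter(
      new SortConfig().withMaxMemoryUsage(20 * 1000 * 1000));
  sorter.sort(System.in, System.out); // both streams are closed when sorting completes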

4. Advanced usage: sort JSON files

The above example showed one benefit -- easy integration from Java code -- but the real power comes from the ability to change input and output handlers to deal with all kinds of data and to support advanced sorting behavior. To demonstrate this, let's consider a case where the input is a file of JSON entries: each line contains a JSON Object like:

{ "firstName" : "Joe", "lastName" : "Plumber", "age":58 }

which we want to sort primarily by age, from lowest to highest, and then by name, alphabetically: first by last name, then by first name.
We can bind this to a Java class like:


  public class Person implements Comparable<Person>
  {
    public int age;
    public String firstName, lastName;

    public int compareTo(Person other) {
     int diff = age - other.age;
     if (diff == 0) {
      diff = lastName.compareTo(other.lastName);
      if (diff == 0) {
       diff = firstName.compareTo(other.firstName);
      }
     }
     return diff;
    }
  }

using the Jackson JSON processor, and then sort entries using java-merge-sort.
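
As a quick illustration of the binding itself (a minimal sketch, reusing the sample line from above), Jackson can read a single entry like this:

  ObjectMapper mapper = new ObjectMapper();
  Person joe = mapper.readValue(
      "{ \"firstName\" : \"Joe\", \"lastName\" : \"Plumber\", \"age\":58 }",
      Person.class);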

Code to do this is a bit more complicated; let's start with the Sorter implementation:


import java.io.*;

import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.type.JavaType;

import com.fasterxml.sort.*;
import com.fasterxml.sort.std.StdComparator;

public class JsonPersonSorter extends Sorter<Person>
{
  public JsonPersonSorter() throws IOException {
    this(new SortConfig(), new ObjectMapper());
  }

  public JsonPersonSorter(SortConfig config, ObjectMapper mapper) throws IOException {
    this(mapper.constructType(Person.class), config, mapper);
  }

  public JsonPersonSorter(JavaType entryType, SortConfig config, ObjectMapper mapper) throws IOException {
    super(config, new ReaderFactory(mapper.reader(entryType)),
      new WriterFactory<Person>(mapper),
      new StdComparator<Person>());
  }
}

and the supporting reading-related classes are:
public class ReaderFactory extends DataReaderFactory<Person>
{
  private final ObjectReader _reader;
  public ReaderFactory(ObjectReader r) {
    _reader = r;
  }

  @Override
  public DataReader<Person> constructReader(InputStream in) throws IOException {
    MappingIterator<Person> it = _reader.readValues(in);
    return new Reader<Person>(it);
  }
}

public class Reader<E> extends DataReader<E>
{
  protected final MappingIterator<E> _iterator;
 
  public Reader(MappingIterator<E> it) { _iterator = it; }

  @Override
  public E readNext() throws IOException {
    if (_iterator.hasNext()) {
      return _iterator.nextValue();
    }
    return null;
  }

  // not a good estimate, but it has to do for now (should count String lengths)
  @Override
  public int estimateSizeInBytes(E item) { return 100; }

  @Override
  public void close() throws IOException { } // auto-closes when we reach the end
}


and the writing-related classes:
public class WriterFactory<W> extends DataWriterFactory<W>
{
  protected final ObjectMapper _mapper;

  public WriterFactory(ObjectMapper m) {
    _mapper = m;
  }

  @Override
  public DataWriter<W> constructWriter(OutputStream out) throws IOException {
    return new Writer<W>(_mapper, out);
  }
}

public class Writer<E> extends DataWriter<E>
{
  protected final ObjectMapper _mapper;
  protected final JsonGenerator _generator;

  public Writer(ObjectMapper mapper, OutputStream out) throws IOException {
    _mapper = mapper;
    _generator = _mapper.getJsonFactory().createJsonGenerator(out);
  }

  @Override
  public void writeEntry(E item) throws IOException {
    _mapper.writeValue(_generator, item);
    // not 100% necessary, but for readability, add linefeeds
    _generator.writeRaw('\n');
  }

  @Override
  public void close() throws IOException {
    _generator.close();
  }
}

So with all of the above, we could sort a file using:

  JsonPersonSorter sorter = new JsonPersonSorter();
  sorter.sort(new FileInputStream(inputFile), new FileOutputStream(outputFile));

which is pretty much identical to the earlier code for sorting a text file, just with a different reader and writer configuration.

5. Even more advanced: compress intermediate files?

There are many ways to customize processing; one interesting idea is to compress intermediate files (results of the pre-sort, inputs to later merge rounds), preferably using an ultra-fast Java compressor like Ning LZF.

Code to do this would not be long -- it's just a matter of changing the DataReaderFactory and DataWriterFactory to wrap the streams they read and write -- but I will leave the details as an exercise to the reader. :-)
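
As a rough sketch of the writer side (assuming the ning-compress library is available; note that with this approach the final output would be LZF-compressed too, since the same factory is used for every write), a decorating factory could look something like this:

import java.io.*;

import com.ning.compress.lzf.LZFOutputStream;

import com.fasterxml.sort.*;

// A decorating factory (hypothetical, not part of the library): wraps another
// DataWriterFactory so that everything it writes goes through LZF compression.
public class LzfWriterFactory<T> extends DataWriterFactory<T>
{
  private final DataWriterFactory<T> _delegate;

  public LzfWriterFactory(DataWriterFactory<T> delegate) {
    _delegate = delegate;
  }

  @Override
  public DataWriter<T> constructWriter(OutputStream out) throws IOException {
    // compress everything written through the delegate's writer
    return _delegate.constructWriter(new LZFOutputStream(out));
  }
}

The reading side would be the mirror image: wrap the InputStream in a com.ning.compress.lzf.LZFInputStream before handing it to the delegate DataReaderFactory.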

6. More speed: configurations

There are two main configuration switches that can be used to improve speed (see the sketch after this list):

  1. Amount of memory used for pre-sorting: the more memory used, the fewer sorted segments are needed -- in fact, it may be possible to do the whole sort in memory. The default is 40 megabytes (to accommodate the default JDK max heap size of 64 megs)
  2. Number of inputs merged per round: the default is 16 inputs, which should usually be enough; you can increase this to reduce the number of merge rounds needed (or decrease it to minimize the number of open files, in case you run into problems)
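
For example, a minimal sketch that raises the pre-sort memory limit (the 100 meg figure is just an illustration; withMaxMemoryUsage is the same setting used earlier in this post):

  // use up to 100 megs for pre-sorting; with enough memory the whole sort
  // may complete in a single in-memory pass
  TextFileSorter sorter = new TextFileSorter(
      new SortConfig().withMaxMemoryUsage(100 * 1000 * 1000));
  // (the merge factor is configurable via SortConfig as well; check its
  // javadocs for the exact method)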

7. Future ideas

Looking at the JSON sorting code, I realize it would be easy to create a generic sorter that uses Jackson. And not only would this support sorting JSON files, but also files in any other format Jackson supports, such as Smile (out of the box, with 'SmileFactory'), XML, CSV and BSON!
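
For example, since the two-argument constructor above accepts any ObjectMapper, sorting Smile-encoded Person entries could look roughly like this (a sketch; the file names are made up, and the jackson-smile module is assumed to be on the classpath):

  // org.codehaus.jackson.smile.SmileFactory produces/consumes Smile instead of JSON text
  ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
  JsonPersonSorter sorter = new JsonPersonSorter(new SortConfig(), smileMapper);
  sorter.sort(new FileInputStream("people.smile"), new FileOutputStream("people-sorted.smile"));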
