Saturday, December 10, 2011

Sorting large data sets in Java using Java-merge-sort

When sorting data sets in Java, life is easy if the amount of data to process is not huge: the JDK has basic sorting covered well. But if your data is too big to fit in memory, you are on your own.

This often means that developers use the basic Unix 'sort' command-line tool. But while it is a good tool for basic textual sorting -- and, when combined with other Unix pipeline tools, for a whole range of column-based variations -- it is limited in two sometimes crucial aspects:

  1. Defining custom sorting (collation) order is difficult
  2. Interacting with external tools (including 'sort') from within JVM is inherently difficult
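To make the first point concrete: inside the JVM, a custom collation order is just a Comparator. Here is a minimal sketch using the JDK's java.text.Collator; the class name CollationExample and the sample words are made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationExample {
    /** Returns a sorted copy of the input, ignoring case differences. */
    static List<String> sortIgnoringCase(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength: base letters matter, case and accents do not
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, collator); // Collator is a Comparator
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortIgnoringCase(Arrays.asList("cherry", "Banana", "apple")));
    }
}
```

Plain String ordering would put "Banana" first, since uppercase letters sort before lowercase ones. Getting such a comparator applied to data that does not fit in memory is the harder part.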

But there is a less well-known alternative: a relatively new open source Java library available from GitHub, java-merge-sort.

1. What is java-merge-sort

The java-merge-sort library implements a basic external merge sort, the sorting algorithm typically used for disk-backed sorting. Input and output are not limited to files; any java.io.InputStream / java.io.OutputStream implementation will work just fine.
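For readers who have not seen the algorithm, here is a rough sketch of an external merge sort over text lines, written against plain JDK classes. It is not the library's code: the class name ExternalSortSketch and the maxLinesInMemory parameter are invented for this example, and it merges all runs in a single pass rather than in rounds limited by a merge factor.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.*;

final class ExternalSortSketch {
    /** One open run plus its current line, so runs can sit in a priority queue. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;
        Run(File f) throws IOException {
            reader = Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8);
            head = reader.readLine();
        }
        boolean advance() throws IOException { return (head = reader.readLine()) != null; }
        public int compareTo(Run other) { return head.compareTo(other.head); }
    }

    /** Sorts lines from 'in' to 'out', holding at most maxLinesInMemory lines at once. */
    static void sort(BufferedReader in, Writer out, int maxLinesInMemory) throws IOException {
        // Phase 1 (pre-sort): cut input into sorted runs, each written to a temp file
        List<File> runs = new ArrayList<File>();
        List<String> buffer = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() >= maxLinesInMemory) runs.add(writeRun(buffer));
        }
        if (!buffer.isEmpty()) runs.add(writeRun(buffer));

        // Phase 2 (merge): repeatedly emit the smallest head line among all runs
        PriorityQueue<Run> queue = new PriorityQueue<Run>();
        for (File f : runs) {
            Run r = new Run(f);
            if (r.head != null) queue.add(r); else r.reader.close();
        }
        while (!queue.isEmpty()) {
            Run r = queue.poll();
            out.write(r.head);
            out.write('\n');
            if (r.advance()) queue.add(r); else r.reader.close();
        }
        out.flush();
        for (File f : runs) f.delete();
    }

    /** Sorts the buffered lines, writes them to a new temp file, and empties the buffer. */
    private static File writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        File f = File.createTempFile("sort-run-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(f.toPath(), StandardCharsets.UTF_8)) {
            for (String s : buffer) { w.write(s); w.newLine(); }
        }
        buffer.clear();
        return f;
    }

    /** Convenience wrapper for small in-memory inputs (handy for trying the sketch out). */
    static String sortLines(String text, int maxLinesInMemory) {
        StringWriter out = new StringWriter();
        try {
            sort(new BufferedReader(new StringReader(text)), out, maxLinesInMemory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The two phases map onto the terms used in the rest of this entry: phase 1 produces the pre-sort segments, and phase 2 is a merge round.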

The library is designed to work as an ad-hoc tool (in fact, the jar itself can be used as a 'sort' tool) as well as a component of bigger data processing systems.

Notable features include:

  • Fully customizable input and output handlers, used for reading external data into objects to be sorted and writing them back out (handlers defined by providing factories that create instances)
  • Optional custom comparators (if items read do not implement Comparable)
  • Configurable merge factor (number of inputs merged in each pass) and maximum memory usage (which limits the length of pre-sort segments: the more memory used, the fewer rounds needed)
  • Configurable temporary file handling (defaults to using JDK default temp files, deletions)
  • Ability to cancel sorting jobs asynchronously

2. Using as command-line tool

A simple way to use the library is as a stand-alone command tool; while there are no specific benefits over standard 'sort' command (assuming one is available), it can be used to test functionality. Usage is as simple as:

  java -jar java-merge-sort-[VERSION].jar [input-file]

where 'input-file' is optional (if it is missing, will read from standard input); and sorted output will be displayed to standard output.
Commonly one will then redirect output to a file:

  java -jar java-merge-sort-[VERSION].jar unsorted.txt > sorted.txt

Under the hood, this will run code from class com.fasterxml.sort.std.TextFileSorter

which is both a concrete sorter implementation, and defines main() method to act as a command-line tool.
Sort will be done line-by-line, using basic lexicographic (~= alphabetic) sort which works for common encodings like ASCII, Latin-1 and UTF-8.
Command will limit memory usage to 50% of maximum heap.

3. Simple programmatic usage: textual file sort

More commonly java-merge-sort is used as a component of bigger processing system. So let's have a look at basic usage as 'sort' replacement, i.e. sorting text files.

Code to sort an input file into output file is:

  public void sort(File in, File out) throws IOException
  {
    TextFileSorter sorter = new TextFileSorter(new SortConfig().withMaxMemoryUsage(20 * 1000 * 1000)); // use up to 20 megs
    sorter.sort(new FileInputStream(in), new FileOutputStream(out));
    // note: sort() will close InputStream, OutputStream after sorting
  }

which uses the default configuration except for maximum memory usage (the default is 40 megs, which often works just fine).

4. Advanced usage: sort JSON files

Above example showed one benefit -- easy integration from Java code -- but the real power comes from the fact that we can change input and output handlers to deal with all kinds of data, to support advanced sorting behavior. To demonstrate this, let's consider case where input is a file that contains JSON entries: each line contains a JSON Object like:

{ "firstName" : "Joe", "lastName" : "Plumber", "age":58 }

and which we want to sort primarily by age, from lowest to highest, and then by name, alphabetically: first by last name, then by first name.
We can bind this to a Java class like:


  public class Person implements Comparable<Person>
  {
    public int age;
    public String firstName, lastName;

    public int compareTo(Person other) {
     int diff = age - other.age;
     if (diff == 0) {
      diff = lastName.compareTo(other.lastName);
      if (diff == 0) {
       diff = firstName.compareTo(other.firstName);
      }
     }
     return diff;
    }
  }

using Jackson JSON processor, and then sort entries using java-merge-sort.
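If the class cannot implement Comparable, the same ordering can be expressed as a standalone Comparator and passed to the sorter instead. The following is a sketch using Java 8's Comparator combinators; the nested Person class is a hypothetical stand-in for the one above so that the example is self-contained. A side benefit is that comparingInt avoids the integer overflow that subtraction-based comparison can run into with extreme values.

```java
import java.util.Comparator;

final class PersonOrdering {
    /** Stand-in for the Person class above, repeated here only to keep the example self-contained. */
    static final class Person {
        final String firstName;
        final String lastName;
        final int age;

        Person(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
        }
    }

    /** Age ascending, then last name, then first name. */
    static final Comparator<Person> BY_AGE_THEN_NAME =
        Comparator.comparingInt((Person p) -> p.age)
            .thenComparing(p -> p.lastName)
            .thenComparing(p -> p.firstName);
}
```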

Code for the sorter itself is a bit more complicated; let's start with the Sorter implementation:


import java.io.*;

import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.type.JavaType;

import com.fasterxml.sort.*;
import com.fasterxml.sort.std.StdComparator;

public class JsonFileSorter extends Sorter<Person>
{
  public JsonFileSorter() throws IOException {
    this(new SortConfig(), new ObjectMapper());
  }

  public JsonFileSorter(SortConfig config, ObjectMapper mapper) throws IOException {
    this(mapper.constructType(Person.class), config, mapper);
  }

  public JsonFileSorter(JavaType entryType, SortConfig config, ObjectMapper mapper) throws IOException {
    super(config, new ReaderFactory(mapper.reader(entryType)),
      new WriterFactory<Person>(mapper),
      new StdComparator<Person>());
  }
}

and supporting reading-related classes are:
public class ReaderFactory extends DataReaderFactory<Person>
{
  private final ObjectReader _reader;
  public ReaderFactory(ObjectReader r) {
    _reader = r;
  }

  @Override
  public DataReader<Person> constructReader(InputStream in) throws IOException {
    MappingIterator<Person> it = _reader.readValues(in);
    return new Reader<Person>(it);
  }
}

public class Reader<E> extends DataReader<E>
{
  protected final MappingIterator<E> _iterator;
 
  public Reader(MappingIterator<E> it) { _iterator = it; }

  @Override
  public E readNext() throws IOException {
    if (_iterator.hasNext()) {
      return _iterator.nextValue();
    }
    return null;
  }

  // not a good estimation, has to do for now (should count String lengths, estimate)
  @Override
  public int estimateSizeInBytes(E item) {
    return 100;
  }

  @Override
  public void close() throws IOException {
    // auto-closes when we reach end
  }
}


and writing-related classes:
static class WriterFactory<W> extends DataWriterFactory<W>
{
  protected final ObjectMapper _mapper;

  public WriterFactory(ObjectMapper m) {
    _mapper = m;
  }

  @Override
  public DataWriter<W> constructWriter(OutputStream out) throws IOException {
    return new Writer<W>(_mapper, out);
  }
}

static class Writer<E> extends DataWriter<E>
{
  protected final ObjectMapper _mapper;
  protected final JsonGenerator _generator;

  public Writer(ObjectMapper mapper, OutputStream out) throws IOException {
    _mapper = mapper;
    _generator = _mapper.getJsonFactory().createJsonGenerator(out);
  }

  @Override
  public void writeEntry(E item) throws IOException {
    _mapper.writeValue(_generator, item);
    // not 100% necessary, but for readability, add linefeeds
    _generator.writeRaw('\n');
  }

  @Override
  public void close() throws IOException {
    _generator.close();
  }
}

So with all of the above, we could sort a file using:

  JsonFileSorter sorter = new JsonFileSorter();
  sorter.sort(new FileInputStream(inputFile), new FileOutputStream(outputFile));

Which is pretty much identical to earlier code to sort a File; just with different reader+writer configuration.

5. Even more advanced: compress intermediate files?

There are many ways to customize processing; and one interesting idea is to actually compress intermediate files (results of pre-sort, inputs to later merge rounds); preferably using ultra-fast Java compressor like Ning LZF.

Code to do this would not be long -- it's just a matter of changing DataReaderFactory and DataWriterFactory to read/write compressed files -- but I will leave this as an exercise for the reader. :-)
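As a hint rather than a solution, the compression part itself comes down to wrapping streams. The sketch below uses the JDK's gzip classes as a stand-in for LZF (slower, but with no extra dependency); the class and method names are invented, and wiring the wrappers into the reader and writer factories is deliberately not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressedTempStreams {
    /** Wraps the stream a sorted segment is written to, so bytes are stored compressed. */
    static OutputStream compressing(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw);
    }

    /** Wraps the stream a segment is read back from during a merge round. */
    static InputStream decompressing(InputStream raw) throws IOException {
        return new GZIPInputStream(raw);
    }

    /** Writes data through the compressing wrapper and reads it back, as a sanity check. */
    static byte[] roundTrip(byte[] data) {
        try {
            ByteArrayOutputStream stored = new ByteArrayOutputStream();
            try (OutputStream out = compressing(stored)) {
                out.write(data);
            }
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (InputStream in = decompressing(new ByteArrayInputStream(stored.toByteArray()))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    restored.write(buf, 0, n);
                }
            }
            return restored.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```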

6. More speed: configurations

There are two main configuration switches that can be used to improve speed:

  1. Amount of memory used for pre-sorting: the more memory used, the fewer sorted segments are needed -- in fact, it may be possible to do the whole sort in memory. Default memory to use is 40 megabytes (to accommodate the default JDK max heap size of 64 megs)
  2. Number of inputs merged per round: default is 16 inputs, which should be enough; but you can increase this to reduce number of merge rounds needed (or reduce if you want to minimize number of open files, in case you encounter problems)
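To get a feel for how the two settings interact, here is a back-of-the-envelope estimate in code. It is a simplified model, not something the library exposes: it assumes every pre-sorted segment is exactly as large as the memory limit, and the class and method names are made up.

```java
final class MergeRoundEstimate {
    /** Number of pre-sorted segments: dataBytes / memoryBytes, rounded up. */
    static long segments(long dataBytes, long memoryBytes) {
        if (memoryBytes < 1) throw new IllegalArgumentException("memoryBytes must be positive");
        return (dataBytes + memoryBytes - 1) / memoryBytes;
    }

    /** Rounds needed to get from 'segments' inputs down to one, merging up to mergeFactor inputs at a time. */
    static int mergeRounds(long segments, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be at least 2");
        int rounds = 0;
        while (segments > 1) {
            segments = (segments + mergeFactor - 1) / mergeFactor; // each round shrinks the count
            rounds++;
        }
        return rounds;
    }
}
```

With 1,000,000,000 bytes of input and the default 40,000,000 bytes of memory, this gives 25 segments, which the default merge factor of 16 reduces in 2 rounds (25, then 2, then 1). Raising memory to 100,000,000 bytes gives 10 segments and a single round.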

7. Future ideas

Looking at JSON sorting code, I realize that it would be easy to create a generic sorter that uses Jackson. And not only would this support sorting JSON files, but also files that use any other format Jackson supports, such as Smile (out of the box, with 'SmileFactory'), XML, CSV and BSON!

Wednesday, September 28, 2011

Advanced filtering with Jackson, Json Filters

I wrote a bit earlier on "filtering properties with Jackson". While it was comprehensive in that all main methods of filtering were covered, there wasn't much depth. Specifically, only very basic usage of Json Filters (@JsonFilter annotation, SimpleFilterProvider as provider) was considered. This approach does allow more dynamic filtering than, say, @JsonView, but it is still somewhat limited. So let's consider more advanced customizability.

1. Refresher on Json Filters

Ok, so the basic idea with Json Filters is that:

  1. Classes can have an associated Filter Id, which defines logical filter to use.
  2. A provider is needed to get the actual filter instance to use, given id: this will be configured by assigning a FilterProvider (such as 'SimpleFilterProvider') to ObjectMapper or ObjectWriter.
  3. Jackson will dynamically (and efficiently) resolve the filter a given class uses, allowing per-call reconfiguration of filtering.

From this it is clear that there are 2 main things you can configure: mechanism that is used to find Filter id of a given class, and mechanism used for mapping this id to actual filter used (implementation of which can be as complicated as you want).
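The two-step lookup can be modeled in a few lines of plain Java. To be clear, this is a toy model and not Jackson's API: a "filter" here is just a predicate on property names, and all names (FilterLookupModel, idFinder, includes) are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class FilterLookupModel {
    // Step 1: class -> filter id (in this toy model, any function will do)
    private final Function<Class<?>, Object> idFinder;
    // Step 2: filter id -> filter (here simply a predicate on property names)
    private final Map<Object, Predicate<String>> filtersById = new HashMap<Object, Predicate<String>>();

    FilterLookupModel(Function<Class<?>, Object> idFinder) {
        this.idFinder = idFinder;
    }

    FilterLookupModel addFilter(Object id, Predicate<String> filter) {
        filtersById.put(id, filter);
        return this;
    }

    /** True if the named property of the given class should be written out. */
    boolean includes(Class<?> type, String propertyName) {
        Object id = idFinder.apply(type);
        if (id == null) {
            return true; // class has no filter id: nothing is filtered
        }
        Predicate<String> filter = filtersById.get(id);
        // simplification: an id with no registered filter lets everything through
        return filter == null || filter.test(propertyName);
    }
}
```

Either half can be swapped without touching the other, which is the point of the split: the function that finds ids knows nothing about filters, and the map of filters knows nothing about classes.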

So let's have a look at both parts.

2. Configuring mapping from id to filter instance

Of mechanisms, latter one may be easier to understand and use: one just has to implement 'FilterProvider', which has but one method to implement:

  public abstract class FilterProvider {
    public abstract BeanPropertyFilter findFilter(Object filterId);
  }

Given this, 'SimpleFilterProvider' is little more than a Map<String,BeanPropertyFilter>, except for adding a couple of convenience factory methods that build 'SimpleBeanPropertyFilter' instances given property names, so you typically just instantiate one with calls like:

  SimpleBeanPropertyFilter filter = SimpleBeanPropertyFilter.filterOutAllExcept("a");

which would filter out all properties except for the one named "a". This filter is then configured with ObjectMapper like so:

  FilterProvider fp = new SimpleFilterProvider().addFilter("onlyAFilter", filter);
  objectMapper.writer(fp).writeValueAsString(pojo);

which would, then, apply to any Java type configured to use filter with id "onlyAFilter".

3. Configuring discovery of filter id

From above example we know we need to indicate classes that are to use our "onlyAFilter". The default mechanism is to use:

  @JsonFilter("onlyAFilter")
  public class FilteredPOJO {
    //...
  }

But this is just the default. How so? The way Jackson figures out its annotation-based configuration is actually indirect, and fully customizable: all interaction is through configured 'AnnotationIntrospector' object, which amongst other things defines this method:

  public Object findFilterId(AnnotatedClass ac);

which is called when serializer needs to determine id of the filter to apply (if any) for given class. Since the default implementation (org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector) has everything else working fine, what we can do is to sub-class it and override this method.
For example:

  public class MyFilteringIntrospector extends JacksonAnnotationIntrospector
  {
    @Override
    public Object findFilterId(AnnotatedClass ac) {
      // First, let's consider @JsonFilter by calling superclass
      Object id = super.findFilterId(ac);
      // but if not found, use our own heuristic; say, just use class name as filter id, if there's "Filter" in name:
      if (id == null) {
        String name = ac.getName();
        if (name.indexOf("Filter") >= 0) {
          id = name;
        }
      }
      return id;
    }
  }

Above functionality is just to show what is possible, not that it makes sense. Alternatively you could of course define your own annotations to check; or keep a List of known class names; or check the class definition or the interfaces the type implements. The main point is just that you are not limited to using the @JsonFilter annotation, but can use pretty much any logic you want, within the limits of your coding skills.

The only caveat is that the resolution from Class to matching id is only guaranteed to be called once per ObjectMapper; so any variation in filtering of specific class needs to happen at either mapping of id to filter, or within filter itself.

4. Don't be afraid of sub-classing (Jackson)AnnotationIntrospector

Actually, the key takeaway might as well be the fact that AnnotationIntrospector is designed to be customizable. It was initially created to allow easy reuse of JAXB annotations (via JAXBAnnotationIntrospector; combining things with AnnotationIntrospector.Pair); but it is also a very powerful general-purpose customization mechanism, although at this point quite an underused one.

5. Addendum

Some additional notes based on feedback I received:

  • Custom BeanPropertyFilter implementations are obviously powerful too: not only can they completely change what (if anything) gets written for property, they can base this on all configuration accessible via SerializerProvider which is passed to serializeAsField(): for example, it can check to see what serialization view is available by calling 'provider.getSerializationView()'.

Monday, April 04, 2011

Introducing "jvm-compressor-benchmark" project

I recently started a new open source project, this time inspired by the success of another OS project I have been involved in: the "jvm-serializers" benchmark, originally started by Eishay and built by a community of Java databinder/serializer experts. What has been great about this project is the amount of energy it seems to feed back into development of serializers: highly visible competition for best performance seems to have improved efficiency of libraries a lot. I only wish we had historical benchmark data to compare, to see exactly how far the fastest Java serializers have come.

Anyway, I figured that there are other groups of libraries where high performance matters, but where there is lack of actual solid benchmarking information. So while there are a few compression performance benchmarks, they are often non-applicable for Java developers: partly because they just compare native compressor codecs, and partly because focus is more often only on space-efficiency (how much compression is achieved) with little consideration of performance of compression. The last part is particularly frustrating as in many use cases there is significant trade-off between space and time efficiency (compression rate vs time used for compression).
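The trade-off is easy to observe even with JDK classes alone. The sketch below is not part of the benchmark project; it just compresses the same made-up sample data with java.util.zip.Deflater at its fastest and its strongest setting and prints size and elapsed time. Single-shot timings like this are noisy, so treat the numbers as a rough indication only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class DeflateLevels {
    /** Compresses data at the given level (Deflater.BEST_SPEED up to Deflater.BEST_COMPRESSION). */
    static byte[] deflate(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish(); // no more input will follow
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release native resources
        return out.toByteArray();
    }

    /** Repetitive JSON-like sample text; real measurements should use data you care about. */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("{\"id\":").append(i).append(",\"name\":\"entry-").append(i % 37).append("\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        System.out.println("input: " + data.length + " bytes");
        for (int level : new int[] { Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION }) {
            long start = System.nanoTime();
            int size = deflate(data, level).length;
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + size + " bytes in " + micros + " us");
        }
    }
}
```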

So, this is where the new project -- "jvm-compressor-benchmark" -- comes from. I hope it will allow fair comparison of compression codecs available on the JVM, to be used by Java and other JVM languages; and also bring in some friendly competition between developers of compression codecs.

First version compares half a dozen of compression formats and codecs, from the venerable deflate/gzip (which offers pretty good compression ratio with decent speed) to higher-compression-but-slower-operation alternatives (bzip2) and lower-compression-but-very-fast alternatives like lzf, quicklz and the new kid on the block, Snappy (via JNI).

And although the best way to evaluate results is to run tests on your machine, using data sets you care about (which I strongly encourage!), the project wiki does have some pretty diagrams for tests run on "standard" data sets gathered from the web.

Anyway: please check the project out -- at the very least it should give you an idea of how many options there are above and beyond basic JDK-provided gzip.

ps. Contributions are obviously also welcome -- anyone willing to tackle Java version of 7-zip's LZMA, for example, would be most welcome!

Saturday, March 12, 2011

Non-blocking XML parsing with Aalto 0.9.7

Aalto XML processor (see home page) is known for two things:

  1. It is the fastest Java-based XML parser available (for example, see jvm-serializers benchmark, or this comparison); both for Stax and SAX parsing
  2. It is the only open-source Java parser that can do non-blocking parsing (aka asynchronous, or async, parsing)

Former is relatively easy to figure out: given that Aalto implements two standard low-level Java streaming parsing APIs -- Stax and SAX -- you can easily switch Aalto in place of Woodstox or Xerces and see how fast it is. For many common types of XML data, it is almost exactly twice as fast for parsing as Woodstox (which itself is generally faster than alternatives like Xerces/SAX); and it is also bit faster for writing XML content.

But non-blocking parsing is more difficult to evaluate. This is because there are no other non-blocking Java XML parsers, nor real documentation for use of non-blocking part of Aalto; and also because this part of functionality has been only completed fairly recently (while some parts of functionality were written up to two years ago, last pieces were completed just for the latest official release).

So I will try to explain basic non-blocking operation here. But first, brief introduction to non-blocking parsing, using Aalto's non-blocking Stax extension. Non-blocking variant of SAX will be completed before Aalto 1.0 is released.

1. Non-Blocking / Async operation for XML

Basic feature of non-blocking parsing is that it does not rely on blocking input (InputStream or Reader). Instead of parser using a stream or reader to read content, and blocking the thread if none is available, content is rather "pushed" to parser; and parser will give out processed events if there is enough content available. This is similar to how many C parsers work; as well as operation of Java's gzip/zip/deflate codecs (java.util.zip.Deflater).
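Since the JDK's own codecs work this way, they make a convenient illustration of the push model. The sketch below feeds compressed bytes to java.util.zip.Inflater one small chunk at a time; the class name and chunking scheme are invented for the example, and in a real system the chunks would arrive from the network rather than from an array.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class PushStyleInflate {
    /** Decompresses 'compressed' by pushing it to the Inflater chunkSize bytes at a time. */
    static byte[] inflateInChunks(byte[] compressed, int chunkSize) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be positive");
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        int offset = 0;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf); // never blocks; returns 0 if it cannot produce output yet
                if (n > 0) {
                    out.write(buf, 0, n);
                } else if (inflater.needsInput()) {
                    if (offset >= compressed.length) throw new IllegalStateException("truncated input");
                    int len = Math.min(chunkSize, compressed.length - offset);
                    inflater.setInput(compressed, offset, len); // "push" the next chunk
                    offset += len;
                } else if (!inflater.finished()) {
                    throw new IllegalStateException("cannot make progress");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper that produces compressed input for trying out the method above. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

The caller owns the loop and decides when more input is handed over; the codec never waits on a stream. A return value of 0 combined with needsInput() plays the same role here as an "incomplete" event does for a parser.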

The main benefit of non-blocking operation is ability to process multiple XML input sources without having to allocate one thread per source, same benefit as that NIO has for basic web services. And in fact, having a non-blocking parser is something that could benefit non-blocking web services a lot: without such parser, services must buffer all the input before parsing, to ensure that no blocking occurs.

So why does it matter that there need not be as many threads as sources? While Java threading efficiency has improved a lot over time, it can still be hard to scale systems that use more than hundreds of threads (or low thousands; exact number depends on platform). So systems that are highly concurrent, but typically have high latencies, or highly varying workloads, can benefit from this mode of operation.
In addition, another related benefit is that memory usage of a non-blocking parser can be more closely bounded: since only a limited amount of input is buffered at any given point, the amount of working memory can be more limited (at least when not forcing coalescing of XML text segments).

On the downside, code that uses non-blocking parsing can be slightly more complex to write; and given the lack of standardized APIs, it is something new to learn. And since regular blocking I/O can scale quite well nowadays for many (or most) uses, non-blocking parsing is not something one generally starts with. But it can be a very useful technique for a subset of all XML processing use cases.

2. Non-blocking XML parsing using Aalto API

The easiest way to explain operation is probably by showing piece of sample code (lifted from Aalto unit tests). Here we will actually construct a static XML document from String (for demonstration purposes: in real systems, it would be read via NIO channels or a higher-level non-blocking abstraction), and feed it into parser, single byte at a time. In actual production use one would typically feed content block at a time; either fully read blocks, or chunks of contents as soon as they become available. Aalto does not implement higher-level buffer management (there is just one active buffer), although adding basic buffer handling would not be difficult; it just tends to be either provided by input source (Netty), or be input source specific.


  byte[] XML = "<html>Very <b>simple</b> input document!</html>".getBytes("UTF-8");
  AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
  final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
  int inputPtr = 0; // as we feed byte at a time
  int type = 0;

  do {
    // May need to feed multiple "segments"
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
      feeder.feedInput(XML, inputPtr++, 1);
      if (inputPtr >= XML.length) { // to indicate end-of-content (important for error handling)
        feeder.endOfInput();
      }
    }
    // and once we have full event, we just dump out event type (for now)
    System.out.println("Got event of type: "+type);
    // could also just copy event as is, using Stax, or do any other normal non-blocking handling:
    // xmlStreamWriter.copyEventFromReader(asyncReader, false);
  } while (type != XMLStreamConstants.END_DOCUMENT);
  asyncReader.close();

And that's it. There are actually just a few additional things needed to do non-blocking parsing:

  1. Use of regular Stax API, with just a single extension, introduction of new token, EVENT_INCOMPLETE (com.fasterxml.aalto.AsyncXMLStreamReader.EVENT_INCOMPLETE), which is returned if there isn't enough content buffered to fully construct a token to return
  2. Feeding of content using AsyncInputFeeder (instance of which is accessed via AsyncXMLStreamReader, extension of basic XMLStreamReader)
  3. Indicating end-of-content via feeder when all content has been read

Which makes operation bit more complicated than use of straight XMLStreamReader, but not significantly so.

3. Next steps

There are two things that Aalto non-blocking mode does not yet implement, which will be finished before Aalto becomes 1.0:

  • Coalescing mode has not been implemented for non-blocking Stax. Since use of coalescing (of all adjacent text segments, as per Stax spec) is probably less important for non-blocking use cases than blocking ones (as it increases the need for buffering and possibly latency), it was left as the last major piece to be completed.
  • There isn't yet a non-blocking SAX mode. This should be relatively easy to implement, and should not require extensions to the SAX API itself (one just has to call "XMLReader.parse()" multiple times); but as it is based on the same parser core as Stax mode, it has not yet been completed.

At this point what is needed most is actual usage: while there is some test coverage, non-blocking mode is less well tested than blocking mode, which can use the full basic StaxTest suite, used successfully for years with Woodstox (and for Aalto for more than a year as well).

Monday, February 28, 2011

Jackson: not just for JSON, Smile or BSON any more -- Now With XML, too!

One of first significant new Jackson extension projects (result of Jackson 1.7 release which made it much easier to provide modular extensions) is jackson-xml-databind, hosted at GitHub. Although this extension is still in its pre-1.0 development phase, the latest released version is fully usable as is and is even in some limited production use by some brave developers (running on Google AppEngine, of all things!).

So it is probably a good idea to now give a brief overview of what this project is all about.

1. What is jackson-xml-databind?

Jackson-xml-databind comes in a small package (jar is only about 55 kB), and is used with Jackson data binding functionality (jackson-mapper jar). It provides a basic replacement for the JsonFactory, JsonParser and JsonGenerator components of the Jackson Streaming API, and allows reading and writing of XML instead of JSON, in the context of generic Jackson data binding functionality. In addition, the core ObjectMapper is also sub-classed to provide customized versions of a couple of other provider types, so typically all usage is done by creating com.fasterxml.jackson.xml.XmlMapper instead of ObjectMapper, and using it for data binding.

2. What is it used for?

This package is used to read XML and convert it to POJOs, as well as to write POJOs as XML. In this respect it is very similar to JAXB (javax.xml.bind) package; and an alternative for many other Java XML data binding packages such as XStream and JibX. Given Jackson support for JAXB annotations, it can be especially conveniently used as a JAXB replacement in many cases.

Functionality supported is in some ways a subset of JAXB, and in other ways a superset: XML-specific functionality is more limited (no explicit support for XML Schema), but general data binding functionality is arguably more powerful (since it is full set of Jackson functionality).

Two obvious benefits of this package compared to JAXB or other existing XML data binding solutions (like XStream) are superior performance -- with fast Stax XML parser, this is likely the fastest data binding solution on Java platform (see jvm-serializers for results) -- and extensive and customizable data POJO conversion functionality, using all existing Jackson annotations and configuration options. The main downside currently is potential immaturity of the package; however, this only applies to interaction between mature XML packages (stax implementation) and Jackson data binder (which is also fairly mature at this point).

3. So how do I use it?

If you know how to use Jackson with JSON, you know almost everything you need to use this package. The only other thing you need to know is that there has to be a Stax XML parser/generator implementation available. While JDK 1.6 provides one implementation, your best bet is using something a bit more efficient, such as Woodstox or Aalto. Both should work fine; Aalto is the faster of the two, but Woodstox is a more mature choice. So you will probably want to include one of these Stax implementations when using jackson-xml-databind.

Other than this, all you need to do is to construct XmlMapper:

  XmlMapper mapper = new XmlMapper(); // can also specify XmlFactory to use, to override Stax factories used

and use it like you would any other ObjectMapper, like so:

  User user = new User(); // from Jackson-in-five-minutes sample
  String xml = mapper.writeValueAsString(user);

and what you would get is something like:

<User>
  <name>
    <first>Joe</first>
    <last>Sixpack</last>
  </name>
  <verified>true</verified>
  <gender>MALE</gender>
  <userImage>AQIDBAU=</userImage>
</User>

which is equivalent of JSON serialization that would look like:

{
  "name":{
    "first":"Joe",
    "last":"Sixpack"
  },
  "verified":true,
  "gender":"MALE",
  "userImage":"AQIDBAU="
}

Pretty neat eh?

Oh, and reverse direction obviously works similarly:

  User user = mapper.readValue(xml, User.class);

There is really nothing extraordinary in its usage; just another way to use Jackson for slicing and dicing your POJOs.

4. Limitations

While existing version works pretty well in general, there are some limitations. These mostly stem from the basic difference between XML and JSON logical models; and specifically affect handling of Lists/arrays. XmlMapper for example only allows so-called "wrapped" lists (for now); meaning that there is one wrapper XML element for each List or array property, and separate element for each List item.

Compared to JAXB (and related to JAXB annotation support), no DOM support is included; meaning, it is not possible to use converters that take or produce DOM Elements.

With respect to Jackson functionality, while polymorphic type information does work, some combinations of settings may not work as expected.

And given project's pre-1.0 status, testing is not yet as complete as it needs to be, so other rough edges may also be found. But with help of user community I am sure we can polish these up pretty quickly.

5. Feedback time!

So what is needed most at this point? Users, usage, and resulting bug (or, possibly, success) reports! Seriously, more usage there is, faster we can get the project up to 1.0 release.

Happy hacking!

Sunday, February 06, 2011

On prioritizing my Open Source projects, retrospect #2

(note: related to original "on prioritizing OS project", as well as first retrospect entry)

1. What was the plan again?

Ok, it has been almost 4 months since my last medium-term high-level prioritization overview. The planned list back then had these entries:

  1. Woodstox 4.1
  2. Aalto 1.0 (complete async API, impl)
  3. Jackson 1.7: focus on extensibility
  4. ClassMate 1.0
  5. Externalized Mr Bean (not dependent on Jackson)
  6. StaxMate 2.1
  7. Tr13 1.0

2. And how have we done?

Looks like we got about half of it done. Point by point:

  1. DONE: Woodstox 4.1 (with 4.1.1 patch release)
  2. Almost: Aalto 1.0 -- half-done; but significant progress, API is defined, about half of implementation work done
  3. DONE: Jackson 1.7 (with 1.7.1 and 1.7.2 patch releases)
  4. Almost: ClassMate 1.0 not completed; version 0.5.2 released, javadocs published, minor work remains
  5. Deferred: Externalized Mr Bean -- no work done (only some preliminary scoping)
  6. DONE? StaxMate 2.1 -- released 2.0.1 patch instead, which contains fixes to found issues but no new features (which would define 2.1).
  7. Some work done: Tr13: incremental work, but no definite 1.0 release (did release 0.2.5 patch version with cleanup)

I guess it is less than half since only 2 things were fully completed (or 3 if StaxMate 2.0.1 counts). But then again, of the remaining tasks only one did not progress at all; and many are close to being completed (in fact, I was hoping to wrap up Aalto before doing this update). And the ones deferred were lower entries on the list.

On the other hand, I did work on a few things that were not on the list. For example:

  • Started "jackson-xml-databinding" project (after Jackson 1.7.0), got first working version (0.5.0)
  • Started multiple other Jackson extension projects (jackson-module-hibernate, jackson-module-scala), with working builds and somewhat usable code; these based on code contributed by other Jackson developers
  • Started "java-cachemate" project, designed concept and implemented in-memory size-limited-LRU-cache (used already in a production system)

This just underlines how non-linear open source development can be; it is often opportunistic -- but not necessarily in a negative way -- and heavily influenced by feedback, as well as by newly discovered inter-dependencies and opportunities.

3. Updated list

Let's try guesstimating what to do going forward, then, shall we? Starting with leftovers, we could get something like:

  • Aalto 1.0: complete async implementation; do some marketing
  • ClassMate 1.0: relatively small amount of work (expose class annotations)
  • Java CacheMate: complete functionality, ideally release 1.0 version
  • Tr13: either complete 1.0, or augment with persistence options from cachemate (above)
  • Externalized Mr Bean? This is heavily dependent on external interest
  • Jackson 1.8: target most-wanted features (maybe external type id, multi-arg setters)
  • Jackson-xml-databinding 1.0: more testing, fix couple known issues
  • Work on Smile format; try to help with libsmile (C impl), maybe more formal specification; performance measurements, other advocacy; maybe even write a javascript codec

Other potential work could include:

  • StaxMate 2.1 with some new functionality
  • Woodstox 5.0, if there is interest (raise JDK minimum to 1.5, maybe convert to Maven build)
  • Jackson-module-scala: help drive 1.0 version, due to amount of interest in full Scala support
  • Jackson-module-csv: support data-binding to/from CSV -- perhaps surprisingly, much of "big data" exists as plain old CSV files...

But chances are that above lists are also incomplete... let's check back in May, on our first "anniversary" retrospect.

Thursday, February 03, 2011

Why do modularity, extensibility, matter?

After writing about Jackson 1.7 release, I realized that while I described what and how was done to significantly improve modularity and extensibility of Jackson, I did not talk much about why I felt both were desperately needed. So let's augment that entry with bit more background, fill in the blanks.

Two things actually go together such that while modularity in itself is somewhat useful, it is extremely important when it is coupled with extensibility (and conversely it is hard to be extensible without being modular). So I will consider them together, as "modular extensibility", in what follows.

1. Distributed development

The most obvious short-term benefit of better modularization, extensibility, is that it actually allows a simple form of distributed development, as additional extension modules (and the projects under which they are created) can be built independently from the core project. There are dependencies, of course -- modules may need certain features of the core library -- but this is much looser coupling than having to actually work within the same codebase, coordinating changes. This alone would be worth the effort.

But the need for distribution stems from the obvious challenge with Jackson's (or any similar project's) status quo: that the core project, and its author (me), can easily become a bottleneck. This is due to the coordination needed, such as code reviews and patch integration; much of which is most efficiently done with a simple stop-and-wait'ish approach. While it is possible to increase concurrency within one project and codebase (with lots of additional coordination and communication, both of which are hard if activity levels of participants fluctuate), it is much easier and more efficient to do this with separate projects.

Not all projects can take the route we are taking, since one reason such modularity is possible is the expansion of the project scope: extensions for new datatypes are "naturally modular" (conceptually at least; implementation-wise this is only now becoming true), and similarly support for non-Java JVM languages (Scala, Clojure, JRuby) and non-JSON data formats (BSON, XML, Smile). But there are many projects that could benefit from more focus on modular extensibility.

2. Reduced coupling leads to more efficient development

Reduced coupling between pieces of functionality in turn allows for much more efficient development. This is due to multiple factors: less need for coordination; efficiency in working on smaller pieces (bigger projects, like bigger companies, have much more inherent overhead and lower productivity); shorter release cycles. Or, instead of uniformly shorter development and release cycles, it is more accurate to talk about more optimal cycles: new, active projects can have shorter cycles and release more often, while more mature, slower moving ones (or ones with a more established user base and hence bigger risks from regression) can choose a slower pace. The key point is that each project can choose its own optimal rate of releases, and only synchronize when some fundamental "platform" functionality is needed.

As an example, core Jackson project has released a significant new version every 3 - 6 months. While this is pretty respectable rate in itself, it is glacial pace compared to releases for, say, "jackson-xml-databinding" module, which might release new versions on weekly basis before reaching its 1.0 version.

3. Extending and expanding community

This improved efficiency is good just in itself, but I think it will actually make it easier to extend and expand the community. Why? Because starting new projects and getting releases out faster should make it easier to join, get started and become productive, and thereby lower the threshold for participation. In fact I think that we are going to double and quadruple the number of active contributors quite soon, when everyone realizes the potential for change; how easy it is to expand functionality in a way that lets everyone share the fruits of labor. Previously the best methods have been to write a blog entry about using a feature, or maybe report a bug; but now it will be trivially easy to start playing with new kinds of reusable extension functionality.

4. Modules are the new core

Given all the benefits of the increased modularity I am even thinking of further splitting much of existing "core" (meaning all components under main Jackson project; core, mapper, xc, jax-rs, mrbean, smile) as modules. All jars except for core and mapper would themselves work as modules (or similar extensions); and many features of mapper jar could be extracted out. The main reason for doing this would actually be to allow different release cycles: jax-rs component, for example, has changed relatively little since 1.0: there is no real need to release new version of it every time there is a new mapper version. In fact, of 6 jars, mapper is the only one that is constantly changing; others have evolved at much slower pace.

But even if core components were to stay within core Jackson project, most new extension functionality to be written will be done as new modules.

Wednesday, February 02, 2011

Jackson 1.7; quest for Maximum Extensibility

At this point Jackson 1.7 has been out for almost a month (and in fact, 1.7.2 is by now the latest patch release), so it's high time to write something about this release.
1.7 turns out to be third "anything but minor" minor release in a row, which is part of the reason why I have procrastinated a bit: it is not a simple matter of just listing set of simple features, or linking the release notes page (which can be found here, for anyone interested). Rather, it makes sense to talk a bit about 1.7 development cycle.

But it is actually good that I have had some time to think about what to write, instead of rushing to document release that just happened: especially since there is now some progress that was directly germinated by this release. But more on this bit later.

1. Background

After 1.6, a whopper of a release that boasted 4 major new features and a boatload of smaller ones, the initial plan for 1.7 was to make a somewhat smaller incremental release. Beyond tackling some fixes that required API changes (and thus couldn't go in one of the 1.6.x patch releases), the focus was on the most important concern at the time: the difficulty of cleanly extending Jackson with modular extensions. So it seemed like this might be a modest incremental upgrade.

It was quickly found out that needed changes to allow modular extensibility were quite wide-spread, since information needed was not propagated through all the pieces. But the focus on a single cross-cutting concern turned out to be a good thing, so that major changes to interfaces could be done in one fell swoop and hopefully abstractions added (and changes to existing ones) will form a solid foundation for further development.

2. Aspects of extensibility

While the main goal was to improve extensibility, there are multiple kinds of changes that are needed to support proper modular extensibility. For example:

  • Changes to allow registration of bundles of new functionality in a way that it is possible to add multiple extensions that ideally do not conflict, and that need not even be aware of other extensions that may be used.
  • Retrofitting existing components and interfaces to allow clean extension (i.e. avoiding having to sub-class things)
  • Adding new extension points to replace older extension methods
  • Making existing extension points more powerful, to further reduce need for more invasive techniques (overrides with sub-classing)

Another way to consider this is to think of Jackson becoming a platform; the way a web browser can be seen as a platform to build on (via addition of plug-ins and add-ons). In fact, given new projects that support many non-JSON data formats (see below), it is not a stretch to claim that Jackson is becoming a "Java data format conversion platform" at this point.

3. New mechanism for registering extensions: Module API

The most visible new construct is the Module API. It is also amongst the simplest, since there are basically just two things a Jackson Module developer needs to learn:

  1. org.codehaus.jackson.map.Module interface, which must be implemented by one class of the module; and specifically its "setupModule(SetupContext ctxt)" method (other methods are for exposing metadata such as module version)
  2. Module.SetupContext (passed to "setupModule" method) that exposes set of extension points (methods) that module can use to register handlers it wants to add.

And from the user's point of view, it is even simpler; there is but one thing to know. To use, say, the new Jackson-guava-module (available from the FasterXML GitHub repository; provides support for reading/writing Guava data types), you will do:

  ObjectMapper mapper = new ObjectMapper();
  mapper.registerModule(new GuavaModule());

that is, add a one-line call to let module register whatever it wants to offer, via interfaces that ObjectMapper provides it.

From the above description, the definition of a Jackson module is quite simple: it is a piece of code that defines one class that implements org.codehaus.jackson.map.Module, and which registers all functionality offered by the module.
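To make the registration idea concrete without depending on Jackson jars, here is a small stand-alone model of the pattern. Every type in it (MiniModule, SetupContext, Handlers, MiniMapper) is a hypothetical stand-in that I made up for illustration; none of them is Jackson's actual API. The point it tries to show is that a module only ever sees a narrow setup context, so two modules can add handlers without knowing about each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-alone model of the module pattern; none of these types are Jackson classes.
final class ModuleSketch {

    /** Hypothetical provider of serializers: returns null if the type is not handled. */
    interface Handlers {
        Function<Object, String> findSerializer(Class<?> type);
    }

    /** Hypothetical setup context: the only thing a module gets to touch. */
    interface SetupContext {
        void addHandlers(Handlers handlers);
    }

    /** Hypothetical module contract: a name plus one setup callback. */
    interface MiniModule {
        String getModuleName();
        void setupModule(SetupContext context);
    }

    /** Hypothetical mapper that collects whatever modules register. */
    static final class MiniMapper {
        private final List<Handlers> additional = new ArrayList<>();

        void registerModule(MiniModule module) {
            // hand out a narrow context instead of exposing the mapper itself
            module.setupModule(additional::add);
        }

        String write(Object value) {
            // registered handlers are consulted in registration order
            for (Handlers handlers : additional) {
                Function<Object, String> ser = handlers.findSerializer(value.getClass());
                if (ser != null) {
                    return ser.apply(value);
                }
            }
            return String.valueOf(value); // default handling
        }
    }

    /** Builds a one-type module; real modules would usually register several handlers. */
    static MiniModule moduleFor(String name, Class<?> handled, Function<Object, String> ser) {
        return new MiniModule() {
            @Override public String getModuleName() { return name; }
            @Override public void setupModule(SetupContext context) {
                context.addHandlers(type -> handled.isAssignableFrom(type) ? ser : null);
            }
        };
    }

    static MiniMapper demoMapper() {
        MiniMapper mapper = new MiniMapper();
        mapper.registerModule(moduleFor("strings", String.class, v -> "\"" + v + "\""));
        mapper.registerModule(moduleFor("booleans", Boolean.class, v -> ((Boolean) v) ? "yes" : "no"));
        return mapper;
    }

    public static void main(String[] args) {
        MiniMapper mapper = demoMapper();
        System.out.println(mapper.write("abc"));        // "abc" (with quotes)
        System.out.println(mapper.write(Boolean.TRUE)); // yes
        System.out.println(mapper.write(42));           // 42 (no module claims Integer)
    }
}
```

Neither of the two modules knows the other exists, and values that no module claims fall through to default handling.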

3.1. Module interface: not just for extensions -- use for your own app too!

One thing worth noting is that while Module interface is really designed to allow writing of reusable third-party extensions, it actually works pretty well just for encapsulating ObjectMapper configuration and extensions that are only used by a single application, or company-wide (but not published externally). So it is a good idea to use modules, for example, when registering custom serializers and deserializers; there is no overhead and this helps in encapsulating configurability and customization in one place.

4. Modular extension points: Serializers, Deserializers

Beyond having a simple registration mechanism for extensions (which I will from here on simply refer to as "modules"), the obvious problem with extensibility has been that it was limited to the application developer being able to override default behavior, either by setting an explicit handler, or by sub-classing and replacing existing components (like SerializerFactory). True extensibility requires that it must be possible for multiple modules to add handlers without overriding each other's changes (unless they happen to truly conflict, like trying to define a handler for the same data type); the ability for modules to peacefully co-exist and co-operate without explicitly having to plan for it.

The first obvious thing was to add mechanisms for adding custom serializers and deserializers without having to replace default SerializerFactory and DeserializerFactory instances. This was done by adding new interfaces org.codehaus.jackson.map.Serializers and org.codehaus.jackson.map.Deserializers (and matching basic implementations), which just define a way for a module to provide serializers and deserializers for specific data types. These can then be registered with SerializerFactory.withAdditionalSerializers(Serializers) and DeserializerFactory.withAdditionalDeserializers(Deserializers); which is exactly what ObjectMapper exposes to modules via the SetupContext passed to the setupModule() method.

These simple extension points alone cover much of what most modules need to do: to provide specific handlers for third party libraries. And when using org.codehaus.jackson.map.module.SimpleModule (default implementation of Module), addition of these handlers is a one-line operation.

5. Modular extension points: BeanSerializerModifier, BeanDeserializerModifier

But beyond the ability to conveniently register deserializers and serializers, it was understood that the ability to modify the functioning of standard BeanSerializer and BeanDeserializer instances (things that take your POJOs, find out properties, handle annotations and pretty much do most of the magic Jackson provides) is a definite must. This is because in most cases much of the existing functionality is fine, but there is a need to tweak specific aspects of serialization or deserialization: for example, one may want to override handling of just one specific property, for a specific class of POJOs. And while annotations can configure many things well, there are limitations.

To support this, two new interfaces (and matching registration methods, added in Module.SetupContext) were added: BeanSerializerModifier and BeanDeserializerModifier.

Methods defined in these interfaces are called during (and right after) building BeanSerializer and BeanDeserializer instances; and can be used for example to:

  1. Add or remove properties to be serialized, deserialized
  2. Change the order in which properties are serialized
  3. Completely replace BeanSerializer/-Deserializer that has been built, with specified JsonSerializer/JsonDeserializer (this is often done by constructing a new BeanSerializer / BeanDeserializer, using some properties from initial serializer/deserializer)

Which pretty much means that the whole serializer and deserializer configuration and construction process can be modified; but without having to replace everything. Possibilities are unlimited.
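As a rough illustration of what "modifying without replacing" means, here is a stand-alone sketch. The PropertyModifier interface and the toy serialize() method are hypothetical and only loosely modelled on the idea behind BeanSerializerModifier; they are not Jackson's API. The default property discovery stays untouched, and modifiers get a chance to remove or reorder properties before output is produced.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringJoiner;

// Stand-alone model; PropertyModifier is a made-up hook, not a Jackson interface.
final class ModifierSketch {

    /** Hypothetical hook called after default property discovery. */
    interface PropertyModifier {
        List<String> changeProperties(Class<?> beanType, List<String> propertyNames);
    }

    static class User {
        public String name = "Bob";
        public String password = "secret";
        public int age = 30;
    }

    /** Modifier that removes one property. */
    static final PropertyModifier DROP_PASSWORD = (beanType, names) -> {
        List<String> copy = new ArrayList<>(names);
        copy.remove("password");
        return copy;
    };

    /** Modifier that changes ordering only. */
    static final PropertyModifier NAME_FIRST = (beanType, names) -> {
        List<String> copy = new ArrayList<>(names);
        if (copy.remove("name")) {
            copy.add(0, "name");
        }
        return copy;
    };

    static String serialize(Object bean, List<PropertyModifier> modifiers) {
        // "standard" discovery: public fields, sorted by name for a deterministic baseline
        List<String> names = new ArrayList<>();
        for (Field field : bean.getClass().getFields()) {
            names.add(field.getName());
        }
        Collections.sort(names);
        // each modifier sees the result of the previous one
        for (PropertyModifier modifier : modifiers) {
            names = modifier.changeProperties(bean.getClass(), names);
        }
        StringJoiner out = new StringJoiner(",", "{", "}");
        try {
            for (String name : names) {
                out.add(name + "=" + bean.getClass().getField(name).get(bean));
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PropertyModifier> none = new ArrayList<>();
        List<PropertyModifier> both = new ArrayList<>();
        both.add(DROP_PASSWORD);
        both.add(NAME_FIRST);
        System.out.println(serialize(new User(), none)); // {age=30,name=Bob,password=secret}
        System.out.println(serialize(new User(), both)); // {name=Bob,age=30}
    }
}
```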

6. Contextual configuration of serializers, deserializers

While ability to change the way bean serializers, deserializers are configured and constructed is powerful, there was one other aspect of construction process that needed revamping. Up until version 1.6, once a serializer (deserializer) was constructed for a given type, same instance was used for properties of that type. This meant that any context-specific behavior (serialization of a field of specific type being handled differently, depending on which exact property is being serialized) was hard to do; and basically could not be done from within serializer or deserializer implementation.

Consider something that would seem like a simple extension: ability to define which DateFormat to use for serializing specific properties. For example, we might want something like:

  public class Bean {
    @JsonDateFormat("yyyy-MM-dd") // hypothetical annotation, for illustration
    public Date createDate;
  }

in which the 'createDate' property would be serialized using the specified DateFormat, instead of the default DateFormat the mapper uses.

Problem is two-fold: first of all, JsonSerializer/JsonDeserializer does not get enough contextual information to do much configuration. But worse, even if it did, there would be just one instance that is used regardless of location of property. So the only way (pre-1.7) to implement such feature would be to explicitly add support within core Jackson data binder; BeanSerializerFactory and AnnotationIntrospector would need to be modified at minimum.

One obvious way to solve the problem would have been to pass contextual information during serialization/deserialization. But while this would be a powerful mechanism, it would add significant amount of overhead, especially if configuration was to be done using annotations. Instead we decided to pass this information during construction of serializer/deserializer instance; from design perspective this is compatible with the general goal of trying to gather as much information as possible during non-performance-critical phase of constructing handlers, and minimize work to be done during performance-critical serialization phase.

Specific mechanism chosen is that of defining two interfaces (ContextualSerializer, ContextualDeserializer) that serializer and deserializer instances can implement. And if they do, SerializerProvider / DeserializerProvider will first construct instance, and then call methods in new interfaces, to allow creation of contextual instances, passing information about context in form of BeanProperty instance which gives property name and access to all related annotations (as well as currently active configuration).

With this information it will be possible to support use cases such as the one explained above: in fact, unit tests used to verify functionality define trivial serializer types (like StringSerializer that can conditionally lower-case property values based on existence of a test annotation).
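The construction-time approach can be sketched with plain JDK classes. Everything below is a hypothetical model: the @DatePattern annotation, the DateSerializer class and its createContextual(Field) method are my own stand-ins, and the real ContextualSerializer interface receives a BeanProperty rather than a java.lang.reflect.Field. What the sketch tries to show is that a shared serializer is asked once per property for a contextual instance, so no annotation lookup is needed while actually writing values.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Stand-alone model of contextual serializers; not Jackson's actual interfaces.
final class ContextualSketch {

    /** Hypothetical annotation, standing in for the date format annotation idea above. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface DatePattern {
        String value();
    }

    /** Hypothetical serializer that can specialise itself per property. */
    static final class DateSerializer {
        private final String pattern;

        DateSerializer(String pattern) {
            this.pattern = pattern;
        }

        /** Called once per property while handlers are built, not on every write. */
        DateSerializer createContextual(Field property) {
            DatePattern ann = property.getAnnotation(DatePattern.class);
            return (ann == null) ? this : new DateSerializer(ann.value());
        }

        String serialize(Date value) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format.format(value);
        }
    }

    static class Bean {
        @DatePattern("yyyy-MM-dd")
        public Date createDate = new Date(0L);
        public Date modified = new Date(0L); // no annotation: default format applies
    }

    static String write(Object bean, String fieldName) {
        try {
            Field field = bean.getClass().getField(fieldName);
            DateSerializer shared = new DateSerializer("yyyy-MM-dd'T'HH:mm:ss'Z'");
            // construction phase: resolve the contextual instance for this property
            DateSerializer contextual = shared.createContextual(field);
            // serialization phase: no annotation access needed any more
            return contextual.serialize((Date) field.get(bean));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Bean bean = new Bean();
        System.out.println(write(bean, "createDate")); // 1970-01-01
        System.out.println(write(bean, "modified"));   // 1970-01-01T00:00:00Z
    }
}
```

A real implementation would cache the contextual instance per property instead of resolving it on each call; the sketch skips that to stay short.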

7. From theoretical to practical extensibility

While it has been just 4 weeks since the release, extensibility improvements outlined above have already been made good use of by multiple projects. I am aware of at least following extension projects (please let me know of others if you know):

  • bson4jackson (support for BSON format (used by MongoDB))
  • jackson-module-scala (support Scala data types) (there is also another noteworthy Scala-with-Jackson project, Jerkson)
  • jackson-module-hibernate (support lazy-loaded Hibernate types)
  • jackson-module-guava (support google Guava collection types)
  • jackson-xml-databind (support reading/writing XML instead of JSON, "mini-JAXB") -- I will definitely need to write bit more about this in near future (can't use XStream or JAXB at GAE? jackson-xml-databind actually can be -- and it is much faster than either on J2SE platform as well)

and new ones are bound to come up (there have been talks for adding Joda-module, CSV-module for example)

8. Beyond extensibility: other new features, improvements:

As important as extensibility is (along with the benefits it brings, such as new modules!), 1.7 actually contains a few other important improvements and new features that are not directly related to extensibility. Here's a quick list of the most noteworthy ones:

  • @JsonTypeInfo can now be used for properties (fields, getter/setter methods), not just types (classes) -- useful for "untyped" fields (like ones using java.lang.Object as value), so one need not enable default type information
  • Dynamic Filtering: powerful new filtering mechanism using @JsonFilter to specify filter id, ObjectMapper.filteredWriter(FilterProvider) to specify which id maps to which filter -- this is a major new feature, and I hope to write more about it too
  • Support for wrapping output within "root name" (similar to JAXB), for interoperability with other JSON tools, frameworks
  • @JsonRawValue for injecting "raw text" (such as pre-encoded JSON without re-parsing) during serialization
  • SerializedString for high-efficiency serialization of pre-encoded (quoted, utf-8 encoded) String values, property names
  • Feature to enable/disable wrapping of runtime exceptions (separately for serialization, deserialization)

Tuesday, December 21, 2010

Why 'java.lang.reflect.Type' Just Does Not Cut It as Complete Type Definition

1. Generics in Java: love and hate

Ever since Java 1.5 introduced generic types, Java developers have had strained relationship with them. On one hand, they are clearly a nice addition for static type safety of collection types; as well as make generic dispatching patterns (and fluent-style construction-by-copying-methods) possible. But on the other hand there are tricky issues introduced; mostly stemming from the infamous Type Erasure (see Java Generics FAQ if you are not familiar with it).

Generic types are especially problematic for framework and library developers. This is because although type erasure is not total -- Fields, Methods and Super-types have generic type information available from within class definitions (see "Super Type Tokens to Rescue!" for an explanation) -- available non-erased type information is offered in a nearly inedible form, as instances of "java.lang.reflect.Type" (which is implemented by Class.class, amongst other types).

2. Superficial issue: bad object hierarchy, modelling

The first obvious issue with Type is that it is not much more than a marker type, and exposes little in the way of common functionality between implementations. So the very first thing one has to do is to downcast it to one of its subtypes; and this suggests (rightly so) that the object model is not very good. The reason for such an awkward type hierarchy is probably backwards-compatibility: as Java 1.5 had to bolt on Type as a supertype of Class, and Class had been extensively used by the JDK, it may have been difficult to create any meaningful interface type to use.

But as awkward as it is to do instanceof's and downcasting, this is not the real big problem. There are some frameworks that try untangling traversal of this ugliness (like Kohsuke's Tiger Types); and coming up with a better type hierarchy is not particularly difficult.

3. The REAL problem: 'Type' only contains partial type definition

To illustrate the actual problem, let us consider following types:

  public abstract class Wrapper<T> {
    public T value;
  }
  public abstract class ListWrapper<E> extends Wrapper<List<E>> { }
  public class MyStringListWrapper extends ListWrapper<String> { }

Quick: what is type of field "value" of type MyStringListWrapper?

For seasoned Java veterans answer should come easy: it is of type "List<String>". For code that tries to determine type, an obvious procedure would be:

  1. Locate java.lang.reflect.Field representing field "value" from Wrapper.class
  2. Get its generic type using "field.getGenericType()"

Simple? Not so fast. What gets returned is an instance of Type; and more specifically an instance of java.lang.reflect.TypeVariable.
And what does TypeVariable give us? At most, upper bounds (if we had "T extends B"; lower bounds such as "? super S" exist only for wildcards, not for type variables), and... name. Bummer. Not much to go on at all.
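This is easy to verify with nothing but the JDK. The sketch below declares the three example types as nested classes (the outer class name TypeVariableSketch and the helper method are mine, added only to keep the example in one file) and looks up the field both via the base class and via the concrete subtype.

```java
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.List;

final class TypeVariableSketch {

    abstract static class Wrapper<T> {
        public T value;
    }

    abstract static class ListWrapper<E> extends Wrapper<List<E>> { }

    static class MyStringListWrapper extends ListWrapper<String> { }

    /** Returns the generic type of public field "value" as seen through the given class. */
    static Type valueFieldType(Class<?> cls) {
        try {
            return cls.getField("value").getGenericType();
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same answer no matter which class we start from: the Field belongs to Wrapper
        for (Class<?> cls : new Class<?>[] { Wrapper.class, MyStringListWrapper.class }) {
            Type type = valueFieldType(cls);
            TypeVariable<?> variable = (TypeVariable<?>) type;
            System.out.println(cls.getSimpleName() + ": type variable " + variable.getName()
                    + ", declared by " + variable.getGenericDeclaration()
                    + ", bounds: " + variable.getBounds().length);
        }
    }
}
```

Even when starting from MyStringListWrapper.class, the Field that comes back is the one declared in Wrapper, and its generic type is just the variable T with the implicit Object bound.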

The next obvious idea is to check out who declared the field (field.getDeclaringClass()), and see if we could somehow figure it out.

Turns out we can not: "Wrapper.class" has no idea -- all it knows is that there is a type parameter T. Worse, while we can figure out super types (someClass.getGenericSuperclass()), there is no way to do the opposite, as the class may be extended by multiple subtypes; and thanks to Type Erasure, there will only ever be one instance of any given class, no matter how many times it is extended with varying type parameters.

The real problem, then, is that we just do not have enough context to reliably resolve type parameters for given Methods, Fields or Constructors. In this case we would need "MyStringListWrapper.class"; from which point we could (with some work... non-trivial, but doable) unravel actual full type signature.

4. Solution: we need (more) context

From above it should be obvious that it is not enough to just hand a java.lang.reflect.Type value and expect it to tell the whole story. What is needed is context that represents classes, and more importantly, class-supertype relations where remainders of generic type information are hidden. Given this information it is possible -- although not trivially simple -- to reconstruct the full type definition of a member.
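To show that the context really is sufficient, here is a deliberately simplified resolver that uses only JDK reflection. It is a sketch under several assumptions: it follows only the superclass chain (interfaces are ignored), it leaves wildcards and generic arrays unresolved, and it returns a String description rather than a proper type object. The class and method names are hypothetical, and this is not how ClassMate or Jackson implement resolution.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

final class ResolveSketch {

    abstract static class Wrapper<T> {
        public T value;
    }

    abstract static class ListWrapper<E> extends Wrapper<List<E>> { }

    static class MyStringListWrapper extends ListWrapper<String> { }

    /** Describes a type, substituting any type variables we have bindings for. */
    static String resolve(Type type, Map<TypeVariable<?>, String> bindings) {
        if (type instanceof Class) {
            return ((Class<?>) type).getName();
        }
        if (type instanceof TypeVariable) {
            String bound = bindings.get(type);
            return (bound != null) ? bound : ((TypeVariable<?>) type).getName();
        }
        if (type instanceof ParameterizedType) {
            ParameterizedType pt = (ParameterizedType) type;
            StringJoiner args = new StringJoiner(", ", "<", ">");
            for (Type arg : pt.getActualTypeArguments()) {
                args.add(resolve(arg, bindings));
            }
            return ((Class<?>) pt.getRawType()).getName() + args;
        }
        return type.getTypeName(); // wildcards, generic arrays: left as-is in this sketch
    }

    /** Resolves the type of a public field, using the concrete class as context. */
    static String resolveFieldType(Class<?> context, String fieldName) {
        Map<TypeVariable<?>, String> bindings = new HashMap<>();
        // Walk from the concrete class upwards, binding each super-class's type parameters
        for (Class<?> c = context; c != null && c != Object.class; c = c.getSuperclass()) {
            Type sup = c.getGenericSuperclass();
            if (sup instanceof ParameterizedType) {
                ParameterizedType pt = (ParameterizedType) sup;
                TypeVariable<?>[] vars = ((Class<?>) pt.getRawType()).getTypeParameters();
                Type[] args = pt.getActualTypeArguments();
                for (int i = 0; i < vars.length; i++) {
                    bindings.put(vars[i], resolve(args[i], bindings));
                }
            }
        }
        try {
            return resolve(context.getField(fieldName).getGenericType(), bindings);
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveFieldType(MyStringListWrapper.class, "value")); // java.util.List<java.lang.String>
        System.out.println(resolveFieldType(Wrapper.class, "value"));             // T
    }
}
```

Starting from MyStringListWrapper, E is bound to String first, which in turn lets T be bound to List<String>. Starting from Wrapper there is nothing to bind, and we are back to the bare T.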

5. Detour: why do so many frameworks get this wrong?

Before presenting something better, I want to point out something interesting: most existing frameworks and APIs seem to operate under the misunderstanding that it is enough to just pass a java.lang.reflect.Type value and be done with it. JAX-RS, for example, is a really nice REST(-like) API (with good free implementations); but it passes serialization/deserialization value types as java.lang.reflect.Types (possibly together with a Class that is not context but just the type-erased equivalent of the value; which does not help a lot with resolution).

I guess the idea may have been that perhaps one should have custom implementations of Type values (which is some work as there are no public default implementations) which can then contain the missing information. This is theoretically possible, but very much impractical -- the Type you get from a Method, Field or Constructor is not enough on its own, as you need to traverse the type hierarchy; and THEN create custom implementations of GenericType... and then it just _might_ work.

But I digress; let's get back to solving the problem of properly resolving generic type information.

6. A library to handle generic type resolution: Java ClassMate

As part of implementing Jackson JSON processor, I had to solve the problem of resolving generic types of class members. It took until version 1.6 to get all (?) edge cases completely cracked, but at this point I think everything is working correctly, based on understanding the complex rat's nest of Java type information. Given this (and persistent requests from my fellow open source authors to write something like "generic Mr Bean" package), I figured that maybe I could actually write a good library that solves this problem, as well as some additional questions (yes -- there are plenty of additional problems needed when auto-discovering properties of POJOs -- but more on this on follow-up articles).

So: this is where my newest open source library -- Java ClassMate -- will hopefully make the world a slightly less brutal place for Java framework developers.

To solve case I presented above, you would need to do 2 things:

  1. Resolve POJO type -- in this case, MyStringListWrapper -- to fully resolve type hierarchy (including type parameter bindings)
  2. Resolve class members, hierarchically.

Two-step processing was chosen (instead of beginning to end) for efficiency reasons, and because there are use cases where only the first part is required (for example, to just find the parameterization of generic types -- i.e. "I have this Map type; what is the key type?").

I will not go too deep into full functionality of the package: but here is piece of code needed to handle example case above:


  // First: need to resolve actual POJO type
  TypeResolver typeResolver = new TypeResolver(); // TypeResolvers are thread-safe, reusable
  ResolvedType pojoType = typeResolver.resolve(MyStringListWrapper.class);
  // and then resolve members (fields, methods);
  MemberResolver memberResolver = new MemberResolver(typeResolver); // likewise, reusable
  // for now, use default annotation settings (== ignore), overrides (none), filtering (none)
  ResolvedTypeWithMembers bean = memberResolver.resolve(pojoType, null, null);
  // and then find field we are interested in
  for (ResolvedField field : bean.getMemberFields()) {
    if ("value".equals(field.getName())) {
      ResolvedType fieldType = field.getType(); // fully resolved: List<String>
    }
  }

ResolvedType, in contrast to java.lang.reflect.Type, has fully resolved generic type parameter information; along with some other niceties such as optional aggregation of annotations (for example, methods can "inherit" annotations from overridden version of the method from super-class or interface).

In a way, ClassMate proposes a replacement of existing JDK type hierarchy, with methods that allow constructing property type information from available "raw" information. This includes not only ability to pass raw classes (in which case generic type MUST come from super-type definitions) but also programmatically constructing types (given raw class and generic type parameterization explicitly; or by using "GenericType" which uses "Super Type Token" pattern).
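For readers who have not seen the "Super Type Token" pattern, here is a minimal JDK-only sketch of it. The TypeToken class below is hypothetical and written just for this example; ClassMate's own "GenericType" may well differ in its details. The trick is that an anonymous subclass records its type argument in its generic superclass declaration, which survives erasure and can be read back with reflection.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;

final class TypeTokenSketch {

    /** Hypothetical token class; must always be sub-classed with an explicit type argument. */
    abstract static class TypeToken<T> {
        private final Type type;

        protected TypeToken() {
            Type sup = getClass().getGenericSuperclass();
            if (!(sup instanceof ParameterizedType)) {
                throw new IllegalStateException("TypeToken must be created with a type argument");
            }
            this.type = ((ParameterizedType) sup).getActualTypeArguments()[0];
        }

        Type getType() {
            return type;
        }
    }

    static Type listOfStrings() {
        // the anonymous subclass "{ }" is what captures List<String>
        return new TypeToken<List<String>>() { }.getType();
    }

    /** Answers "I have this Map type; what is the key type?" for one fixed example. */
    static Type mapKeyType() {
        Type mapType = new TypeToken<Map<String, Integer>>() { }.getType();
        return ((ParameterizedType) mapType).getActualTypeArguments()[0];
    }

    public static void main(String[] args) {
        System.out.println(listOfStrings().getTypeName()); // java.util.List<java.lang.String>
        System.out.println(mapKeyType().getTypeName());    // java.lang.String
    }
}
```

Note that this only works when the type argument is spelled out at the creation site; a token created inside a generic method with a bare T captures only the type variable, which is the same partial-information problem described earlier.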

And this will actually be enough to figure out as much generic type information as there is to find, and write libraries that handle these types as expected; even when presented with advanced multi-level type parameterization.

7. Still There'll Be More

(but fear not, I will neither blacken your christmas, nor do anything to your door)

Since ClassMate is still in its pre-1.0 state, there are things left to complete; and maybe API can be simplified. But I would welcome all potential users to check it out at this point, since this would be perfect time to make sure use case you have is supported. I will try to write more about actual usage on my blog; ideas for things to write about and questions on how to handle a related use case would be most welcome.

Saturday, November 20, 2010

StaxMate 2.0.1 released; improved DOM-from-Stax, compatibility with default JDK 1.6 Stax implementation

Quick update from "XML world" -- in which I have spent much less time, due to explosive growth in JSON land: StaxMate 2.0.1 was just released.

1. StaxMate?

First question you might ask is "What the heck is StaxMate?". Fair enough -- given how little attention it has gotten, here is the main idea.

StaxMate is meant to offer "convenience of DOM with performance of Stax (or SAX)". Although Stax API was an improvement in usability for many use cases, it is still a rather low-level access API. StaxMate builds concept of "cursors" when reading content; and output context objects when writing content. Sample code and bit more in-depth explanation can be found from StaxMate Tutorial page; but basic idea is to offer better abstractions than simple flat event iterator. Sort of like how automatic transmission can simplify driving, compared to manual stick shift.

Working with cursors is typically similar to how DOM documents are traversed in simple top-down (recursive-descent) fashion: you start with the root element, get child elements, locate more children, textual content and so forth. The same is done with StaxMate, with just one crucial limitation: all access must be done in document order (parent first, then children, in the order they are in the XML document). If you need to retain some information, you will do it explicitly (attribute values from parents need to be accessed before child elements, for example). StaxMate will take care to synchronize access when you use child cursors, so you will never need to worry about skipping remaining siblings; you just can not access things in random order. The same is also true for the output side; although there are ways to temporarily "freeze" output, which does allow building content somewhat out-of-order, as necessary. This may be necessary for doing things like calculating parent attribute values based on content written for child elements.
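The document-order constraint is inherited from the underlying streaming API, and it can be seen with the plain Stax reader that ships with the JDK. The sketch below uses only javax.xml.stream and deliberately does not use StaxMate's cursor API; the class name, method name and sample document are made up for illustration. It shows the "flat event iterator" style that cursors are meant to wrap, including the need to grab attribute values before moving on to element content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Plain Stax from the JDK; this is the low-level style, not StaxMate's API.
final class DocumentOrderSketch {

    /** Collects "id=text" for every item element, strictly in document order. */
    static List<String> readItems(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "item".equals(reader.getLocalName())) {
                    // The attribute must be read now: once we advance, it is no longer available
                    String id = reader.getAttributeValue(null, "id");
                    // getElementText() consumes the text and moves to the matching end tag
                    String text = reader.getElementText();
                    result.add(id + "=" + text);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<root><item id=\"1\">a</item><item id=\"2\">b</item></root>";
        System.out.println(readItems(xml)); // [1=a, 2=b]
    }
}
```

Nothing from earlier in the document is kept around unless the code copies it out explicitly (as with the id attribute here), which is exactly why memory use stays low.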

The benefit of requiring access to be done in document order is that it means that there is no additional performance or memory overhead for keeping track of past content. Memory usage, therefore, is not very different from that of "raw" Stax parser or generator; same is true for performance. Overhead of DOM documents is often 3x - 5x that of streaming access; overhead of using StaxMate is typically in 10-20% range, sometimes even lower.

2. Fixes in 2.0.1

This patch release contains just 2 fixes, but both are quite important, so upgrade is strongly recommended.

First fix is to DOM-compatibility part (see "Reading DOM documents using Stax XML parser, StaxMate" for details on usage). It turns out that although building full DOM document worked fine with 2.0.0, there were issues if binding sub-trees; these issues should now be resolved.

Second fix is to interoperability with Stax parsers that do not implement the Stax2 extension API (to date, Woodstox and Aalto do implement this, but not others; most notably, Sun Sjsxp which is the default Stax parser bundled with JDK 6). Although most operations work just fine, Typed Access accessors (getting XML element text as number, boolean value, enum) could cause state updates to work incorrectly, leading to issues when accessing a sequence of typed values. This has been resolved by fixing the underlying problem in the Stax2 API reference implementation library that StaxMate depends on (version 3.0.4 of the library contains the fixes).

About me

  • I am known as Cowtowncoder