Wednesday, February 04, 2009

Typed Access API tutorial, part II: arrays

It has been a while, but it's now time to continue the overview of the Typed Access API, one of the major features of Stax2 API version 3, implemented by Woodstox.

The first part of this mini-series dealt with "simple" values like integers and booleans. So let's look at the structured types that Typed Access API supports. The selection is quite limited: only 4 fundamental array types (int, long, float, double) are directly supported, but perhaps most interestingly there is also a way to easily extend this functionality to parse custom types.

The contrived example to consider this time is that of a data set that consists of a large number of rows, each with a large number of integers. This could come from a spreadsheet full of sample data or such. Traditionally you might think of storing it using a format like:

  <!-- and so on -->

But with Typed Access for arrays, you realize that you can actually make it like this instead:

 <datarow>1 5 <!-- and so on --> </datarow>

Which looks a bit better, and saves a byte or two in storage space as well.
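
To put a rough number on that, here is a small sketch that builds one row both ways and compares the sizes. The per-value wrapper element name (value) is purely hypothetical, made up for illustration; only datarow comes from the example above.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

public class RowSizeComparison {
    // Hypothetical "one element per value" layout:
    // <datarow><value>1</value><value>5</value>...</datarow>
    static String perElementRow(int[] values) {
        StringBuilder sb = new StringBuilder("<datarow>");
        for (int v : values) {
            sb.append("<value>").append(v).append("</value>");
        }
        return sb.append("</datarow>").toString();
    }

    // Compact layout: <datarow>1 5 ...</datarow>
    static String compactRow(int[] values) {
        StringJoiner sj = new StringJoiner(" ", "<datarow>", "</datarow>");
        for (int v : values) {
            sj.add(Integer.toString(v));
        }
        return sj.toString();
    }

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] row = {1, 5, 12, 7};
        System.out.println("per-element: " + utf8Size(perElementRow(row)) + " bytes");
        System.out.println("compact:     " + utf8Size(compactRow(row)) + " bytes");
    }
}
```

The exact savings depend on the wrapper element name and the magnitude of the values, so treat the numbers as illustrative only.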

1. Reading numeric arrays

So how would we read such data? And, regarding this example, what should we do with the data? Due to my limited skills in statistics, let's just calculate 3 simple(st) aggregates available: minimum value, maximum value, and total sum.

InputStream in = new FileInputStream("data.xml");
TypedXMLStreamReader sr = (TypedXMLStreamReader) XMLInputFactory.newInstance().createXMLStreamReader(in);
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
int total = 0;
sr.nextTag(); // dataset
int[] buffer = new int[20];
// let's loop over all <datarow> elements:
while (sr.nextTag() == XMLStreamConstants.START_ELEMENT) { // ends when we hit </dataset>
  // loop to get all int values for the row
  int count;
  while ((count = sr.readElementAsIntArray(buffer, 0, buffer.length)) > 0) {
    for (int i = 0; i < count; ++i) {
      int sample = buffer[i];
      total += sample;
      min = Math.min(min, sample);
      max = Math.max(max, sample);
    }
  }
  // once there are no more samples, we'll be pointing to matching END_ELEMENT, as per javadocs
}
sr.close();
in.close();
// and there we have it

So far so good: we just need a buffer to read into, and we can read numeric element content in chunks. With attributes the code is even simpler, since the whole array is returned with a single call (this is because attribute values are inherently non-streamable).

2. Writing numeric arrays

So where would we get this data? Ah, let me come up with something... hmmh, why, yes, how about someone gave us a spreadsheet as a CSV (comma-separated value) file? That'll work. So, given this file, we could convert that into xml and... well, have some sample code to show. Sweet!

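  BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"));
  OutputStream out = new FileOutputStream("data.xml");
  String line;

  TypedXMLStreamWriter sw = (TypedXMLStreamWriter) XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
  // wrapper elements match what the reading example expects
  sw.writeStartDocument();
  sw.writeStartElement("dataset");
  while ((line = r.readLine()) != null) {
    String[] tokens = line.split(","); // assume comma as separator
    int[] values = new int[tokens.length];
    for (int i = 0; i < values.length; ++i) {
      values[i] = Integer.parseInt(tokens[i].trim());
    }
    sw.writeStartElement("datarow");
    sw.writeIntArray(values, 0, values.length);
    sw.writeEndElement();
  }
  sw.writeEndElement();
  sw.writeEndDocument();
  sw.close();
  out.close();
  r.close();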

And there we have that, too. Simple? About the only additional thing worth noting is that we could have done outputting of int arrays in multiple steps too, if the incoming rows were very large. It is perfectly fine to call sw.writeIntArray() multiple times consecutively.
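
To make the "multiple steps" idea concrete, here is a rough sketch of chunked output. Note that it deliberately uses only plain javax.xml.stream from the JDK (writeCharacters() with hand-written separators) as a stand-in, so that it runs without Woodstox on the classpath; the class name, method names, and chunk size are all made up for illustration. With a TypedXMLStreamWriter, each chunk would instead be one sw.writeIntArray(values, offset, length) call.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class ChunkedRowWriter {
    // Writes values[offset .. offset+length) as space-separated text.
    // 'firstChunk' controls whether a leading separator is needed.
    static void writeChunk(XMLStreamWriter sw, int[] values, int offset, int length,
                           boolean firstChunk) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < offset + length; ++i) {
            if (!firstChunk || i > offset) {
                sb.append(' ');
            }
            sb.append(values[i]);
        }
        sw.writeCharacters(sb.toString());
    }

    // Writes one <datarow> element, at most 'chunkSize' values per call.
    static String writeRow(int[] values, int chunkSize) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        sw.writeStartElement("datarow");
        for (int offset = 0; offset < values.length; offset += chunkSize) {
            int length = Math.min(chunkSize, values.length - offset);
            writeChunk(sw, values, offset, length, offset == 0);
        }
        sw.writeEndElement();
        sw.flush();
        sw.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeRow(new int[] {1, 2, 3, 4, 5}, 2));
    }
}
```

The point is only that the element content is built up across several calls, so the full row never needs to be held in a single array or String.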

3. Reading arrays of custom types

And now let's consider the feature that might be the most interesting aspect of Typed Access API array handling: the ability to plug in custom decoders. Just as with simple values (with which you can use TypedXMLStreamReader.getElementAs(TypedValueDecoder)), there is a specific method (TypedXMLStreamReader.readElementAsArray(TypedArrayDecoder)) that acts as the extension point.

One possibility is to use one of the existing simple value decoders (from package org.codehaus.stax2.ri.typed; inner classes of ValueDecoderFactory); this would allow implementing an accessor for, say, QName[] or boolean[]. But for simplicity, let's write our own EnumSet decoder: a decoder that can decode a set of enumerated values into a container; for example, colors using their canonical names. We'll do it like so:

class ColorDecoder
  extends TypedArrayDecoder
{
  public enum Color { FOO, BAR, OTHER };
  EnumSet<Color> colors = EnumSet.noneOf(Color.class);
  public boolean decodeValue(char[] buffer, int start, int end) {
    return decodeValue(new String(buffer, start, end-start));
  }
  public boolean decodeValue(String input) {
    // would also be very easy to call a standard TypedValueDecoder here
    colors.add(Color.valueOf(input)); // canonical name to enum value
    return false; // never full
  }
  public int getCount() { return colors.size(); }
  public boolean hasRoom() { return true; } // never full

  // Note: needed, but not part of TypedArrayDecoder
  EnumSet<Color> getColors() { return colors; }
}

And to use it, we would just do something like:

  TypedXMLStreamReader sr = ...;
  ColorDecoder dec = new ColorDecoder();
  sr.readElementAsArray(dec); // reader must point to the START_ELEMENT to decode
  EnumSet<Color> colors = dec.getColors();

And obviously one can easily create sets of commonly needed decoders to essentially create semi-automated xml data binding libraries.
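
For readers who want to see the mechanism in isolation, here is a small JDK-only sketch of the general idea: the content is split on whitespace and each token is handed to a pluggable callback. The TokenDecoder interface and the decodeAll() helper are hypothetical names invented for this sketch; they only mimic the shape of the callback and are not the actual Stax2 classes.

```java
import java.util.EnumSet;

public class EnumSetDecoding {
    public enum Color { FOO, BAR, OTHER }

    // Hypothetical callback, loosely modeled on a per-token array decoder.
    interface TokenDecoder {
        // Returns true when the decoder has no room for further values.
        boolean decodeValue(String token);
    }

    // Splits content on whitespace and feeds each token to the decoder.
    // Returns the number of tokens that were decoded.
    static int decodeAll(String content, TokenDecoder decoder) {
        int count = 0;
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for empty content
            }
            ++count;
            if (decoder.decodeValue(token)) {
                break; // decoder is full
            }
        }
        return count;
    }

    // Decoder that collects canonical enum names into an EnumSet; never full.
    static EnumSet<Color> decodeColors(String content) {
        EnumSet<Color> colors = EnumSet.noneOf(Color.class);
        decodeAll(content, token -> {
            colors.add(Color.valueOf(token));
            return false;
        });
        return colors;
    }

    public static void main(String[] args) {
        System.out.println(decodeColors(" FOO OTHER\nFOO "));
    }
}
```

A library of such decoders (dates, enums, QNames and so on) could share the same tokenizing loop, which is roughly what the stream reader does on your behalf.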

4. Benefits of Array Access using Typed Access API

Now that we know what can be done and how, it is worth considering one important question: why? What are the benefits of using Typed Access API over alternatives like get-element-value-parse-yourself? Consider the following:

  • Allows use of more compact representation: space-separated values, instead of wrapping each individual value in an element.
  • Faster, not only due to compactness (which in itself helps a lot), but also due to more optimal access Woodstox gives to raw data.
  • Lower memory usage for large data sets: since array access is chunked, memory usage is only relative to size of chunks. You can handle gigabyte sized data files with modest memory usage -- something no other standard API (or, for that matter, any non-standard API I am aware of) on Java platform allows!
  • More readable xml: compact representation generally improves readability.
  • With pluggable decoders you can build simple reusable datatype libraries, while still adding very little processing overhead.

And these are just benefits compared to other Stax-based approaches. Benefits over, say, accessing data via DOM trees (*) are significantly higher.

(*) although note that you can actually use Stax2 Typed Access API on DOM trees, by constructing TypedXMLStreamReader from DOMSource, using Woodstox 4!
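
For comparison, here is roughly what the get-element-value-parse-yourself alternative looks like with plain javax.xml.stream from the JDK. This is only a sketch under assumptions: the class and method names are made up, and it expects the same dataset/datarow layout as the earlier reading example. Note how each row is first materialized as one String and then split into further Strings, which is exactly the overhead chunked array access avoids.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ParseYourselfBaseline {
    // Returns {min, max, total} over all values in all <datarow> elements.
    static long[] aggregate(String xml) throws XMLStreamException {
        XMLStreamReader sr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, total = 0;
        sr.nextTag(); // <dataset>
        while (sr.nextTag() == XMLStreamConstants.START_ELEMENT) { // each <datarow>
            // The whole row is held in memory as a single String here;
            // afterwards the reader points to the matching END_ELEMENT.
            String text = sr.getElementText().trim();
            if (text.isEmpty()) {
                continue;
            }
            for (String token : text.split("\\s+")) {
                int sample = Integer.parseInt(token);
                total += sample;
                min = Math.min(min, sample);
                max = Math.max(max, sample);
            }
        }
        sr.close();
        return new long[] {min, max, total};
    }

    public static void main(String[] args) throws XMLStreamException {
        long[] r = aggregate(
            "<dataset><datarow>1 5 3</datarow><datarow>-2 10</datarow></dataset>");
        System.out.println("min=" + r[0] + " max=" + r[1] + " total=" + r[2]);
    }
}
```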
