It has been a while, but it's now time to continue the overview of the Typed
Access API, one of the major features of Stax2 API version 3, as implemented by
Woodstox.
The first part of this mini-series dealt with "simple" values like integers
and booleans. So let's look at the structured types that the Typed Access API
supports. The selection is quite limited: arrays of only 4 fundamental types
(int, long, float, double) are directly supported. But perhaps most
interestingly, there is also a way to easily extend this functionality to
parse arrays of custom types.
The contrived example to consider this time is that of a data set that
consists of a large number of rows, each with a large number of integers.
This could come from a spreadsheet full of sample data or some such.
Traditionally you might think of storing it using a format like:
<dataset>
  <datarow>
    <data>1</data>
    <data>5</data>
    <!-- and so on -->
  </datarow>
</dataset>
But with Typed Access for arrays, you realize that you can actually make
it like this instead:
<dataset>
  <datarow>1 5 <!-- and so on --></datarow>
</dataset>
Which looks a bit better, and saves a byte or two in storage space as
well.
1. Reading numeric arrays
So how would we read such data? And, regarding this example, what should
we do with the data? Due to my limited skills in statistics, let's just
calculate the 3 simplest aggregates available: minimum value, maximum
value, and total sum.
InputStream in = new FileInputStream("data.xml");
TypedXMLStreamReader sr = (TypedXMLStreamReader) XMLInputFactory.newInstance().createXMLStreamReader(in);
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
int total = 0;
sr.nextTag(); // <dataset>
int[] buffer = new int[20];
// let's loop over all <datarow> elements; ends when we hit </dataset>
while (sr.nextTag() == XMLStreamConstants.START_ELEMENT) {
    // inner loop reads all int values of the row, one chunk at a time
    int count;
    while ((count = sr.readElementAsIntArray(buffer, 0, buffer.length)) > 0) {
        for (int i = 0; i < count; ++i) {
            int sample = buffer[i];
            total += sample;
            min = Math.min(min, sample);
            max = Math.max(max, sample);
        }
    }
    // once there are no more samples, we are pointing to the matching END_ELEMENT, as per javadocs
}
sr.close();
in.close();
// and there we have it
// and there we have it
So far so good: we just need a buffer to read into, and we can read
numeric element content in. With attributes the code is even simpler, since
the whole array is returned by a single call (this because
attribute values are inherently non-streamable).
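For example, given an attribute like values="1 5 9", a typed attribute accessor such as getAttributeAsIntArray(int) hands you the whole int[] in one call. The decoding involved is, conceptually, nothing more than whitespace-splitting and parsing, as in this plain-Java sketch (the class and method names here are made up for illustration, not part of any API):

```java
import java.util.Arrays;

public class AttributeIntArrayDemo {
    // Conceptual equivalent of decoding an attribute value like
    // "1 5 9" into an int[] in a single call: trim, split on
    // whitespace runs, parse each token.
    static int[] decodeIntArray(String lexical) {
        String trimmed = lexical.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] tokens = trimmed.split("\\s+");
        int[] result = new int[tokens.length];
        for (int i = 0; i < tokens.length; ++i) {
            result[i] = Integer.parseInt(tokens[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "[1, 5, 9]"
        System.out.println(Arrays.toString(decodeIntArray(" 1 5 9 ")));
    }
}
```

The real typed reader of course does this more efficiently (no intermediate String[]), but the lexical rules are the same.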
2. Writing numeric arrays
So where would we get this data? Ah, let me come up with something...
hmmh, why, yes, how about someone gave us a spreadsheet as a CSV
(comma-separated values) file? That'll work. So, given this file, we
could convert it into xml and... well, have some sample code to show.
Sweet!
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"));
OutputStream out = new FileOutputStream("data.xml");
TypedXMLStreamWriter sw = (TypedXMLStreamWriter) XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
sw.writeStartDocument();
sw.writeStartElement("dataset");
String line;
while ((line = r.readLine()) != null) {
    sw.writeStartElement("datarow");
    String[] tokens = line.split(","); // assume comma as separator
    int[] values = new int[tokens.length];
    for (int i = 0; i < tokens.length; ++i) {
        values[i] = Integer.parseInt(tokens[i]);
    }
    sw.writeIntArray(values, 0, values.length);
    sw.writeEndElement(); // </datarow>
}
sw.writeEndElement(); // </dataset>
sw.writeEndDocument();
sw.close();
r.close();
And there we have that, too. Simple? About the only additional thing
worth noting is that we could have output the int arrays in
multiple steps too, if the incoming rows were very large: it is
perfectly fine to call sw.writeIntArray() multiple times
consecutively.
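The reason consecutive calls compose so cleanly is that the lexical output is simply the values separated by spaces. A conceptual sketch of that encoding (my own illustration, not the actual Woodstox implementation):

```java
public class IntArrayEncoderDemo {
    // Conceptual sketch of the lexical form a typed writer produces
    // for an int array: values joined by single spaces,
    // e.g. {1, 5, 9} becomes "1 5 9".
    static String encodeIntArray(int[] values, int offset, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < offset + length; ++i) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "1 5 9"
        System.out.println(encodeIntArray(new int[] { 1, 5, 9 }, 0, 3));
    }
}
```

Since each chunk is just more space-separated tokens appended to the element content, writing one big array or several smaller slices produces equivalent content.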
3. Reading arrays of custom types
And now let's consider the feature that might be the most interesting
aspect of the Typed Access API's array handling: the ability to plug in custom
decoders. Just as with simple values (for which you can use
TypedXMLStreamReader.getElementAs(TypedValueDecoder)), there is
a specific method (TypedXMLStreamReader.readElementAsArray(TypedArrayDecoder))
that acts as the extension point.
One possibility is to use one of the existing simple value decoders (from
package org.codehaus.stax2.ri.typed; inner classes of
ValueDecoderFactory); this would allow implementing an accessor for,
say, QName[] or boolean[]. But for simplicity, let's write our own
EnumSet decoder: a decoder that can decode a set of enumerated values into a
container; for example, colors by their names. We'll do it
like so:
class ColorDecoder
    extends TypedArrayDecoder
{
    public enum Color { RED, GREEN, BLUE }

    EnumSet<Color> colors = EnumSet.noneOf(Color.class);

    @Override
    public boolean decodeValue(char[] buffer, int start, int end) {
        return decodeValue(new String(buffer, start, end-start));
    }

    @Override
    public boolean decodeValue(String input) {
        // would also be very easy to call a standard TypedValueDecoder here
        colors.add(Color.valueOf(input));
        return false; // false means "not full yet"
    }

    @Override
    public int getCount() { return colors.size(); }

    @Override
    public boolean hasRoom() { return true; } // never full

    // Note: needed by calling code, but not part of TypedArrayDecoder
    EnumSet<Color> getColors() { return colors; }
}
And to use it, we would just do something like:
TypedXMLStreamReader sr = ...;
ColorDecoder dec = new ColorDecoder();
sr.readElementAsArray(dec);
EnumSet<Color> colors = dec.getColors();
And obviously one can easily create sets of commonly needed decoders, to
essentially build semi-automated xml data binding libraries.
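To make the decoder contract concrete, here is a self-contained sketch that mimics, in miniature, what readElementAsArray does with a decoder: tokenize the text content and feed tokens to the decoder one at a time, until input runs out or the decoder reports it is full. The interface and class names below are simplified stand-ins of my own, not the real Stax2 types:

```java
import java.util.ArrayList;
import java.util.List;

public class DecoderContractDemo {
    // Simplified stand-in for the array decoder contract: the reader
    // hands over one lexical token at a time; decodeValue returns
    // true once the decoder is full and wants no more tokens.
    interface SimpleArrayDecoder {
        boolean decodeValue(String token); // true -> full
        int getCount();
    }

    // A decoder that collects ints into a growable container,
    // analogous in spirit to the EnumSet-collecting ColorDecoder above
    static class IntListDecoder implements SimpleArrayDecoder {
        final List<Integer> values = new ArrayList<>();
        public boolean decodeValue(String token) {
            values.add(Integer.parseInt(token));
            return false; // never full
        }
        public int getCount() { return values.size(); }
    }

    // Stand-in for the reader side: split element text on whitespace
    // and feed each token to the decoder
    static int feed(String elementText, SimpleArrayDecoder dec) {
        for (String token : elementText.trim().split("\\s+")) {
            if (dec.decodeValue(token)) {
                break; // decoder is full, stop feeding
            }
        }
        return dec.getCount();
    }

    public static void main(String[] args) {
        IntListDecoder dec = new IntListDecoder();
        // prints "3"
        System.out.println(feed("1 5 9", dec));
    }
}
```

The real reader additionally streams the text content in chunks instead of materializing it as one String, but the division of labor between reader and decoder is the same.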
4. Benefits of Array Access using Typed Access API
Now that we know what can be done and how, it is worth considering one
important question: why? What are the benefits of using the Typed Access
API over alternatives like get-the-element-text-and-parse-it-yourself?
Consider the following:
- Allows use of a more compact representation: space-separated values,
instead of wrapping each individual value in an element.
- Faster, not only due to compactness (which in itself helps a lot), but
also due to the more optimal access Woodstox has to the raw data.
- Lower memory usage for large data sets: since array access is chunked,
memory usage is only proportional to the size of the chunks. You can handle
gigabyte-sized data files with modest memory usage, something no
other standard API (or, for that matter, any non-standard API I am
aware of) on the Java platform allows!
- More readable xml: compact representation generally improves
readability.
- With pluggable decoders, one can build simple reusable datatype libraries,
while still adding very little processing overhead.
And these are just the benefits compared to other Stax-based approaches.
The benefits over, say, accessing data via DOM trees (*) are significantly
greater.
(*) Although note that you can actually use the Stax2 Typed Access API on
DOM trees, by constructing a TypedXMLStreamReader from a DOMSource, using
Woodstox 4!