It has been a while, but it's now time to continue the overview of the Typed
Access API, one of the major features of Stax2 API version 3, as implemented by
Woodstox.
The first part of this mini-series dealt with "simple" values like integers
and booleans. So let's look at the structured types that the Typed Access API
supports. The selection is quite limited: arrays of only 4 fundamental types
(int, long, float, double) are directly supported. But perhaps most
interestingly, there is also a way to easily extend this functionality to
parse arrays of custom types.
The contrived example to consider this time is that of a data set that
consists of a large number of rows, each with a large number of integers.
This could come from a spreadsheet full of sample data or some such.
Traditionally you might think of storing it using a format like:
<dataset>
  <datarow>
    <data>1</data>
    <data>5</data>
    <!-- and so on -->
  </datarow>
</dataset>
But with Typed Access for arrays, you realize that you can actually make
it like this instead:
<dataset>
  <datarow>1 5 <!-- and so on --></datarow>
</dataset>
Which looks a bit better, and saves a byte or two in storage space as
well.
1. Reading numeric arrays
So how would we read such data? And, regarding this example, what should
we do with the data? Due to my limited skills in statistics, let's just
calculate the 3 simplest aggregates available: minimum value, maximum
value, and total sum.
InputStream in = new FileInputStream("data.xml");
TypedXMLStreamReader sr = (TypedXMLStreamReader) XMLInputFactory.newInstance().createXMLStreamReader(in);
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
int total = 0;
sr.nextTag(); // <dataset>
int[] buffer = new int[20];
// let's loop over all <datarow> elements; ends when we hit </dataset>
while (sr.nextTag() == XMLStreamConstants.START_ELEMENT) {
    // inner loop reads all int values of the row, one chunk at a time
    int count;
    while ((count = sr.readElementAsIntArray(buffer, 0, buffer.length)) > 0) {
        for (int i = 0; i < count; ++i) {
            int sample = buffer[i];
            total += sample;
            min = Math.min(min, sample);
            max = Math.max(max, sample);
        }
    }
    // once there are no more samples, we are pointing to the matching END_ELEMENT, as per javadocs
}
sr.close();
in.close();
// and there we have it
// and there we have it
So far so good: we just need a buffer to read into, and we can read
numeric element content in. With attributes the code is even simpler, since
the whole array is returned by a single call (this because
attribute values are inherently non-streamable).
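For example, given an attribute like values="1 5 9", a typed attribute accessor such as getAttributeAsIntArray(int) hands you the whole int[] in one call. The decoding involved is, conceptually, nothing more than whitespace-splitting and parsing, as in this plain-Java sketch (the class and method names here are made up for illustration, not part of any API):

```java
import java.util.Arrays;

public class AttributeIntArrayDemo {
    // Conceptual equivalent of decoding an attribute value like
    // "1 5 9" into an int[] in a single call: trim, split on
    // whitespace runs, parse each token.
    static int[] decodeIntArray(String lexical) {
        String trimmed = lexical.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] tokens = trimmed.split("\\s+");
        int[] result = new int[tokens.length];
        for (int i = 0; i < tokens.length; ++i) {
            result[i] = Integer.parseInt(tokens[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "[1, 5, 9]"
        System.out.println(Arrays.toString(decodeIntArray(" 1 5 9 ")));
    }
}
```

The real typed reader of course does this more efficiently (no intermediate String[]), but the lexical rules are the same.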
2. Writing numeric arrays
So where would we get this data? Ah, let me come up with something...
hmmh, why, yes, how about someone gave us a spreadsheet as a CSV
(comma-separated values) file? That'll work. So, given this file, we
could convert it into xml and... well, have some sample code to show.
Sweet!
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"));
OutputStream out = new FileOutputStream("data.xml");
TypedXMLStreamWriter sw = (TypedXMLStreamWriter) XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
sw.writeStartDocument();
sw.writeStartElement("dataset");
String line;
while ((line = r.readLine()) != null) {
    sw.writeStartElement("datarow");
    String[] tokens = line.split(","); // assume comma as separator
    int[] values = new int[tokens.length];
    for (int i = 0; i < tokens.length; ++i) {
        values[i] = Integer.parseInt(tokens[i]);
    }
    sw.writeIntArray(values, 0, values.length);
    sw.writeEndElement(); // </datarow>
}
sw.writeEndElement(); // </dataset>
sw.writeEndDocument();
sw.close();
r.close();
And there we have that, too. Simple? About the only additional thing
worth noting is that we could have output the int arrays in
multiple steps too, if the incoming rows were very large: it is
perfectly fine to call sw.writeIntArray() multiple times
consecutively.
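The reason consecutive calls compose so cleanly is that the lexical output is simply the values separated by spaces. A conceptual sketch of that encoding (my own illustration, not the actual Woodstox implementation):

```java
public class IntArrayEncoderDemo {
    // Conceptual sketch of the lexical form a typed writer produces
    // for an int array: values joined by single spaces,
    // e.g. {1, 5, 9} becomes "1 5 9".
    static String encodeIntArray(int[] values, int offset, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < offset + length; ++i) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "1 5 9"
        System.out.println(encodeIntArray(new int[] { 1, 5, 9 }, 0, 3));
    }
}
```

Since each chunk is just more space-separated tokens appended to the element content, writing one big array or several smaller slices produces equivalent content.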
3. Reading arrays of custom types
And now let's consider the feature that might be the most interesting
aspect of the Typed Access API's array handling: the ability to plug in custom
decoders. Just as with simple values (for which you can use
TypedXMLStreamReader.getElementAs(TypedValueDecoder)), there is
a specific method (TypedXMLStreamReader.readElementAsArray(TypedArrayDecoder))
that acts as the extension point.
One possibility is to use one of the existing simple value decoders (from
package org.codehaus.stax2.ri.typed; inner classes of
ValueDecoderFactory); this would allow implementing an accessor for,
say, QName[] or boolean[]. But for simplicity, let's write our own
EnumSet decoder: a decoder that can decode a set of enumerated values into a
container; for example, colors by their names. We'll do it
like so:
class ColorDecoder
    extends TypedArrayDecoder
{
    public enum Color { RED, GREEN, BLUE }

    EnumSet<Color> colors = EnumSet.noneOf(Color.class);

    @Override
    public boolean decodeValue(char[] buffer, int start, int end) {
        return decodeValue(new String(buffer, start, end-start));
    }

    @Override
    public boolean decodeValue(String input) {
        // would also be very easy to call a standard TypedValueDecoder here
        colors.add(Color.valueOf(input));
        return false; // false means "not full yet"
    }

    @Override
    public int getCount() { return colors.size(); }

    @Override
    public boolean hasRoom() { return true; } // never full

    // Note: needed by calling code, but not part of TypedArrayDecoder
    EnumSet<Color> getColors() { return colors; }
}
And to use it, we would just do something like:
TypedXMLStreamReader sr = ...;
ColorDecoder dec = new ColorDecoder();
sr.readElementAsArray(dec);
EnumSet<Color> colors = dec.getColors();
And obviously one can easily create sets of commonly needed decoders, to
essentially build semi-automated xml data binding libraries.
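To make the decoder contract concrete, here is a self-contained sketch that mimics, in miniature, what readElementAsArray does with a decoder: tokenize the text content and feed tokens to the decoder one at a time, until input runs out or the decoder reports it is full. The interface and class names below are simplified stand-ins of my own, not the real Stax2 types:

```java
import java.util.ArrayList;
import java.util.List;

public class DecoderContractDemo {
    // Simplified stand-in for the array decoder contract: the reader
    // hands over one lexical token at a time; decodeValue returns
    // true once the decoder is full and wants no more tokens.
    interface SimpleArrayDecoder {
        boolean decodeValue(String token); // true -> full
        int getCount();
    }

    // A decoder that collects ints into a growable container,
    // analogous in spirit to the EnumSet-collecting ColorDecoder above
    static class IntListDecoder implements SimpleArrayDecoder {
        final List<Integer> values = new ArrayList<>();
        public boolean decodeValue(String token) {
            values.add(Integer.parseInt(token));
            return false; // never full
        }
        public int getCount() { return values.size(); }
    }

    // Stand-in for the reader side: split element text on whitespace
    // and feed each token to the decoder
    static int feed(String elementText, SimpleArrayDecoder dec) {
        for (String token : elementText.trim().split("\\s+")) {
            if (dec.decodeValue(token)) {
                break; // decoder is full, stop feeding
            }
        }
        return dec.getCount();
    }

    public static void main(String[] args) {
        IntListDecoder dec = new IntListDecoder();
        // prints "3"
        System.out.println(feed("1 5 9", dec));
    }
}
```

The real reader additionally streams the text content in chunks instead of materializing it as one String, but the division of labor between reader and decoder is the same.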
4. Benefits of Array Access using Typed Access API
Now that we know what can be done and how, it is worth considering one
important question: why? What are the benefits of using the Typed Access
API over alternatives like get-the-element-text-and-parse-it-yourself?
Consider the following:
- Allows use of a more compact representation: space-separated values,
instead of wrapping each individual value in an element.
- Faster, not only due to compactness (which in itself helps a lot), but
also due to the more optimal access Woodstox has to the raw data.
- Lower memory usage for large data sets: since array access is chunked,
memory usage is only proportional to the size of the chunks. You can handle
gigabyte-sized data files with modest memory usage, something no
other standard API (or, for that matter, any non-standard API I am
aware of) on the Java platform allows!
- More readable xml: compact representation generally improves
readability.
- With pluggable decoders, one can build simple reusable datatype libraries,
while still adding very little processing overhead.
And these are just the benefits compared to other Stax-based approaches.
The benefits over, say, accessing data via DOM trees (*) are significantly
greater.
(*) Although note that you can actually use the Stax2 Typed Access API on
DOM trees, by constructing a TypedXMLStreamReader from a DOMSource, using
Woodstox 4!