Tuesday, December 23, 2008

Gravity of Java Generics: Type Erasure Sucks

I must admit it now: Java implementation of Generics sucks. Yes, it is true. Even though I had nothing to do with creating the suckitude, I feel guilty for not recognizing and acknowledging it up-front.

So what broke this camel's back? After all, I have been learning bits and pieces about how generics work for a while now, and finally feel competent to even explain how they work to others. I would even go as far as to claim I "understand Java generics" (phew -- that took a while -- better late than never!). Now: generics do make life a bit easier; programming is a bit more literal, and there are fewer casts even if there is more angle bracket pollution. So what if it's mostly syntactic sugar, fairy dust that is (mostly) swept away by the compiler when writing bytecode out, not to be seen again?

Problem is this: although it is mostly ok that code can access the Type information only during compilation, this is poison for anything outside of the class that tries to make sense of the class after it has been compiled. That is, anything that would need Runtime (dynamic) Type Information. Especially so for things like data-binding tools that try to determine state of objects of certain class; and specifically when they try to peek into content type of container classes (Collections, Maps). Thing is: due to type erasure, all you know about that List is that, well, it is a List -- while compiler knows that you claim it is to contain, say, Foobarables, JVM only knows contents come in as "thing known as java.lang.Object"s. Bummer.

Is all lost? Almost, but not quite. Type erasure is... how did mr. Adams put it... [resulting classes are] "almost, but not quite, entirely" [without type information].

Just to illustrate the problem: you might think following makes sense:

List<String> result = binder.bind(inputSource, List<String>.class); // invalid, won't compile

But no: that won't compile. You can not squeeze a Class out of a generic type -- and even if you could, it would just be Thing Known As (in this case) List.class. (why? because typing has been, well, erased, result is the same old untyped List!). Teasingly, JDK actually does have Type classes under "java.lang.reflect" (including the thing that would serve us ok here, ParameterizedType!). But there are precious few ways of actually getting hold of one of these things, except for Class. Which, like we know by now, knows almost nothing about Type information.

But there is one sparkle of hope here. If you dig through JDK Javadocs, you will find that there are a couple of places where fragments of Type Information are actually hiding. Basically:

  • No dynamic type information is available. It's gone, zilch, nada, nihil. Not even with "generic methods": those are just syntactic sugar.
  • But some static type information is being stashed in the only place capable of hiding it: within the class definition.
  • That is: you can not generate type information by instantiating generic instances, but you can actually generate and retain some by declaring typed things. Specifically:
    • Methods retain their (generic) return type information
    • Method arguments likewise retain their type information
    • Sub-classes retain information about generic type parameters used to parameterize their supertypes.
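To make the first two "retained" cases concrete, here is a minimal sketch using only java.lang.reflect. The Inventory class and the DeclaredTypeDemo helper names are made up for illustration; the reflection calls (getGenericReturnType(), getGenericParameterTypes(), ParameterizedType.getActualTypeArguments()) are standard JDK methods.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class DeclaredTypeDemo {
    // Hypothetical bean, used only for illustration
    static class Inventory {
        public List<String> getNames() { return null; }
        public void setCounts(Map<String, Integer> counts) { }
    }

    // Type arguments recorded for a declared type; empty if the declaration was raw
    static Type[] typeArguments(Type declared) {
        if (declared instanceof ParameterizedType) {
            return ((ParameterizedType) declared).getActualTypeArguments();
        }
        return new Type[0];
    }

    // Looks up a method of Inventory by name (names are unique in this sketch)
    static Method find(String name) {
        for (Method m : Inventory.class.getDeclaredMethods()) {
            if (m.getName().equals(name)) {
                return m;
            }
        }
        throw new IllegalArgumentException("no such method: " + name);
    }

    // Generic return type of the getter: List<String>
    static Type[] getterTypeArguments() {
        return typeArguments(find("getNames").getGenericReturnType());
    }

    // Generic type of the setter's only argument: Map<String, Integer>
    static Type[] setterTypeArguments() {
        return typeArguments(find("setCounts").getGenericParameterTypes()[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(getterTypeArguments()));
        System.out.println(Arrays.toString(setterTypeArguments()));
    }
}
```

Note that nothing here ever touches an instance: the information comes from the method declarations stored in the class file, which is exactly why it survives erasure.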

Now: that is not much. But it actually may be just enough to make some things work, as long as we can get hold of one of the things mentioned above. The main remaining problem is that of bootstrapping: for example, once you can get hold of a "getter" (getXxx) or "setter" (setXxx) method, you can tease out type information, if any is declared in class sources. And following classes from there on may allow typing to be uncovered, for properly declared classes and methods. But this will not help with the initial call, such as the one used above. You could of course require a dummy class to be implemented, with a dummy method with either argument or return types. Or, a dummy class extending a parameterized class, and then accessing the super type's type information from the sub-class. But boy do these sound convoluted.
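For the curious, the "dummy class extending a parameterized class" idea can be sketched roughly like this. StringList and SupertypeDemo are hypothetical names invented for this example; Class.getGenericSuperclass() is the standard JDK call doing the work. It also shows the flip side: a plain ArrayList instance only knows about the type variable E, not about String.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

final class SupertypeDemo {
    // Hypothetical dummy sub-class: its only job is to record "ArrayList<String>" in a class definition
    static class StringList extends ArrayList<String> { }

    // Type arguments the given class used when extending its generic superclass
    static Type[] superTypeArguments(Class<?> cls) {
        Type sup = cls.getGenericSuperclass();
        if (sup instanceof ParameterizedType) {
            return ((ParameterizedType) sup).getActualTypeArguments();
        }
        return new Type[0];
    }

    public static void main(String[] args) {
        // Declared sub-class: the actual argument (String) is retained
        System.out.println(superTypeArguments(StringList.class)[0]);
        // Plain instance: its class is just ArrayList, which extends AbstractList<E>,
        // so all we get back is the type variable E
        System.out.println(superTypeArguments(new ArrayList<String>().getClass())[0]);
    }
}
```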

But there has to be a way. And as usual, where there is will, there is a way. So let's look at one clever solution! I mean, next time, in the follow-up entry. Yours, mr. Cliff Hanger.

Wednesday, December 17, 2008

Tequila: Not Just For Breakfast Anymore!

Another public service announcement from the Woodstox project: the second release candidate for eventual 4.0.0-final release, affectionately named "3.9.9-2", hit the streets a minute ago. This is a fairly minor improvement over 3.9.9-1, consisting of:

  • One critical bug fix to XMLStreamReader.getElementText() implementation (hit upon by a performance test suite)
  • Improvement to packaging: now the source tarball FINALLY uses an intermediate directory so that you don't end up with all the crap in your current directory
  • It is possible to use any Stax2 implementation (such as Woodstox) via OSGi service interface: Stax2 API now has "org.codehaus.stax2.osgi" package, which has provider objects that implementations can register as services (for input, output and validation factories). Check out 3.9.9-2 Javadocs for more details.

On a related note, I decided that a stiff major release like 4.0.0 deserves its own code name. So, here it is: Woodstox 4.0 will be henceforth known as "Tequila"; as in 'Woodstox 4.0 "Tequila"'. I'll drink to that!

Download responsibly!

Saturday, December 13, 2008

Jackson 0.9.5: Now With Bean Serialization!

[Update on 22-Oct-2011: a LOT has happened since this blog entry -- be sure to read "7 Jackson Killer Features", "Jackson 1.9" and other later entries here; as well as see FasterXML Jackson Wiki]

The latest Jackson Json-processor release, 0.9.5, is not loaded with tons of features: in fact there is pretty much just one significant new feature. But fortunately, this is the #2 requested feature: ability to serialize regular Java beans/POJOs (so what is #1? read on!). So in this case quality trumps quantity.

Serialization (s11n?) means ability to convert data contained in a Java object into a data format: in this case Json. Prior to 0.9.5, org.codehaus.jackson.map.JavaTypeMapper has been able to serialize some Java objects, but only those that are (or extend) basic Java types, such as java.util containers (Lists, Maps, Sets), wrappers (Boolean, Integer, Double etc), arrays (int[], long[], double[]) or "simple" values (Strings, nulls). While this is somewhat useful for loosely typed data processing, it is still a bit low-level compared to normal dealing with Java Beans (simple but custom value types, where the getXxx()/setXxx() naming convention is used to imply existence of a set of accessible properties -- also known as Plain Old Java Objects, aka POJO).

This is no longer the case: now Jackson can fully serialize instances of any Java classes that use Bean-style method naming.
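To give an idea of what "Bean-style method naming" buys a serializer, here is a rough sketch of getter discovery using plain JDK reflection. This is not Jackson's actual implementation; the GetterScan and Point names are made up, and the rules (getXxx() and boolean isXxx() with no arguments) are a simplified reading of the convention.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

final class GetterScan {
    // Hypothetical value type for illustration
    public static class Point {
        public int getX() { return 3; }
        public int getY() { return 4; }
        public boolean isOrigin() { return false; }
    }

    // Collects property name -> value pairs by calling public no-argument getXxx()/isXxx() methods
    static Map<String, Object> properties(Object bean) {
        Map<String, Object> props = new TreeMap<String, Object>();
        for (Method m : bean.getClass().getMethods()) {
            // skip methods with arguments, void methods, and java.lang.Object's own (such as getClass())
            if (m.getParameterTypes().length != 0
                    || m.getReturnType() == void.class
                    || m.getDeclaringClass() == Object.class) {
                continue;
            }
            String name = m.getName();
            String prop;
            if (name.startsWith("get") && name.length() > 3) {
                prop = name.substring(3);
            } else if (name.startsWith("is") && name.length() > 2 && m.getReturnType() == boolean.class) {
                prop = name.substring(2);
            } else {
                continue;
            }
            // "Origin" -> "origin"
            prop = Character.toLowerCase(prop.charAt(0)) + prop.substring(1);
            try {
                props.put(prop, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("failed to call " + name + "()", e);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(properties(new Point())); // prints {origin=false, x=3, y=4}
    }
}
```

A real serializer then writes each value out recursively; the point is just that the method names alone are enough to find the properties.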

Let's have a look at a simple example of having class User (and related value classes), as follows:


public interface User {
  public enum Gender { MALE, FEMALE };

  public Name getName();
  public Address getAddress();
  public boolean isVerified();
  public Gender getGender();
  public byte[] getUserImage();
}

public interface Name {
  public String getFirst();
  public String getLast();
}
public interface Address {
  public String getStreet();
  public String getCity();
  public String getState();
  public int getZip();
  public String getCountry();
}

So how do we get Json out of an instance of User? Simple:

  User user = ...; // construct or fetch it
  File resultFile = new File("user.json");
  JavaTypeMapper mapper = new JavaTypeMapper();
  mapper.writeValue(resultFile, user);

and there we have it. Output looks something like (with added indentation for readability -- can be enabled in Jackson JsonGenerator as well, if necessary):


{
  "verified":true,
  "gender":"MALE",
  "userImage":"Rm9vYmFyIQ==", /* base64 encoded gif/jpeg; although this is invalid */
  "address": {
    "street":"1 Deadend Street",
    "city":"Mercer Island",
    "zip":98040,
    "state":"WA",
    "country":"US"
  },
  "name": {
    "first":"Santa",
    "last":"Claus"
  }
}

Simple, ay? I thought so.

Want to know more?

Oh and the thing about #1 requested feature? Yes, you guessed it: Bean Deserialization. That's something for Jackson 0.9.6, if my crystal ball is right.

Monday, December 08, 2008

XStream 1.3.1 released: performance improvements

A patch release of XStream, version 1.3.1, was just released. Although it would seem like a minor update based on version number alone, here is one cool fact: this version is measurably faster than 1.3.0. Considering XStream's strong points -- ease of use, minimal configuration needed, serialization of object graphs -- it is good that one of its somewhat weaker areas gets improved.

As an initial data point, results from my "StaxBind" Japex-based test show +20% throughput for both reading and writing simple objects. While not earth-shattering: 20 percent here, another 20 there, and soon we are talking real numbers. :-)

Saturday, December 06, 2008

Typed Access API tutorial, part I

So far I have mentioned "Typed Access API" a few times over past blog entries. As in, probably often enough to irritate; at least given that it has mostly been just namedropping. But this is about to change: I will try to give a simple overview of common usage of the new API.

But first things first: API itself consists of not much more than 2 new interfaces:

  • org.codehaus.stax2.typed.TypedXMLStreamReader
  • org.codehaus.stax2.typed.TypedXMLStreamWriter

(both of which are extended by the matching Stax2 main-level interfaces, XMLStreamReader2 and XMLStreamWriter2)

And while there are plenty of methods in there, this is just a combinatorial explosion caused by different data types, structured types (int vs int array), and xml oddities (element vs attribute).

Given this conceptual simplicity (if not brevity), tutorials do not get too lengthy. Still, there's nothing quite as nice as a bit of cut'n pastable code to get one started, so let's get coding.

This first tutorial focuses on so-called "simple" types: here, "simple" means every supported type other than array and binary types. The latter will be covered in follow-up entries.

1. Writing simple values

Let's first try outputting following simple data:

<entries>
  <entry id="1234">
    <active>true</active>
    <value>10.00</value>
  </entry>
</entries>

it could be done by:

  StringWriter sw = new StringWriter();
  // the cast only works if the Stax implementation in use (such as Woodstox) supports Stax2
  TypedXMLStreamWriter tw = (TypedXMLStreamWriter)
    XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
  tw.writeStartDocument();
  tw.writeStartElement("entries");
  tw.writeStartElement("entry");
  tw.writeIntAttribute(null, null, "id", 1234);
  tw.writeStartElement("active");
  tw.writeBoolean(true);
  tw.writeEndElement(); // /active
  tw.writeStartElement("value");
  BigDecimal value = ...; // BigDecimal to keep exact decimal value (no rounding probs)
  tw.writeDecimal(value);
  tw.writeEndElement(); // /value
  tw.writeEndElement(); // /entry
  tw.writeEndElement(); // /entries
  tw.writeEndDocument();

(for a more convenient way, I always recommend StaxMate helper lib -- but that'd lead to another blog entry so for now we'll just use "raw" Stax2 API)

There are also a couple more types in there: about the only 'advanced' simple type included is QName, which can be used to write properly namespaced qualified names; at least if the stream writer is in namespace-repairing mode, which allows the writer to add namespace declarations automatically.

2. Reading simple values

Typed writing seems like a minor incremental improvement, nothing too drastic. Without type support you would just convert values to Strings; for example:

  tw.writeCharacters(String.valueOf(intValue));

would be a functionally equivalent, although less efficient, way to achieve the same.

Reader-side is where the action mostly is, since code will be more compact as well as more readable.
So, to read content written by code above, we could use something like:

  String docContent = sw.toString();
  TypedXMLStreamReader tr = (TypedXMLStreamReader)
    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(docContent));
  tr.nextTag(); // to point to <entries>
  tr.require(XMLStreamConstants.START_ELEMENT, "", "entries"); // optional check
  tr.nextTag(); // to point to <entry>
  int id = tr.getAttributeAsInt(0); // or: tr.getAttributeAsInt(tr.getAttributeIndex(null, "id"))
  tr.nextTag(); // to point to <active>
  boolean isActive = tr.getElementAsBoolean();
  tr.nextTag(); // to point to <value>
  BigDecimal value = tr.getElementAsDecimal();
  tr.nextTag(); // closing </entry>
  tr.require(XMLStreamConstants.END_ELEMENT, "", "entry"); // optional check
  tr.nextTag(); // closing </entries>

3. So what's the Big Deal?

Ok, so code above is slightly simpler than the alternative: for example, instead of:

  int id = tr.getAttributeAsInt(0);

we could have used:

  int id;
  String value = tr.getAttributeValue(0);
  try {
    id = Integer.parseInt(value);
  } catch (IllegalArgumentException iae) {
    throw new XMLStreamException("value '"+value+"' not an int", tr.getLocation());
  }

(unless we are happy with a random IllegalArgumentException being thrown, which is usually not the case, and can leave out the try-catch block; but that will also lose contextual info on where in the content the problem occurred, or with what input)

But maybe added convenience is not that huge: most developers by now have written their own utility methods. There are still other benefits even just for these simple types (we'll cover benefits of non-simple types later on; they are more plentiful):

  • As implied above, proper exception handling is a plus: typed parser can provide more information about the actual problem (location, underlying data to convert)
  • Typed Access API is based on XML Schema Datatypes: a type system very similar to the Java type system, but not identical. Thus, Typed Access API will work better with other systems based on XML Schema Datatypes than basic JDK parsing/decoding methods would. This improves interoperability.
  • Typed Access API methods can be (and in case of Woodstox, are) more efficient than the DIY alternative. Based on initial testing, processing throughput can increase significantly even for the simplest of types (like booleans, ints): currently by up to 20 - 30%.
  • Code is a bit more readable, since methods explicitly state what is expected
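The "very similar, but not identical" bit is easy to demonstrate with plain JDK calls. In XML Schema, xs:boolean accepts exactly "true", "false", "1" and "0", and surrounding whitespace is collapsed before parsing; the JDK methods play by different rules. The sketch below is only an illustration: parseXsBoolean and jdkRejects are hypothetical helpers, not part of Stax2 or Woodstox.

```java
final class LexicalDemo {
    // Rough sketch of xs:boolean lexical rules: "true", "false", "1", "0" (case-sensitive).
    // String.trim() is an approximation of XML whitespace collapsing (it strips a few more characters).
    static boolean parseXsBoolean(String lexical) {
        String s = lexical.trim();
        if (s.equals("true") || s.equals("1")) {
            return true;
        }
        if (s.equals("false") || s.equals("0")) {
            return false;
        }
        throw new IllegalArgumentException("not a valid xs:boolean: '" + lexical + "'");
    }

    // True if Integer.parseInt() refuses the given text
    static boolean jdkRejects(String text) {
        try {
            Integer.parseInt(text);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseXsBoolean(" 1 "));        // valid xs:boolean
        System.out.println(Boolean.parseBoolean("1"));    // JDK: anything but "true" (ignoring case) is false
        System.out.println(Boolean.parseBoolean("TRUE")); // JDK accepts this; xs:boolean does not
        System.out.println(jdkRejects(" 42 "));           // JDK does not trim; xs:int allows the whitespace
    }
}
```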

4. Next Steps

Ok so far so good. But let's consider this a warm-up act before moving to "advanced" types: arrays and binary content; as well as custom decoding.

Silly performance musings: x.toString() vs (String)x

Here is something I have been wondering occasionally:

given an Object that I know to be of type String, is it more efficient to cast it to String, or to call .toString() on it?

Now, if you have a C/C++ background, the question may not initially make any sense: isn't a cast just syntactic sugar -- a compile-time adjustment to offsets -- whereas toString() is a virtual method call, relatively quite expensive? Well, no, casts in Java can actually be more complicated lookups through the inheritance hierarchy, especially if casting to interfaces (although not as complicated as if Java had Multiple Inheritance). So in that sense, it could be that the method dispatch is cheaper than the cast. Further, since String.toString() just does 'return this', there is a distinct chance it might overall be the faster approach.

Now: I am sure that it is hard to find a case where possible performance difference would matter a lot -- both methods ought to be fast enough -- so this is an academic question. But it is easy enough to write code to test it out, so I did just that. If they ever make Trivial Pursuit: the Java micro benchmark edition, I will be so ready...

In the end, results are surprisingly clear: it is at least twice as fast to cast Object to String as to call Object.toString(). So there you have it. No point in calling toString() if you truly know the type to be String (calling toString() is still useful if it's either String, or some object that renders as String, such as StringBuilder or StringBuffer). C/C++ intuition would have given the right answer, even if not for right reasons.
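Speed aside, the two are not interchangeable, which is worth keeping in mind before swapping one for the other. A tiny sketch (CastVsToString is just a made-up holder class):

```java
final class CastVsToString {
    // Works only if the object really is a String (or null)
    static String viaCast(Object o) {
        return (String) o;
    }

    // Works for any non-null object, whatever its type
    static String viaToString(Object o) {
        return o.toString();
    }

    public static void main(String[] args) {
        Object s = "abc";
        // For an actual String both return the very same instance
        System.out.println(viaCast(s) == viaToString(s));
        // A StringBuilder renders fine via toString() ...
        System.out.println(viaToString(new StringBuilder("abc")));
        // ... but the cast fails at runtime
        try {
            viaCast(new StringBuilder("abc"));
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

And the other way around for null: the cast passes it through, while toString() throws a NullPointerException.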

One interesting sidenote is that this is one case where JDK 1.6 does some funky thingamagic to casts: code is about twice as fast as under 1.4 or 1.5 (especially so for casts, but also for toString()). Quite neat, whatever it does.

Oh, and the source code: this is extremely ugly; view at your discretion etc. But here goes:


public final class Test
{
    final static int ROUNDS = 999999;

    void test(Object[] args) throws Exception
    {
        boolean foo = false;

        while (true) {
            long now = System.currentTimeMillis();
            String desc;
            int result;

            if (foo) {
                desc = ".toString()";
                result = testToString(args);
            } else {
                desc = "(String)";
                result = testCast(args);
            }
            foo = !foo;

            now = System.currentTimeMillis() - now;
            System.out.println("Type: "+desc+"; time: "+now+" (result "+result+")");
            Thread.sleep(100L);
        }
    }

    int testToString(Object[] args)
    {
        int len = args.length;
        String str = "";
        for (int r = 0; r < ROUNDS; ++r) {
            for (int i = 0; i < len; ++i) {
                str = args[i].toString();
            }
        }
        return str.hashCode();
    }

    int testCast(Object[] args)
    {
        int len = args.length;
        String str = "";
        for (int r = 0; r < ROUNDS; ++r) {
            for (int i = 0; i < len; ++i) {
                str = (String) args[i];
            }
        }
        return str.hashCode();
    }

    public static void main(String[] args) throws Exception
    {
        if (args.length < 1) {
            throw new Exception("Need Args!");
        }
        Object[] args2 = new Object[args.length];
        System.arraycopy(args, 0, args2, 0, args.length);
        new Test().test(args2);
    }
}

Thursday, December 04, 2008

How to alleviate the infamous "Xml Invalid Character" problem with Woodstox

1. The Problem

Have you ever hit a problem manifesting itself like so:

  Error: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 12))

when parsing XML? There is a good chance you have, if you regularly process xml.

So how and why does it occur? For common use cases, where one controls both the data source (stuff that gets written as xml) and the reading side, problems seldom occur. Few people feel the urge to add such weird characters in the first place. It is usually only a problem when a legacy data source is used for populating/constructing xml content; or when using a data source with very loose validation. The former often comes as a (c)rusty old Oracle data dump, and the latter as a simplistic system where user data (from a web form or such) is shoved straight into crusty old Oracle, to create the data compost (which eventually becomes legacy data). Either way, characters that cause problems are, more often than not, not even supposed to be there.

But as importantly, while characters such as "vertical tab" or "form feed" are usually of little use nowadays (they are left-overs from days past when one had to use crude in-band signaling mechanisms), they are also often non-problematic: web browsers, for example, mostly convert these to other harmless character codes (such as plain space) before displaying. So, they are colorless and tasteless. Except for xml parsers, which are mandated by law (well, ok not quite, just by xml specs...) to report such irregularities.

So here's the catch: xml specification explicitly forbids using such characters: as per XML specification, these characters can not be included in xml content, anywhere. Not in CDATA sections, not as attribute values, not in processing instructions (not with the mouse, not in the house, Sam... but I digress). With XML 1.1, you could actually use character entities to escape them. Too bad no one uses XML 1.1, and chances are few ever will (and this is due to, well, XML 1.1 sucking bad in many other respects -- one step fore, two back -- but I rantgress here).

2. Woodstox to Rescue

So what is one to do? Most developers intuitively reach for "how-do-I-disable-this-nasty-validation" button. Not so fast: while that is a possible work-around, it is not really a good solution. After all, broken "xml" content is still broken, you are just trying to sweep this inconvenient fact under the carpet.

Instead, one should try to rectify the problem at source. Now: although sometimes producer is not under your control (when you are being sent alleged "xml" content by someone not familiar with concepts like, say, xml...), quite often you do have control. If so, the first thing you should do is to verify that you never produce such pseudo-xml content with these evil characters. If not, you should pester the producer to read this blog entry. :-)

And this is where Woodstox 4.0.0 comes in [fanfare!]. Here is a new feature you might want to use to squash those pesky vertical tabs and their brethren:

  XMLOutputFactory f = new WstxOutputFactory();
  f.setProperty(WstxOutputProperties.P_OUTPUT_INVALID_CHAR_HANDLER,
    new InvalidCharHandler.ReplacingHandler(' '));
  XMLStreamWriter sw = f.createXMLStreamWriter(...);

So what does it do? If you didn't guess it, setting this property will make stream writer silently replace all Java characters that are not valid xml characters with given replacement character. This means that following unit test should pass:

  StringWriter w = new StringWriter(); // 'sw' is assumed to have been created to write into 'w'
  sw.writeStartElement("a");
  sw.writeCharacters("Evil:\u000c!");
  sw.writeEndElement();
  sw.close();
  assertEquals("<a>Evil: !</a>", w.toString());

That works quite nicely: I just started using it myself, for a simple DB-to-xml data dumper (and yes, an address had a form feed in it).
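If you are curious what "valid xml character" means in practice, or are stuck with a writer that has no such handler, the rule is simple enough to sketch with plain JDK code. The ranges below are the Char production of the XML 1.0 specification; the XmlCharFilter class and its method names are made up for this example, and this is not how Woodstox implements the handler.

```java
final class XmlCharFilter {
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Replaces every code point that is not a valid XML 1.0 character
    // (including unpaired surrogates) with the given replacement character
    static String replaceInvalid(String text, char replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // form feed (code 12) is replaced, tab is left alone
        System.out.println(replaceInvalid("Evil:\f!\tok", ' '));
    }
}
```

Doing the replacement in the writer is still nicer, since it catches every code path that produces output; a helper like this has to be remembered at each call site.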

So if you are in the business of producing xml content, consider this a new tool for Greener data production. Woodstox to the rescue -- so that we can all breathe a little easier! (disclaimer: air pollution reduction not scientifically proven)

3. Small Print

Woodstox 4.0 is still in its pre-release phase, so while the latest release (3.9.9-1) has all the features detailed above functioning correctly, the official release has not yet been cut. Use at your own risk. D(r)ink responsibly. But most of all -- have fun!

Wednesday, December 03, 2008

Optimizing Advertisements using Evolutionary Techniques

I have always been interested in different areas of Information Retrieval (lately mostly related to automatic classification, and fact extraction), and so this entry which talks about what a start-up called SnapAds does is quite interesting. Although the basic idea is quite straight-forward (try out variations of the ad, see which ones perform the best, evolve those; also avoids repetition which may reduce click-through), the devil is in details: and pay-off for good implementation could be huge. One needs to sift through lots of data to really learn; but with the scale of the web, smaller and smaller companies could try to do something similar. Very interesting!

Tuesday, December 02, 2008

Belated Thanks to Sun Open Source Folks!

I feel bad for it taking this long to write a simple "Thank You" blog entry, but better late than never. So: Thank You dear Sun Microsystems, for the nice GlassFish Awards Program (GAP)! Special thanks to the folks who pushed through to get GAP itself approved; among many other worthy candidate projects.

In addition to liking the idea of such awards in general, I am of course somewhat biased (or at least, extra thankful?) by the fact that I was fortunate enough to be among the lucky winners -- Pardon my bias (I am especially proud of this mention!). Participation was based on submitting a few SJSXP bug reports, some accompanied by patches (and as I have mentioned earlier, one getting released as part of JDK 1.6 updates).

It is nice to get something back for doing things under the category "I would have done it anyway, things that I keep on doing even-more-so, because of this serendipitous (sp?) awards program". :-)

I think this also underlines one of lesser-known rules of using money as motivational tool: it should not be used as the motivator for getting work done ("if you do X, I pay you Y") -- that is actually counter-productive, as thoroughly proven over the years (it essentially removes motivation to do X, UNLESS Y is paid: even if originally no such motivator was needed!) -- but rather, to reward things that one wants to see more of. This hopefully has the benefit of boosting morale without accidentally assigning a "price tag" for voluntary work.

At any rate: thank you Sun, let's do this again some time!



About me

  • I am known as Cowtowncoder