Thursday, October 29, 2009

On State of State Machines

State machines are things that all programmers should recall from their basic Computer Science courses, along with other basics like binary trees and merge sort. But until fairly recently I thought that they were mostly useful as low-level constructs, built by generator code like compiler compilers and regular expression packages. This contempt was fueled by having seen multiple cases of gratuitous usage: for example, when I worked at Sun, state machines were used to complicate the simple task of parsing paragraphs of line-oriented configuration files. So my thinking was that state machines are only fit for compilers to produce, something non-kosher for enlightened developers.

But about two years ago I started realizing that state machines are actually nifty little devices, to be created not only by software but also by wetware. And that their main benefit can actually be the simplicity of the resulting solution -- when used in the right places.

1. Block Me Not

The first place where I realized the usefulness of state machines was within the bowels of an XML parser. Specifically, when trying to write the non-blocking (asynchronous) parsing core of the Aalto XML processor (more on this nice piece of software engineering in the future, I promise).

The challenge with writing a non-blocking parser is simple to state. A blocking parser -- one that explicitly reads input from a stream and can block until input is available -- has full control of control flow, including the ability to stop only when it wants to (at a token boundary; after fully parsing an element or comment). A non-blocking parser, in contrast, is at the mercy of whoever feeds it data. If there is no more data for a non-blocking parser to read, it has to store whatever state it has and return control to the caller, ready or not. Which basically means it may have to stop parsing at any given character; or even better, half-way THROUGH a character, which can happen with multi-byte UTF-8 characters. And it must do so in such a way that whenever more data does become available, it is ready to resume parsing based on the newly available data.

So what is needed to do that? The ability to fully store and restore the state. And this is where the state machine made its entrance: gee, wouldn't it make sense to explicitly separate out the state, and create a state machine to handle execution? Or, in this case, a set of small state machines.

Indeed it does; and once you go that route, the implementation is not nearly as complicated as it would be if one tried to do it all using regular procedural code (which might just be infeasible altogether).
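
To make the "stop half-way through a character" case concrete, here is a minimal sketch of one such small state machine: a UTF-8 decoder that can be fed arbitrary chunks of bytes. This is not Aalto's actual code; the class name ResumableUtf8Decoder and its methods are made up for illustration, and validation is deliberately thin (overlong encodings and surrogates are not rejected).

```java
// Hypothetical illustration (not Aalto code): all decoding state lives in two
// fields, so the decoder can return to the caller at ANY byte and resume later.
final class ResumableUtf8Decoder {
    private int partial;    // bits of the code point gathered so far
    private int remaining;  // continuation bytes still expected; 0 = at a character boundary

    /** Decodes whatever is available, appending completed characters to 'out'. */
    void feed(byte[] buf, int offset, int length, StringBuilder out) {
        for (int i = offset, end = offset + length; i < end; i++) {
            int b = buf[i] & 0xFF;
            if (remaining == 0) {
                // State "boundary": the lead byte tells how many bytes follow
                if (b < 0x80) {
                    out.append((char) b);
                } else if ((b & 0xE0) == 0xC0) {
                    partial = b & 0x1F;
                    remaining = 1;
                } else if ((b & 0xF0) == 0xE0) {
                    partial = b & 0x0F;
                    remaining = 2;
                } else if ((b & 0xF8) == 0xF0) {
                    partial = b & 0x07;
                    remaining = 3;
                } else {
                    throw new IllegalArgumentException(
                            "Invalid UTF-8 lead byte 0x" + Integer.toHexString(b));
                }
            } else {
                // States "1, 2 or 3 bytes to go": only continuation bytes are legal
                if ((b & 0xC0) != 0x80) {
                    throw new IllegalArgumentException(
                            "Invalid UTF-8 continuation byte 0x" + Integer.toHexString(b));
                }
                partial = (partial << 6) | (b & 0x3F);
                if (--remaining == 0) {
                    out.appendCodePoint(partial);
                }
            }
        }
        // Running out of input is not an error: state is kept for the next call
    }

    /** True if the decoder is not in the middle of a multi-byte character. */
    boolean atCharacterBoundary() {
        return remaining == 0;
    }
}
```

The caller simply invokes feed() whenever bytes arrive; if a chunk ends in the middle of a character, atCharacterBoundary() returns false and the next call picks up exactly where the previous one left off.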

2. All Your Base64 Are Belong To Us

Ok, complex state keeping should be an obvious place for state machines to rule. But much smaller tasks can benefit as well.
Base64 decoding is a good example: given that decoding needs to be flexible with respect to things like white space (linefeeds at arbitrary locations), possible limitations on the amount that can be decoded in one pass (with incremental parsing, as is the case with Woodstox), and the need to handle possible padding at the end, writing a method that does base64 decoding is a non-trivial task. I tried doing that, and the resulting code was anything but elegant. I would even go as far as to call it fugly.

That is, until I realized I should apply earlier lessons and see what comes of simple state keeping and looping. Lo and behold, the tight loop of base64 decoding is tight both by amount of code (rather small) and processing time (pretty damn fast). The resulting state machine has just 8 states (4 characters per 24-bit unit to decode, a few more to handle padding), and the code is surprisingly simple and easy to follow (but still long enough not to be included here -- check out Woodstox/Stax2 API class "org.codehaus.stax2.ri.typed.CharArrayBase64Decoder" if you are interested in details).
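
For a rough idea of the shape of such a decoder, here is a much-simplified sketch. It is NOT the Woodstox class mentioned above: the class name Base64StateMachine, its methods and its 6-state layout (rather than 8) are assumptions made for illustration, and a real implementation would use a lookup table instead of indexOf().

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not the Woodstox/Stax2 decoder.
final class Base64StateMachine {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // States 0..3: number of data characters seen in the current 4-character unit
    private static final int STATE_PAD2 = 4; // saw "xx=", second '=' must follow
    private static final int STATE_DONE = 5; // padding complete, only white space allowed

    private int state = 0;
    private int bits = 0; // up to 24 bits collected for the current unit

    /** Decodes one chunk; may be called repeatedly with chunks split anywhere. */
    void feed(CharSequence chunk, ByteArrayOutputStream out) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // white space is allowed in any state
            }
            if (c == '=') {
                switch (state) {
                case 2: // 12 bits collected: one full byte, then a second '=' is required
                    out.write(bits >> 4);
                    state = STATE_PAD2;
                    break;
                case 3: // 18 bits collected: two full bytes
                    out.write(bits >> 10);
                    out.write(bits >> 2);
                    state = STATE_DONE;
                    break;
                case STATE_PAD2:
                    state = STATE_DONE;
                    break;
                default:
                    throw new IllegalArgumentException("Unexpected padding character");
                }
                continue;
            }
            int value = ALPHABET.indexOf(c);
            if (value < 0 || state > 3) {
                throw new IllegalArgumentException("Unexpected character '" + c + "'");
            }
            bits = (bits << 6) | value;
            if (state == 3) { // fourth character completes a 24-bit unit
                out.write(bits >> 16);
                out.write(bits >> 8);
                out.write(bits);
                bits = 0;
                state = 0;
            } else {
                state++;
            }
        }
    }

    /** True if input so far ends at a unit boundary or after complete padding. */
    boolean isComplete() {
        return state == 0 || state == STATE_DONE;
    }
}
```

Because the only things carried between calls are 'state' and 'bits', the same loop handles linefeeds in odd places, input that arrives in pieces, and padding, without any special-case lookahead.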

3. Case of "I really should have..."

One more case where a state machine approach would probably have worked well is that of "decoding a framed XML stream".

At work, there is an analysis system that has to read gigabytes of data. The data consists of a sequence of short XML documents, separated by marker byte sequences that act as a simple framing mechanism. The task itself is simple: take a stream, split it into segments (by markers), feed each to the parser. But making it both reliable and efficient is not quite as easy: the marker sequence consists of multiple bytes, and theoretically the bytes in question could belong to a document: it's only the full sequence that cannot be contained within a document. Plus, for extra credit, one should try to avoid having to re-read data multiple times.

So, foolishly I went ahead and managed to write a piece of code that does such de-framing (demultiplexing) efficiently (which is needed for the scale of processing we do). But the code looks butt ugly; and it took a bit of testing to make it work correctly. Unfortunately I only had the light bulb moment after writing (... and fixing) the code: Would this not be a PERFECT case for writing a little state machine, where one state is used for each byte of the marker sequence?

Maybe next time I will actually consider techniques I recently re-discovered, and apply them appropriately. :-)
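
For the curious, a sketch of what such a little state machine could look like follows. It is not the code I actually wrote at work; the class name FrameSplitter and its methods are hypothetical. The state is simply the number of marker bytes matched so far, and a small fallback table (the same idea as in Knuth-Morris-Pratt string matching) is an added assumption to keep things correct for markers whose prefix repeats within the marker.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one state per byte of the marker sequence.
final class FrameSplitter {
    private final byte[] marker;
    private final int[] fallback; // state to fall back to after a partial match fails
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    private int matched = 0;      // THE state: marker bytes matched (and held back) so far

    FrameSplitter(byte[] marker) {
        if (marker.length == 0) {
            throw new IllegalArgumentException("Marker must not be empty");
        }
        this.marker = marker.clone();
        this.fallback = new int[marker.length];
        for (int i = 1, k = 0; i < marker.length; i++) {
            while (k > 0 && marker[i] != marker[k]) {
                k = fallback[k - 1];
            }
            if (marker[i] == marker[k]) {
                k++;
            }
            fallback[i] = k;
        }
    }

    /** Consumes one chunk (each byte looked at once); returns documents completed by it. */
    List<byte[]> feed(byte[] buf, int offset, int length) {
        List<byte[]> frames = new ArrayList<>();
        for (int i = offset, end = offset + length; i < end; i++) {
            byte b = buf[i];
            while (matched > 0 && b != marker[matched]) {
                // Partial match failed: the oldest held-back bytes were document content
                int next = fallback[matched - 1];
                current.write(marker, 0, matched - next);
                matched = next;
            }
            if (b == marker[matched]) {
                if (++matched == marker.length) { // full marker seen: document is complete
                    frames.add(current.toByteArray());
                    current.reset();
                    matched = 0;
                }
            } else {
                current.write(b);
            }
        }
        return frames;
    }

    /** Call at end of input: returns trailing bytes, including any partial marker. */
    byte[] finish() {
        current.write(marker, 0, matched);
        matched = 0;
        byte[] rest = current.toByteArray();
        current.reset();
        return rest;
    }
}
```

Bytes that might be the start of a marker are held back (they are implied by the state, so nothing needs re-reading), and only get written to the current document once the match is known to have failed.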

Tuesday, August 04, 2009

Jackson 1.2 released!

Even without Netcraft confirming it, this much is official: Jackson 1.2.0 has just been released, and is available from here.

Besides the 2 Grand Features already showcased (BYOC, Mix-in Annotations), here are some other choice additions:

  • Ability to use declared (static) type for serialization, instead of runtime type: useful if you want to enforce public API, and avoid leaking implementation details
  • Ability to suppress errors for unknown (unrecognized) properties
  • More support for non-standard JSON content: can optionally parse content where field names are not quoted (something Javascript allows and standard JSON does not)
  • Ability to use "delegating" creators -- constructors and factory methods that take an intermediate bound object to construct a POJO from. Typically this means taking in a Map<String,Object> (which Jackson binds), and then extracting loosely typed data for the POJO
  • ObjectMapper now implements JsonNodeFactory: can construct all JsonNode types without explicitly getting a factory implementation -- useful when building Tree Models from scratch

As always, download responsibly! And don't forget to click the advert... I mean, tip the waitresses.

ps. I'll be off for a brief (2 week) vacation, so there won't be much new material to read here.

Thursday, July 09, 2009

Are GAE developers a bunch of ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: the API is a mess, and the concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shaken my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right the second time around. No such luck: Google App Engine was a stillbirth, plagued by very similar problems as Android. Most specifically, a significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by a different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to the GAE team). But this is just a ridiculous argument, because:

  1. Being but an API package, there is no functionality that could possibly have security implications (yes, I know exactly what is within those few classes -- the only actual code is for implementation discovery, which was copied from SAX), and
  2. If there are problems with implementations of the API (which should be irrelevant, but humor me here), the same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about the package in question. I could as well ask about regular expression or image processing packages, it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is a little more pressure to fix these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both are doing "their own things", although those are very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering a low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc.), but not a managed runtime execution service).

Saturday, June 27, 2009

Woodstox, high impact factor & being #32 on Top Open Source Java libs list

Another interesting data point, this time from analysing Maven dependency paths: the "Most Referenced" list. Looks like Woodstox is quite widely used by projects that use Maven, or at least declare their dependencies using it: I assume the magic number 1838 (which gives rank #32) could mean the number of other projects depending on Woodstox. Not too shabby for an XML parser. Getting on the first result page is quite remarkable; especially considering that Woodstox ranks higher than many other worthy Java open source libraries like XStream, Hibernate, Quartz, Xalan and Velocity. And only slightly (by about 50% :-) ) trailing such a ubiquitous thingy as Spring.

Although this is just one way of estimating the popularity of various (Java) OS libs, it is still interesting, because it has similarities to how scientific articles are ranked (impact factor; although here weights are uniform). And also since it could lend itself to Google PageRank style extensions as well... let's see.

Monday, June 22, 2009

Jackson Goes All 1.1

Due to the rapid speed of Jackson JSON processor development, a significant new release was just cut. The release goes with the catchy nickname "1.1" (given that 1.0 was to be known as "Hazelnut", this could perhaps be known as "Macadamia", but I digress).
Beyond the obvious utility aspect (I use Jackson myself and want to start using some of the new features at work, with the "official" version), this should also be good for getting feedback on some exciting new features (like JSON Schema generation).

Here are the highlights from 1.1 announcement (for more complete list, refer to full 1.1 release notes):

  • Support for JAXB annotations: you can reuse existing JAXB-annotated beans; and support can optionally be combined with 'native' Jackson annotations (using AnnotationIntrospector.Pair for chaining)
  • Ability to generate JSON Schema definitions using Jackson serializers on arbitrary POJO (package for schema is "org.codehaus.jackson.schema", part of Mapper jar, but it is invoked using ObjectMapper like all data mapping operations)
  • Support for direct field access: public member fields and explicitly annotated fields (using @JsonProperty) can be serialized and deserialized. And unlike with JAXB, it is ok to find both a field and methods (methods have precedence if this happens)
  • Annotation set has been streamlined: although all existing 1.0 annotations work (and will work for all 1.x releases), almost all functionality can be defined using but 3 new annotations:
    • @JsonProperty for indicating getters/setters/accessible fields, and to override the associated logical property name if need be
    • @JsonSerialize to configure serialization (external serializer to use, whether to output null/non-default properties etc.)
    • @JsonDeserialize to configure deserialization (external deserializer to use, sub-types to use)

Part of new functionality (namely, JAXB annotation support) lives in a brand new jar ("jackson-xc" aka "Xtra-Curricular stuff"); otherwise deployment aspects haven't changed.

With 1.1 done, development for 1.2 version can start next. There's a big list of more functionality to implement -- but discussing that will be worth a separate blog entry. Stay tuned!
(and remember to check out 1.1 JavaDocs at Codehaus: it's the easiest way to document things and I really try to make them useful)

ps. The Really Useful Backing Corporate Entity known as FasterXML.com offers full support for using this new version to maximum effect. Just in case you weren't aware of such support.

Tuesday, June 09, 2009

Faster, XML, Faster!

It appears that FasterXML (the commercial support organization behind Jackson, Woodstox, StaxMate and Aalto) is debuting on the Seattle Startup Scene: according to this survey, it is close to breaking into the hotly contested Northwest Startups Top-300 list. :-)
In fact, one of our fellow up-and-comers, MarketOutsider (hi Bryce!) is within our sight with ranking north of 300 limit.

One of the important next steps will be figuring out the exact details of licensing for Aalto -- it is something that actually has lots of potential, even if it is a bit of an uncut diamond right now. Its asynchronous (non-blocking) parsing specifically should be very useful for high-concurrency (thousands of concurrent connections) use cases. And being 2x as fast as Woodstox (essentially, as fast as fast C XML parsers!) is nice as well. Shaving off CPU cycles pays off if you pay by cycle (think EC2).

And beyond that, it would be good to get to build some actual new products, from Hadoop-on-S3 processing systems to plug-n-play database front-end web services. And of course there is all the momentum Jackson has: maybe it'll work nicely with GWT in the near future.
But more on these things when plans inch forward.

Tuesday, May 19, 2009

On importance of choosing the right tool

Tool Choice Matters: it makes the difference between "Nailed it" and "Screwed it up"...

Friday, May 15, 2009

How many classes does it take to serialize a POJO?

(or: the usefulness of class count as a metric for simplicity)

A recent JSON package comparison got me thinking about perceived simplicity (or lack thereof) of libraries. While I do not really think the number of classes is a generally useful metric of a package (nor that it generally correlates with its fitness), I can at least see how an argument could be made that sometimes "small is beautiful" (especially if a more accurate metric like resulting jar size was used) -- it would seem strange if supporting libraries were significantly larger than the main code of a plug-in or such.

But what I think is the actual fallacy is using the number of implementation classes as some sort of proxy for the simplicity of a package; especially regarding being simple to use (intuitive, easy to use etc). As in assuming that a package with, say, 12 classes, is simpler to use than one with 250 classes. The problem is this: from the user's perspective, only those classes that the user has to directly interact with really matter: they are the public API, and contain all the complexity the user is faced with. Implementation classes seldom matter -- they are there, get used, but are not exposed to you. There is no cognitive load from such implementation details.

So back to the question the title asked: regarding Jackson specifically, how many classes do YOU as a developer really need to know to use it?

I think it can be as low as just one: the all-powerful (org.codehaus.jackson.map.)ObjectMapper.
For most users, that's the only class they need to be familiar with, from within Jackson class library.

And even power users only need to know a couple of additional classes:

  • (org.codehaus.jackson.)JsonFactory for constructing other things
  • (org.codehaus.jackson.)JsonParser if streaming parsing (or data binding, tree model) is needed
  • (org.codehaus.jackson.)JsonGenerator if streaming JSON writing (or bean, tree model serialization) is needed
  • (org.codehaus.jackson.)JsonNode if Tree Model is used for processing, instead of or in addition to streaming processing or data binding.

which would give us a grand total of 5 classes you need to familiarize yourself with. And for good developers that deal with error cases, one more (JsonProcessingException) for a bit more error handling.

From there on, additional classes (exceptions, configuration objects) are only needed when more functionality is needed; and most of additional classes are rather simple: especially annotations which usually are little more than markers, tags.

In fact, another allegedly "simpler" package with 7 classes probably requires you to know all of them. And chances are there is less modularity in division of concerns, likely leaking unnecessary implementation details into the API.

Of course, this is not the only problem with the "classes as measure of complexity" idea -- having a properly modular API with more classes can be much more palatable than one with just a single monster swiss pocket knife class -- but it should be enough to get you thinking seriously about whether to apply such simplistic metrics for evaluating simplicity.

Assessing simplicity has lots of complexity to it. And fundamentally, like beauty, simplicity is in the eye of the beholder.

Monday, May 11, 2009

Jackson JSON-processor turns 1.0.0

Ok: it is now official: the official Jackson JSON-processor version 1.0.0 has just been released. Get it while it's Hot!

Friday, May 01, 2009

Another JSR with Potential for Goodness: JSR-303, Bean Validation API

Here's something less depressing (... than the acquisition of Sun by one of worst possible suitors [IMO]) from the Java land: JSR 303, "The Bean Validation API", seems like a rather useful little tool to have.

So what is it? Basically, it is a pluggable annotation-based component for validating data constraints on beans, typically used for things like validating user input from web forms. My personal interest, however, is more related to another obvious use case: that of validating request messages for web services. Either way, writing validation code is brain-numbingly dull monkey coding, and writing it has been a necessary part of many Java developers' daily jobs. But with this new API (and more importantly the leading implementation by the Hibernate team) the programming part can be mostly eliminated; and the rest will be a simple matter of applying annotations. At least when defining simple rigid data type constraints (min/max values, lengths, non-null, matches a regexp).

Instead, constraints to validate can be declared by simple standard annotations (and/or custom ones that can be built using guidelines and components from the API), attached to Bean fields and/or access methods. For example:

import javax.validation.constraints.*;

  public class MyBean
  {
     @NotNull // can't be null (not optional)
     @Size(min=4, max=40) // length, [4, 40]
     String name;

     @Max(20) // no more than 20
     int retries;

     @NotNull
     @Valid // means that instance is recursively validated
     OtherBean childBean;
  }

And validation itself is done using something like

  // needs: import java.util.Set; import javax.validation.*;
  ValidatorFactory factory = Validation.buildDefaultValidatorFactory();
  Validator v = factory.getValidator();
  Set<ConstraintViolation<MyBean>> probs = v.validate(myBeanInstance);

which returns a set of ConstraintViolations, each of which details the field that had the problem (path to the field via references using dot notation) and a matching localizable problem description.

For further discussion, this article is a good follow up; and JavaDocs should be enough to get you going with the details.

About me

  • I am known as Cowtowncoder