Saturday, August 14, 2010

Another interesting(-looking) data store on Java platform: Krati?

Ok, looks like there is one more storage option I really should investigate, Krati. What seems appealing (at first glance) is the understanding that performance optimizations on the Java platform are quite distinct from those for systems written in C/C++ (or on Erlang and other distinct platforms). And especially the attempt to make good use of the big discrepancy between the performance of random access and sequential access; given that the latter can be an order of magnitude faster, it may well make sense to add more processing to be able to do sequential writes even if the higher-level abstraction is concurrent random access.
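To make the sequential-versus-random trade-off concrete, here is a minimal sketch of the general idea: callers see random-access put/get, but every write is appended to the end of a log file, and an in-memory index remembers where each value lives. This is not taken from Krati; the class name AppendOnlyStore and its methods are hypothetical, and real stores also deal with compaction, recovery and concurrency.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: random-access put/get on top of strictly sequential writes. */
class AppendOnlyStore implements AutoCloseable {
    private final RandomAccessFile log;
    // key -> { offset in log, length in bytes } of the latest value
    private final Map<String, long[]> index = new HashMap<>();

    AppendOnlyStore(Path file) throws IOException {
        this.log = new RandomAccessFile(file.toFile(), "rw");
    }

    /** Every write, including an overwrite, is appended at the end of the log. */
    synchronized void put(String key, String value) throws IOException {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        long offset = log.length();
        log.seek(offset);
        log.write(data);
        index.put(key, new long[] { offset, data.length });
    }

    /** Reads are random access, guided by the in-memory index. */
    synchronized String get(String key) throws IOException {
        long[] entry = index.get(key);
        if (entry == null) {
            return null;
        }
        byte[] data = new byte[(int) entry[1]];
        log.seek(entry[0]);
        log.readFully(data);
        return new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public synchronized void close() throws IOException {
        log.close();
    }
}
```

The cost of this design is extra bookkeeping (the index, and stale values left behind in the log), which is the "more processing" being traded for sequential writes.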

Of course there are lots and lots of other choices: from stripped-down "traditional" storage (like using MySQL InnoDB, see for example g414-inno) to BDB variants, Tokyo Cabinet and Redis. And higher-level systems that use roll-your-own storage (like Cassandra does by default). And this is good, I think; for truly optimal performance one solution can't fit all -- different storage options are best fits for different system designs.

Monday, June 28, 2010

Async HTTP Client 1.0 released

Ok, this announcement went out last week, but it is important to re-iterate: version 1.0 of Ning async HTTP Client is now out!

This is good news for multiple reasons: obviously ability to do asynchronous (read: non-blocking, i.e. no need to use one thread for each and every open connection) HTTP communication is valuable in itself. But I also hope that this spurs some friendly competition in improving all HTTP-based communication on Java platform -- there are still features that are not implemented, and due to complexity of efficient connection handling, there should be still room for improvement with performance as well. And finally this should lead to wider adoption of this relatively new library, so that it gets properly battle-tested and proven.

I am actually planning on using this client for cases where regular blocking client could work, to see how well it performs and exactly how easy it is to use non-blocking API, compared to existing blocking alternatives.

Monday, May 31, 2010

On prioritizing Open Source projects, tasks: First In, First Out

I have been bit less productive with my "extra-curricular" activities (open source coding, writing blogs) lately. This is mostly due to intense focus on my paying daytime job, and is part of the natural cycle of things for me.
But although I have had a bit less time and energy to spend on these tasks, I have had relatively more time to think about things. It turns out that the amount of time I have to think about things does not quite correlate with the amount of time to do things (which is an interesting thing in its own right, but maybe worth a separate blog entry). More thinking often leads to having more ideas for writing blog entries; even if the writing activity itself is still constrained by time crunch.

1. FIFO as a prioritization mechanism

Anyway: I have realized that a fundamental operating principle that I have with respect to managing my open source projects -- tasks within projects, relative focus between projects -- is that of trying to tackle oldest issues first. Good old First-in-First-out (FIFO) queuing of things.
Except for high-priority urgent bug fixes, which I generally fast track (but which fortunately are not very common), I do try to increase the priority of issues the longer they have gone unfixed. This is actually quite different from many task prioritization methods (even if it is a well-known algorithm for operating system process prioritization), but I feel it is an important part of maintaining a "culture of excellence", ensuring that quality of project output remains high. And of reducing the risk of letting most of your open source projects stagnate to death.
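As a rough sketch, the rule above (urgent fixes jump the queue, everything else goes oldest-first) can be written as a two-level comparator. The names Backlog and Issue, and the idea of tracking a plain "day opened" number, are made up for illustration.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical backlog: urgent issues first, otherwise oldest issue first (FIFO). */
class Backlog {
    static final class Issue {
        final String title;
        final boolean urgent;
        final long openedDay; // smaller value = opened earlier

        Issue(String title, boolean urgent, long openedDay) {
            this.title = title;
            this.urgent = urgent;
            this.openedDay = openedDay;
        }
    }

    // Boolean false sorts before true, so "!urgent" puts urgent issues at the front
    private final PriorityQueue<Issue> queue = new PriorityQueue<>(
            Comparator.comparing((Issue i) -> !i.urgent)
                      .thenComparingLong(i -> i.openedDay));

    void add(String title, boolean urgent, long openedDay) {
        queue.add(new Issue(title, urgent, openedDay));
    }

    /** Returns the title of the issue to work on next, or null if nothing is queued. */
    String next() {
        Issue issue = queue.poll();
        return (issue == null) ? null : issue.title;
    }
}
```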

Thing is: as with fish (and guests, as per Mark Twain), bugs also smell more the longer they stay. Smell of code rot comes from long-standing unresolved issues. It also tends to be the case that it is easier to keep momentum than to get things rolling -- this makes it hard to revive stagnant projects; but possible to keep recently(-enough) worked-on projects chugging along.

Now: it is fairly well understood that teams should at least occasionally go through the list of long-standing issues and ideally not only re-visit them but also fix them. In for-profit development, targets and priorities tend to fluctuate more than with the labour-of-love projects that most open source projects still are. This tends to make it less likely that older issues get resolved, as priorities are more driven by who cries loudest, and most recently. Still, most good developers I know are uncomfortable leaving issues unresolved before starting to extend functionality, to build new things.

But I think the FIFO principle as a driving force for prioritization goes beyond increasing priority of old tasks. I think it also should (and does, in my case) drive relative priorities of different projects (or services, systems, libs, frameworks). And this is something I also try to do more.

2. Example case: my own project prioritization ("product backlog")

Specific case in point is that of my focus on my current "flagship" project, Jackson JSON processor. Jackson has had my main focus for well over a year now -- mostly since it is by far the most popular of things I have built; and will likely remain so for a while now. So if I chose to, I could spend all my time and then some just working on issues related to Jackson.
Doing so would, however, essentially kill other projects I am heavily involved with -- Woodstox, StaxMate, Aalto, Java Uuid Generator -- as well as prevent me from expanding to new areas (like upcoming "compact trie" package, google for "ning tr13"; more on this once it is ready to be announced). And that is not something I find compelling as an idea. Especially since code that is not worked on will start to rot at fast rate, and becomes useless surprisingly fast.

So: as frustrating as it is to watch issues stack up for important projects like Jackson, the way I try to do things is to cycle through projects that I consider worth keeping alive. Sometimes it just means flipping between two projects; and some projects may even become complete in their own right, meaning there just isn't that much to work on, and thus the cost of switching focus for some minor work just isn't worth it. But to give a further example of what I mean, here is my current thinking of roughly how I hope to complete the next tasks, with respect to projects I work on (note: priorities are not absolute or carved in stone; they are akin to a Scrum sprint plan, if even that solid):

  1. Complete Java Uuid Generator rewrite to version 3.0
  2. Work on Woodstox 4.1, focusing on XML Schema handling improvements (working with couple of other OS developers who have stake in this area -- including work on Sun Multi-Schema Validator)
  3. Finish version 1.0 for the "compact trie" project
  4. Implement minor extensions for StaxMate, to get to version 2.1
  5. Jackson version 1.6: must have better Enum type handling; should have "materialized interfaces"; can not be delayed too long since there's always version 1.7 to write...
  6. Compact binary format alternative for JSON (with Jackson as reference implementation)
  7. ... maybe consider implementing "DataMate" (if you haven't heard my brainstorming on what it is, consider yourself lucky :) )
  8. Cycle through projects again!

So why consider work on JUG as the first priority? Well, for one, the time is right to upgrade it to be useful and relevant -- JDK 1.5 introduced the java.util.UUID class; JDK 1.6 introduced access to the Ethernet MAC address; and there are "new" use cases as well (Cassandra, for example, uses time-based UUIDs heavily!). And as importantly, the work has reasonably limited scope: it will take about a week more of focus to get it all done, as I have already spent some time over the past month or so to make this reality.

And Woodstox? While XML does not have the same momentum in the J2EE world as it once had, Woodstox is a very widely used, complete and useful library. But its XML Schema support has only recently been more heavily exercised, and multiple implementation flaws have been uncovered. To make Woodstox relevant as a first-class Java XML Schema supporting tool, some work is needed. Further, there are some useful improvements in trunk that can only be released with version 4.1 (retro-fitting to 4.0 would be too risky).

I think you get the idea -- maybe not exactly why I feel this order is sensible, but at least see that there are multiple conflicting factors. I guess I know my own working habits well enough to know that "out of sight, out of mind", meaning that longer a project is not being worked on, harder it will be to get back to work on it. And so the best way to keep all the balls in the air is to juggle through them; and do this as a semi-formal process and not rely on user request to trigger such changes (esp. since there are more requests for work than time for it).

Wednesday, May 19, 2010

Un-hibernating projects: Java Uuid Generator, getting ready for 3.0!

As cycle of seasons has rolled to late spring, it is time for hibernating things -- bears, and stagnant open source projects -- to wake up and start moving. It just so happens that this is the case with venerable Java UUID Generator (JUG): my first true Open Source project.
Although it is not exactly the first thing that I ever released as open source (that would be something called "NetReaper", or perhaps "DLR", both from late 90s -- few have ever heard of them -- heck, even "Fractalizer" is older!), and much less the first piece of software I have released (shareware lib/app that compressed Amiga Soundtracker files using delta compression is probably the first one, from late 80s!), I count it as basically starting point of my open source "career".

So what is happening? Well: there is the new JUG project page (at the OS darling GitHub); matching skeletal JUG product page at FasterXML; and of course the brand new JUG users discussion group (java-uuid-generator-users) at Google Groups, waiting for users to talk about it. And probably most interestingly, actual development effort to produce third major version, 3.0.

Given that the project has spent the past 5 or so years changing very little, why is there a new development effort? Mostly because the JDK finally caught up with JUG, so to speak -- JDK 1.6 finally has a pure Java method of accessing Ethernet interface MAC addresses -- but partly also because of other niceties that can now be added (java.util.UUID was added in JDK 1.5, which was not the stable version at the time of writing JUG 2.0). And finally, there's quite a bit of clean up that would be nice to do if I was to work on the code.

Given above, here are the modest goals for version 3.0

  • Add convenient support for using local Ethernet address, without using JNI library (requires JDK 1.6); and remove legacy code that was needed for JNI
  • Change UUID type to use from JUG-specific to java.util.UUID (also allows removing quite a bit of code)
  • Build/deployment changes: change build to Maven (including releasing builds to Maven repos); jars built as OSGi bundles as well
  • SCM changes: move from Safehaus/svn to GitHub/git
  • Improve API to avoid relying heavily on singletons; streamline for simpler (and perhaps more elegant) access
  • Add support for one "new" UUID generation method (using SHA-1 instead of MD5 for name/hash based generation)
  • Maybe even write a simple tutorial for using the lib!

Which is just to say, renovate the package so it does not feel quite so 2002 any more (which is when it was written originally). :-)
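For reference, the JDK pieces these goals lean on look roughly like this. The sketch uses only standard JDK classes (java.util.UUID and java.net.NetworkInterface), not the JUG API; the class name UuidBasics and its helper methods are invented for illustration. Note that the JDK's own name-based factory method is MD5-based; SHA-1 name-based generation is one of the things the list above leaves to JUG.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.UUID;

/** Hypothetical helper showing the JDK building blocks, not the JUG API. */
class UuidBasics {
    /** Name-based UUID; the JDK method hashes with MD5 (UUID version 3). */
    static UUID nameBased(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    /** Random-number based UUID (version 4). */
    static UUID random() {
        return UUID.randomUUID();
    }

    /**
     * Pure-Java lookup of an Ethernet (MAC) address, possible since JDK 1.6.
     * Returns the first 6-byte hardware address found, or null if none is accessible.
     */
    static byte[] firstMacAddress() throws SocketException {
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        while (interfaces != null && interfaces.hasMoreElements()) {
            byte[] mac = interfaces.nextElement().getHardwareAddress();
            if (mac != null && mac.length == 6) {
                return mac;
            }
        }
        return null;
    }
}
```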

Monday, May 17, 2010

Finally: Java JSON Schema validator; based on Jackson

Ok, here is some good news for Java developers who use (or might like to use) JSON, but are bothered by the lack of data format validation options: Nicolas Vahlas has written the first Java JSON Schema validator, and it is available from Gitorious as project json-schema-validator.

I have not yet had time to dig deep into it, but there are all signs of it being so-called Good Stuff. Not just because it is based on Jackson -- although proper reuse of existing solid components is generally a good sign -- but because the description gives an idea of the author being someone who actually knows what he is doing.

So please check it out if said functionality seems at all interesting: the best way to ensure it becomes a first-class tool (and maybe even help JSON Schema standard improve along the way) is to use it, give feedback, and get the whole flywheel-of-virtue (aka virtuous cycle) thing going on. That's how things like Jackson and Woodstox became good: feedback is the amplifier of open source productivity.

Ok, enough raving: I'm off to get the sources for bit of closer look. :-)

Monday, April 12, 2010

More efficient client-side HTTP handling with the new Async HTTP client @GitHub

1. Yet another HTTP-client?

Ok now: I am aware of the fact that there are quite a few contestants for the "best Java HTTP client"; starting with the well-rounded and respected Apache HTTP Client (esp. version 4.0). But there is now a very promising, up and coming young challenger, aptly named Async HTTP Client ("Ning async http client", considering its corporate sponsor at GitHub), written by a very competent guy whose past work includes things like Glassfish, and especially its Atmosphere module (async http goodness; Comet, WebSocket etc).

Given it has the single most important thing an open source project needs (at least one technically strong developer who knows the domain well), I have high hopes for this project, and recommend that you keep it in mind if you need an HTTP client for high-volume server-side systems (why server-side? because that's where you typically need much more concurrent client-side HTTP access, when talking to other web services).

2. Asynchronous? So... ?

So why does it actually matter whether you use a blocking or a non-blocking client? Well, the "async" (aka non-blocking) part is obviously important in general for highly concurrent use cases, where JVM thread scaling is not very good beyond hundreds of threads.
But more interestingly, it also really starts to matter when you have "branching" with your service: that is, for each call your service handles, it needs to make multiple calls to other services. With blocking http clients you either have to spin new threads (complicated, and somewhat costly); or do requests sequentially. Former can achieve low(er) latency; latter is simpler and more efficient. But with asynchronous calls, you can actually fire all (or some) requests concurrently, as early as possible; do some processing after this, and when necessary, check for request results (via Futures). While not as trivially easy as sequential calls, this can be almost as good, and with much improved latency.
High branching factor is what powers many high-volume web sites: for example, high-traffic web pages such as Amazon.com's pages are composed from multiple separately computed blocks, many of which are built based on multiple independent calls to backend services. This can not be done with tolerable latency by using sequential web service calls.
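The "fire early, collect later" pattern can be sketched with plain JDK futures. To be clear about the assumptions: this does not use the Async HTTP Client API, the callBackend method is a hypothetical stand-in that simulates latency by sleeping, and CompletableFuture.supplyAsync runs on pool threads, whereas a real non-blocking client would not tie up a thread per request. Only the calling pattern is the point here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

/** Hypothetical sketch of branching calls: start all requests first, wait only when needed. */
class FanOut {
    /** Stand-in for a non-blocking backend call; latency is simulated by sleeping. */
    static CompletableFuture<String> callBackend(String name, long latencyMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return name + "-block";
        });
    }

    static String renderPage() throws Exception {
        // 1. Fire all requests as early as possible
        List<Future<String>> pending = new ArrayList<>();
        pending.add(callBackend("header", 100));
        pending.add(callBackend("reviews", 100));
        pending.add(callBackend("recommendations", 100));

        // 2. ... other request processing could happen here ...

        // 3. Collect results in page order; each get() waits only for what is still missing
        StringBuilder page = new StringBuilder();
        for (Future<String> result : pending) {
            page.append(result.get()).append('\n');
        }
        return page.toString();
    }
}
```

With sequential blocking calls, total latency is roughly the sum of the individual calls; with the pattern above it approaches the latency of the slowest call.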

Beyond non-blocking part, it is also likely that over time blocking convenience facade will be developed as well, so it is not unreasonable to expect this to develop into more general-purpose solution for HTTP access (at least that is my personal opinion/wish).

Anyway: cool beans; we'll see how this project advances. So far progress has been remarkably rapid -- in fact, version 1.0 seems to be in sight; as tentative feature list has been discussed on the user list. More on 1.0 when it is out in the wild.

3. Disclosure

In spirit of full disclosure, I should mention that Jean-Francois (the author) is actually my current co-worker -- but at least I know what I am talking about when praising him. :-)

Thursday, October 29, 2009

On State of State Machines

State Machines are things that all programmers should recall from their basic Computer Science courses, along with other basics like binary trees and merge sort. But until fairly recently I thought that they are mostly useful as low-level constructs, built by generator code like compiler compilers and regular expression packages. This contempt was fueled by having seen multiple cases of gratuitous usage: for example, state machines were used to complicate simple task of parsing paragraphs of line-oriented configuration files when I worked at Sun. So my thinking was that state machines are only fit for compilers to produce, something non-kosher for enlightened developers.

But about two years ago I started realizing that state machines are actually nifty little devices, not only to be created by software but also by wetware. And that their main benefit can actually be simplicity of the resulting solution -- when used in right places.

1. Block Me Not

The first place where I realized the usefulness of state machines was within the bowels of an XML parser. Specifically, when trying to write a non-blocking (asynchronous) parsing core for the Aalto XML processor (more on this nice piece of software engineering in future, I promise).

The challenge with writing a non-blocking parser is simple: whereas a blocking parser -- one that explicitly reads input from a stream and can block until input is available -- has full control of control flow, including the ability to only stop when it wants to (at a token boundary; after fully parsing an element, or comment), a non-blocking parser is at the mercy of whoever feeds it with data. If there is no more data for the non-blocking parser to read, it has to store whatever state it has and return control to the caller, ready or not. Which basically means it may have to stop parsing at any given character; or even better, half-way THROUGH a character, which can happen with multi-byte UTF-8 characters. And do it in such a way that whenever more data does become available, it is ready to resume parsing based on the newly available data.

So what is needed to do that? Ability to fully store and restore the state. And this is where state machine made its entrance: gee, wouldn't it make sense to explicitly separate out state, and create a state machine to handle execution. Or, in this case, set of small state machines.

Indeed it does; and once you go that route the implementation is not nearly as complicated as it would be if one tried to do it all using regular procedural code (which might just be infeasible altogether).
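The "stopping half-way through a character" case makes a compact illustration. Below is a minimal sketch of an incremental UTF-8 decoder whose entire state is two ints, so it can return to the caller at any byte and resume later. It is not code from Aalto; the class name IncrementalUtf8Decoder is hypothetical, and validation is deliberately thin (overlong encodings and surrogate ranges are not rejected).

```java
/** Hypothetical sketch: UTF-8 decoding that can be suspended between any two bytes. */
class IncrementalUtf8Decoder {
    private int needed = 0;    // state: continuation bytes still expected
    private int codePoint = 0; // bits of the current character collected so far

    /** Decodes whatever is available; leftover partial characters are kept as state. */
    void feed(byte[] chunk, StringBuilder out) {
        for (byte b : chunk) {
            int c = b & 0xFF;
            if (needed == 0) {
                if (c < 0x80) {                   // single-byte (ASCII)
                    out.append((char) c);
                } else if ((c & 0xE0) == 0xC0) {  // lead byte of 2-byte sequence
                    codePoint = c & 0x1F;
                    needed = 1;
                } else if ((c & 0xF0) == 0xE0) {  // lead byte of 3-byte sequence
                    codePoint = c & 0x0F;
                    needed = 2;
                } else if ((c & 0xF8) == 0xF0) {  // lead byte of 4-byte sequence
                    codePoint = c & 0x07;
                    needed = 3;
                } else {
                    throw new IllegalArgumentException("Invalid UTF-8 lead byte: " + c);
                }
            } else {
                if ((c & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("Invalid UTF-8 continuation byte: " + c);
                }
                codePoint = (codePoint << 6) | (c & 0x3F);
                if (--needed == 0) {
                    out.appendCodePoint(codePoint);
                }
            }
        }
    }

    /** True if input so far ended in the middle of a multi-byte character. */
    boolean isMidCharacter() {
        return needed > 0;
    }
}
```

Because nothing lives on the call stack between calls to feed(), running out of input needs no special handling: the method simply returns.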

2. All Your Base64 Are Belong To Us

Ok, complex state keeping should be an obvious place for state machines to rule. But much smaller tasks can benefit as well.
Base64 decoding is a good example: given that decoding needs to be flexible with respect to things like white space (linefeeds at arbitrary locations), possible limitations on amount that can be decoded with one pass (with incremental parsing, as is the case with Woodstox), and the need to handle possible padding at the end, writing a method that does base64 decoding is a non-trivial task. I tried doing that, and resulting code was anything but elegant. I would even go as far as call it fugly.

That is, until I realized I should apply earlier lessons and see what comes of simple state keeping and looping. Lo and behold, tight loop of base64 decoding is tight both by amount of code (rather small) and processing time (pretty damn fast). Resulting state machine has just 8 states (4 characters per 24-bit unit to decode, few more to handle padding), and code is surprisingly simple and easy to follow (but still long enough not to be included here -- check out Woodstox/Stax2 API class "org.codehaus.stax2.ri.typed.CharArrayBase64Decoder" if you are interested in details).
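To give a flavour of what such a loop can look like, here is a simplified sketch written for this post. It is not the Woodstox/Stax2 class mentioned above: the class name Base64StateMachine is made up, and it uses six states instead of eight (four positions within a 4-character group, plus two for padding). It tolerates white space anywhere and input split at arbitrary points.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: incremental base64 decoding driven by one small state variable. */
class Base64StateMachine {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    // States 0..3: number of characters seen in the current 4-character group
    private static final int PAD_NEEDED = 4; // saw 2 chars and one '=', expecting second '='
    private static final int DONE = 5;       // padding complete; only white space may follow

    private int state = 0;
    private int bits = 0; // up to 24 bits collected for the current group
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Decodes one chunk; may be called repeatedly with input split anywhere. */
    void feed(CharSequence chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char ch = chunk.charAt(i);
            if (Character.isWhitespace(ch)) {
                continue; // linefeeds etc. are allowed at arbitrary locations
            }
            if (ch == '=') {
                handlePadding();
                continue;
            }
            int value = ALPHABET.indexOf(ch);
            if (value < 0 || state > 3) {
                throw new IllegalArgumentException("Unexpected character: " + ch);
            }
            bits = (bits << 6) | value;
            if (state == 3) { // full 24-bit unit: emit 3 bytes
                out.write(bits >> 16);
                out.write(bits >> 8);
                out.write(bits);
                bits = 0;
                state = 0;
            } else {
                state++;
            }
        }
    }

    private void handlePadding() {
        if (state == 2) {          // 12 bits collected: one byte of data
            out.write(bits >> 4);
            state = PAD_NEEDED;
        } else if (state == 3) {   // 18 bits collected: two bytes of data
            out.write(bits >> 10);
            out.write(bits >> 2);
            state = DONE;
        } else if (state == PAD_NEEDED) {
            state = DONE;
        } else {
            throw new IllegalArgumentException("Unexpected padding character");
        }
        bits = 0;
    }

    /** Returns decoded bytes; fails if input ended in the middle of a group. */
    byte[] result() {
        if (state != 0 && state != DONE) {
            throw new IllegalStateException("Truncated base64 input");
        }
        return out.toByteArray();
    }
}
```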

3. Case of "I really should have..."

One more case where state machine approach would probably have worked well is that of "decoding framed XML stream".

At work, there is an analysis system that has to read gigabytes of data. Data consists of a sequence of short XML documents, separated by marker byte sequences that act as simple framing mechanism. Task itself is simple: take a stream, split it into segments (by markers), feed to parser. But to make it both reliable and efficient is not quite as easy: marker sequence consists of multiple bytes, and theoretically bytes in question could belong to a document: it's the full sequence that can not be contained within document. Plus for extra credit one should try to avoid having to re-read data multiple times.

So, foolishly I went ahead and managed to write piece of code that does such de-framing (demultiplexing) efficiently (which is needed for scale of processing we do). But code looks butt ugly; and took a bit of testing to make work correctly. Unfortunately I only had the light bulb moment after writing (... and fixing) the code: Would this not be a PERFECT case for writing a little state machine, where one state is used for each byte of the marker sequence?
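For what it is worth, the "one state per marker byte" idea could look roughly like the sketch below. This is not the code from work; the class name Deframer is hypothetical, and it makes one simplifying assumption that is enforced in the constructor: the first marker byte does not occur again later in the marker. Without that assumption, a mismatch in the middle of a partial match would need proper fallback handling (as in Knuth-Morris-Pratt matching).

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits a byte stream into documents at multi-byte markers. */
class Deframer {
    private final byte[] marker;
    private int matched = 0; // state: number of marker bytes matched so far
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    private final List<byte[]> documents = new ArrayList<>();

    Deframer(byte[] marker) {
        if (marker.length == 0) {
            throw new IllegalArgumentException("Marker must not be empty");
        }
        for (int i = 1; i < marker.length; i++) {
            if (marker[i] == marker[0]) { // simplifying assumption, see lead-in
                throw new IllegalArgumentException("First marker byte must not repeat");
            }
        }
        this.marker = marker.clone();
    }

    /** Consumes one chunk; every input byte is examined exactly once. */
    void feed(byte[] chunk) {
        for (byte b : chunk) {
            if (b == marker[matched]) {
                if (++matched == marker.length) { // full marker seen: document boundary
                    documents.add(current.toByteArray());
                    current.reset();
                    matched = 0;
                }
                continue;
            }
            // Mismatch: the partially matched bytes were document content after all
            current.write(marker, 0, matched);
            if (b == marker[0]) {
                matched = 1; // this byte may start a new marker
            } else {
                matched = 0;
                current.write(b);
            }
        }
    }

    /** Flushes trailing content (if any) and returns all documents seen. */
    List<byte[]> finish() {
        current.write(marker, 0, matched);
        matched = 0;
        if (current.size() > 0) {
            documents.add(current.toByteArray());
            current.reset();
        }
        return documents;
    }
}
```

A partial marker that straddles two chunks needs no special casing, since the match position is part of the state and simply carries over to the next feed() call.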

Maybe next time I actually consider techniques I recently re-discovered, and apply them appropriately. :-)

Tuesday, August 04, 2009

Jackson 1.2 released!

Even without Netcraft confirming it, this much is official: Jackson 1.2.0 has just been released, and is available from here.

Besides the 2 Grand Features already show-cased (BYOC, Mix-in Annotations), here are some other choice additions:

  • Ability to use declared (static) type for serialization, instead of runtime type: useful if you want to enforce public API, and avoid leaking implementation details
  • Ability to suppress errors for unknown (unrecognized) properties
  • More support for non-standard JSON content: can optionally parse content where field names are not quoted (something Javascript allows and standard JSON does not)
  • Ability to use "delegating" creators -- constructors and factory methods that take intermediate bound object to construct POJO from. Typically this means taking in a Map<String,Object> (which Jackson binds), and then extracting loosely typed data for POJO
  • ObjectMapper now implements JsonNodeFactory: can construct all JsonNode types without explicitly getting a factory implementation -- useful when building Tree Models from scratch

As always, download responsibly! And don't forget to click the advert... I mean, tip the waitresses.

ps. I'll be off for a brief (2 week) vacation, so there won't be much new material to read here.

Thursday, July 09, 2009

Are GAE developers a bunch of

ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: API is a mess, concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shaken my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right the second time around. No such luck: Google App Engine was a stillbirth, plagued by very similar problems as Android. Most specifically, a significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by a different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to GAE team). But this is just very ridiculous (dare I say, retarded?) argument, because:

  1. Being but an API package, there is no functionality that could possibly have security implications (yes, I know exactly what is within those few classes -- the only actual code is for implementation discovery, which was copied from SAX), and
  2. If there are problems with implementations of the API (which should be irrelevant, but humor me here), same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about package in question. I could as well ask about regular expression or image processing packages it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is little bit more pressure in fixing these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both are doing "their own things", although those are very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering a low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc.), but not a managed runtime execution service).

Saturday, June 27, 2009

Woodstox, high impact factor & being #32 on Top Open Source Java libs list

Another interesting data point, this time from analysing Maven dependency paths: the "Most Referenced" list. Looks like Woodstox is quite widely used by projects that use, or at least declare their dependencies using, Maven: I assume the magic number 1838 (which gives rank #32) could mean the number of other projects depending on Woodstox. Not too shabby for an XML parser. Getting on the first result page is quite remarkable; especially considering that Woodstox ranks higher than many other worthy Java open source libraries like XStream, Hibernate, Quartz, Xalan and Velocity. And only slightly (by about 50% :-) ) trailing such a ubiquitous thingy as Spring.

Although this is just one way of estimating popularity of various (Java) OS libs, it is still interesting, because it has similarities to how scientific articles are ranked (impact factor; although here weights are uniform). And also since it could lend itself to Google PageRank style extensions as well... let's see.

About me

  • I am known as Cowtowncoder