Thursday, December 31, 2009

Upgrading from Woodstox 3.x to 4.0

It has now been almost one year since Woodstox 4.0 was released.
Given this, it would be interesting to know how many Woodstox users continue using older versions, and how many have upgraded.

My guess (somewhat educated, too, based on bug reports and some statistics on Maven dependencies) is that adoption has been quite slow. I think this is primarily due to 3 things:

  1. Older versions work well, and fulfill all current needs of the user
  2. New functionality that 4.0 offers is not widely known, and/or is not (currently!) needed
  3. There are concerns that, because this is a major version upgrade, the upgrade might not go smoothly.

I can not argue against (1): Woodstox has been a rather solid product since first official releases; and 3.2 in particular is a well-rounded rock solid XML processor (if you are using an earlier version, however, at least upgrade to latest 3.2 patch version, 3.2.9!).
And with respect to (2), I have already covered the most important pieces of new functionality: Typed Access API and Schema Validation.

But so far I have not written anything about incompatible changes between 3.2 and 4.0 versions. So let's rectify that omission.

1. Why Upgrade?

But first: maybe it is worth reiterating a couple of reasons why you might want to upgrade at all:

  1. You might want to validate XML documents you read or write against W3C Schema (aka XML Schema). Earlier versions only allowed validating against DTDs and Relax NG schemas
  2. If you want to access typed content -- that is, numbers, XML qualified names, even binary content, contained as XML text -- new Typed Access API simplifies code a lot, and also makes it more efficient.
  3. Latest versions of useful helper libraries like StaxMate require Woodstox 4.0 (StaxMate 2.0 needs 4.x, for example)
  4. No new development will be done for 3.2 branch; and eventually not even bug fixes.

Assuming you might want to upgrade, what possible issues could you face?

2. Backwards incompatible changes since 3.2

Based on my own experiences, there are few issues with upgrade. Although the official list of incompatibilities has a few entries, I have only really noticed one class of things that tend to fail: Unit tests!

Sounds bad? Actually, yes and no: no, because these are not real failures (the ones I have seen, at least). And yes, since it means that you end up fixing broken test code (extra overhead without tangible benefits). But this is one of the challenges with unit tests: fragility is often desirable, but not always so.

Specific problem that I have seen multiple times is related to one cosmetic aspect of XML: inclusion of white space with elements.

Woodstox 3.2 used to output empty elements with "extra" white space, like so:

<empty />

but 4.0 will not add this white space:

<empty/>

(this is a new feature as per WSTX-125 Jira entry)

Some existing unit tests for systems I have worked on compare literal XML for output tests. This is not optimal, but it is a bit less work than writing tests in a more robust way, to check for logical (not physical) equality. So where such tests formerly assumed the existence of this white space, they need to be modified not to expect it (or to allow either form).
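One way to make such tests less fragile is to compare parsed trees instead of characters. Here is a minimal sketch using only the JDK's DOM parser; the class and method names are made up for illustration, and a real test suite might prefer a dedicated XML comparison library:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test helper: compares two XML strings by structure, not by characters
class XmlLogicalEquality {
    static boolean logicallyEqual(String xml1, String xml2) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        DocumentBuilder b = f.newDocumentBuilder();
        Document d1 = b.parse(new InputSource(new StringReader(xml1)));
        Document d2 = b.parse(new InputSource(new StringReader(xml2)));
        // Merge adjacent text nodes so that text segmentation does not matter
        d1.normalize();
        d2.normalize();
        // isEqualNode() compares names, attributes and children recursively
        return d1.getDocumentElement().isEqualNode(d2.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(logicallyEqual("<root><empty /></root>", "<root><empty/></root>"));
    }
}
```

Note that white space between elements still counts as text content here, so this only hides differences in how tags themselves are written (and in attribute order).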

3. Other challenges?

Actually, I have not seen any actual problems, or other cosmetic problems. But here are other changes that are most likely to cause compatibility problems (refer to the full list mentioned earlier for couple of changes that are much less likely to do so):

  • "Default namespace" and "no prefix" are now consistently reported as empty Strings, not nulls (unless explicitly specified otherwise in relevant Stax/Stax2 Javadocs). Usually this does not cause problems, because Stax-dependent code has had to deal with inconsistencies between Stax implementations; but it could cause problems if code is expecting null.
  • "IS_COALESCING" was (accidentally) enabled for Woodstox versions prior to 4.0. This was fixed for 4.0 (as per Stax specification), but it is possible that some code was relying on never getting partial text segments (if the developer was not aware of Stax allowing such splitting of segments, similar to how SAX API does it).
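Code can be written so that neither of these two changes matters. The following is a rough sketch against the plain Stax API bundled with the JDK (helper names are hypothetical): it treats null and empty String alike, and it accumulates text across however many segments the parser chooses to report.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helpers for code that should not care about parser differences
class StaxCompat {
    // Treats null and "" alike, whichever one the parser reports for "no prefix"
    static String orEmpty(String prefixOrUri) {
        return (prefixOrUri == null) ? "" : prefixOrUri;
    }

    // Reader must point to START_ELEMENT of a text-only element;
    // appends every text segment instead of assuming a single CHARACTERS event
    static String collectText(XMLStreamReader sr) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        int type;
        while ((type = sr.next()) != XMLStreamConstants.END_ELEMENT) {
            if (type == XMLStreamConstants.START_ELEMENT) {
                throw new XMLStreamException("Expected a text-only element");
            }
            if (type == XMLStreamConstants.CHARACTERS
                    || type == XMLStreamConstants.CDATA
                    || type == XMLStreamConstants.SPACE) {
                sb.append(sr.getText());
            }
        }
        return sb.toString();
    }

    static String firstElementText(String xml, boolean coalescing) throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        // Alternatively, ask for coalescing explicitly rather than relying on defaults
        f.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        XMLStreamReader sr = f.createXMLStreamReader(new StringReader(xml));
        try {
            sr.nextTag(); // to the root element
            return collectText(sr);
        } finally {
            sr.close();
        }
    }
}
```

If the calling code really does want single text events, explicitly enabling IS_COALESCING on the input factory is the simpler fix; the accumulating loop is for code that should work either way.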

4. Upgrade or not?

I would recommend investigating an upgrade, if for nothing else then because of the maintenance aspect: pre-4.0 versions will not be actively maintained in the future. But it is good to be aware of what has changed, and of course having a good set of unit tests should guard against unexpected problems.

And hey, it's soon 2010 -- Woodstox 3.2 is soooo 2008. :-)

Wednesday, October 28, 2009

Data Format anti-patterns: converting between secondary artifacts (like xml to json)

One commonly asked but fundamentally flawed question is "how do I convert xml to json" (or vice versa).
Given frequency at which I have encountered it, it probably ranks high on list of data format anti-patterns.

And just to be clear: I don't mean that there is any problem in having (or wanting to have) systems that produce data using multiple alternative data formats (views, representations). Quite the contrary: the ability to do so is at the core of REST(-like) web services, which are one useful form of web services. Rather, I think it is wrong to convert between such representations.

1. Why is it Anti-pattern?

Simply put: you should never convert from secondary (non-authoritative) representation into another such representation. Rather, you should render your source data (which is usually in relational model, or objects) into such secondary formats. So: if you need xml, map your objects to xml (using JAXB or XStream or what you have); if you need JSON, map it using Jackson. And ditto for the reverse direction.

This of course implies that there are cases where such transformation might make sense: namely, when your data storage format is XML (Native Xml DBs) or Json (CouchDB). In those cases you just have to worry about the practical problem of model/format impedance, similar to what happens when doing Object-Relational Mapping (ORM).
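To make the "render from the source, do not convert between renditions" rule concrete, here is a small sketch. The Person class and the renderer are made up for illustration, and the hand-written JSON output is only a stand-in for what a real data binder (such as Jackson) would do:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical source object: this is the authoritative data
class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Both renditions are produced from the object, never from each other
class PersonRenderer {
    static String toXml(Person p) throws XMLStreamException {
        StringWriter w = new StringWriter();
        XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(w);
        sw.writeStartElement("person");
        sw.writeStartElement("name");
        sw.writeCharacters(p.name);
        sw.writeEndElement();
        sw.writeStartElement("age");
        sw.writeCharacters(String.valueOf(p.age));
        sw.writeEndElement();
        sw.writeEndElement();
        sw.flush();
        sw.close();
        return w.toString();
    }

    // Minimal stand-in for a JSON data binder; only escapes quotes and backslashes
    static String toJson(Person p) {
        String escaped = p.name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"name\":\"" + escaped + "\",\"age\":" + p.age + "}";
    }
}
```

Neither method knows about the other format; adding a third rendition means adding a third method, not a new converter between existing outputs.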

2. Ok: simple case is simple, but how about multiple mappings?

Sometimes you do need multi-step processing; for example, if your data lives in the database. Following my earlier suggestion, it would seem like you should convert directly from relational model (storage format) into resulting transfer format (json or xml). Ideally, yes: if there are such conversions. But in practice it is more likely that a two-phase mapping (ORM from database to objects; and then from objects to xml or json) works better: mostly because there are good tools for separate phases, but fewer that would do the end-to-end rendition.

Is this wrong? No. To understand why, it is necessary to understand the 3 classes of formats that we are talking about:

  • Persistence (storage) format, used for storing your data: usually relational model but can be something else as well (objects for object DBs; XML for native XML databases)
  • Processing format: Objects or structs of your processing language (POJOs for Java) that you use for actual processing. Occasionally this can also be something more exotic; like XML when using XSLT (or relational data for complicated reporting queries)
  • Transfer format: Serialization format used to transfer data between end points (or sometimes time-shifting, saving state over restart); may be closely bound to processing format (as is the case for Java serialization)

So what I am really saying is that you should not transform within a class of formats; in this case between 2 alternate transfer formats. It is acceptable (and often sensible) to do conversions between classes of formats; and sometimes doing 2 transforms is simpler than trying to do one bigger one. Just not within a class.

3. Three Formats may be simpler than Just One

One more thing about the above-mentioned three formats: there is also a related fallacy of thinking that there is a problem if you are using multiple formats/models (like relational model for storage, objects for processing and xml or json for transfer). The assumption is that the additional transformations needed to convert between representations are wasteful enough to be a problem in and of themselves. But it should be rather obvious why there are often distinct models and formats in use: because each is optimal for a specific use case. Storage format is good for, gee, storing data; processing model is good for efficiently massaging data; and transfer format is good for piping it through the wire. As long as you don't add gratuitous conversions in-between, transforming on the boundary is completely sensible; especially considering the alternative of trying to find a single model that works for all cases. One only needs to consider the case of the "XML for everything" cluster (esp. XML for processing, aka XSLT) to see why this is an approach that should be avoided (or, Java serialization as transfer format -- that is another anti-pattern in and of itself).

Thursday, October 01, 2009

Critical updates: Woodstox 4.0.6 released

This just in: Woodstox 4.0.6 was released, and it contains just one fix; but that one to a critical problem (text content truncation for long CDATA sections, when using XMLStreamReader.getElementText()). Upgrade is highly recommended for anyone using earlier 4.0 releases.

One more potentially useful addition is that I uploaded a "relocation" Maven pom, for the non-existing artifact "wstx-asl" v4.0.6 (the real id is "woodstox-core-asl", as of 4.0; "wstx-asl" was used with 3.2 and previous). This was suggested by a user, to make the upgrade a bit less painful -- the problem is that Woodstox tends to be one of those ubiquitous transitive dependencies for anyone running a Soap service (or nowadays almost any server-side XML processing system).

Next big thing should then be Jackson-1.3, stay tuned!

Friday, September 18, 2009

Typed Access API tutorial, part III/b: binary data, server-side

(note: this is part B of "Typed Access API tutorial: binary data"; first part can be found here)

1. Server-side

After implementing the client, let's next implement matching sample service that simply reads all files from a directory and creates download message that contains all files along with checksums for verifying their correctness (in real use case, those would probably be pre-computed). Simplest way to deploy service is as a Servlet-based web application; a single class and matching web.xml will do the trick.

Resulting code is meant to just show how (relatively) simple handling of binary data is -- obviously a real client and service would have much more checking for error cases, as well as for authentication, authorization, namespacing to avoid collision and so on.

Full source code can be found from Woodstox source code repository (see 'src/samples/BinaryService.java') but here is the beef:


    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException
    {
        resp.setContentType("text/xml");
        try {
            writeFileContentsAsXML(resp.getOutputStream());
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    final static String DIGEST_TYPE = "SHA"; 

    private void writeFileContentsAsXML(OutputStream out)
        throws IOException, XMLStreamException
    {
        XMLStreamWriter2 sw = (XMLStreamWriter2) _xmlOutputFactory.createXMLStreamWriter(out);
        sw.writeStartDocument();
        sw.writeStartElement("files");
        byte[] buffer = new byte[4000];
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(DIGEST_TYPE);
        } catch (Exception e) { // no such hash type?
            throw new IOException(e);
        }
        for (File f : _downloadableFiles.listFiles()) {
            sw.writeStartElement("file");
            sw.writeAttribute("name", f.getName());
            sw.writeAttribute("checksumType", DIGEST_TYPE);
            FileInputStream fis = new FileInputStream(f);
            int count;
            while ((count = fis.read(buffer)) != -1) {
                md.update(buffer, 0, count);
                // note: can write separate chunks without problems
                sw.writeBinary(buffer, 0, count);
            }
            fis.close();
            sw.writeEndElement(); // file
            sw.writeStartElement("checksum");
            sw.writeBinaryAttribute("", "", "value", md.digest());
            sw.writeEndElement(); // checksum
        }
        sw.writeEndElement(); // files
        sw.writeEndDocument();
        sw.close();
    }

As with the client, there really isn't anything too special here. Just the usual service, with bit of Stax2 Typed Access API usage.

I briefly tested this by bundling it up as a web app (if you want to do the same, run Ant target "war.samples" in Woodstox trunk), running web app under Jetty 6.1, and accessing from both web browser and via BinaryClient class. Worked as expected right away (which, granted, was somewhat unexpected... usually there are minor tweaks needed, but not today).

2. Output

Just to give an idea of what results should look like, here's what I see when I download a single file (run.sh):


<?xml version='1.0' encoding='UTF-8'?>
<files><file name="run.sh" checksumType="SHA">IyEvYmluL3NoCgojIExldCdzIGxpbWl0IG1lbW9yeSwgZm9yIHBlcmZvcm1hbmNlIHRlc3RzIHRv IGFjY3VyYXRlbHkgY2FwdHVyZSBHQyBvdmVyaGVhZAoKIyAtRGphdmEuY29tcGlsZXI9IC1jbGll bnQgXApqYXZhIC1YWDpDb21waWxlVGhyZXNob2xkPTEwMDAgLVhteDQ4bSAtWG1zMTZtIC1zZXJ2 ZXJcCiAtY3AgbGliL3N0YXgtYXBpLTEuMC4xLmphcjpsaWIvc3RheF9yaS5qYXJcCjpsaWIvbXN2 L1wqXAo6bGliL2p1bml0L2p1bml0LTMuOC4xLmphclwKOmJ1aWxkL2NsYXNzZXMvd29vZHN0b3g6 YnVpbGQvY2xhc3Nlcy9zdGF4MlwKOnRlc3QvY2xhc3NlczpidWlsZC9jbGFzc2VzL3Rvb2w6YnVp bGQvY2xhc3Nlcy9zYW1wbGVzXAogJCoK</file><checksum value="qAZIQ6GDUJYRgiubW/H+5GZaWg0="/></files>

3. More to know about Base64 variants

One more thing to note is the existence of multiple slightly incompatible Base64 variants (see "URL Applications" section). So which one does Typed Access API use?

The one you define it to use, of course! Stax2 API actually allows caller to specify the variant to use -- sample code just happens to use the default variant (i.e. uses methods that just call alternatives that do take a Base64Variant argument). Stax2-defined Base64 variants (from class 'org.codehaus.stax2.typed.Base64Variants') are:

  • MIME: this is what is usually considered "the base64" variant: uses default alphabet, requires padding, and uses 76-character lines with linefeed for content. This is the default variant used for element content.
  • MIME_NO_LINEFEEDS is similar to MIME, but does not split output in lines -- this is the default variant used for attribute values (due to verbosity caused by encoding linefeeds in XML attribute values)
  • PEM is similar to MIME, but mandates shorter (60 character) line length
  • MODIFIED_FOR_URL: uses alternate alphabet (hyphen and underscore instead of plus and slash), does not use padding or line splitting.

And these are all implemented by Woodstox. In addition, one can use custom encodings by implementing custom Base64Variant object and passing that explicitly to base64-binary read- and write-methods.
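To see what the differences between such variants look like in practice, here is a small sketch. Note that it does not use the Stax2 Base64Variants class at all: it uses the JDK's own java.util.Base64 encoders, which are only rough analogues of the MIME, MIME_NO_LINEFEEDS and MODIFIED_FOR_URL variants listed above (the class and method names are made up).

```java
import java.util.Base64;

// Rough JDK analogues of three of the variants; not the Stax2 implementation
class Base64VariantsDemo {
    // Standard alphabet, padding, output split into 76-character lines
    static String mime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Standard alphabet, padding, no line splitting
    static String noLinefeeds(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Hyphen and underscore instead of plus and slash; no padding, no line splitting
    static String forUrl(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xfb, (byte) 0xff };
        System.out.println(noLinefeeds(data)); // standard alphabet, padded
        System.out.println(forUrl(data));      // URL-safe alphabet, unpadded
    }
}
```

The two-byte input is chosen so that the encoded form contains both characters that differ between the alphabets, as well as padding.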

4. Performance?

Beyond simple usage shown so far, what more is there to know about handling binary data?

One open question is performance: how much faster is Typed Access API, compared to using alternatives like XMLStreamReader.getElementText() followed by decoding using, say, Jakarta Commons' base64 codec? There are no numbers yet, but producing some will be one of the high priority items on my "things to research for Blog" list.
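For reference, the "alternative" approach being compared against looks roughly like the sketch below. It is an illustration only: it uses the JDK's java.util.Base64 as the codec (rather than the Jakarta Commons one), and the class name is made up. The point to notice is that the whole element content is first materialized as a String and then decoded into a second full-size buffer, instead of being decoded in chunks.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Baseline: plain Stax getElementText() followed by a separate base64 decode
class TextThenDecode {
    static byte[] readBinaryElement(String xml) throws XMLStreamException {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            sr.nextTag(); // to the element that contains base64 text
            String base64 = sr.getElementText(); // entire content held in memory as text
            // MIME decoder skips linefeeds and other non-alphabet characters
            return Base64.getMimeDecoder().decode(base64);
        } finally {
            sr.close();
        }
    }
}
```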

Tuesday, September 08, 2009

Typed Access API tutorial, part III/a: binary data, client-side

(author's note: oh boy, this last piece of the "Typed Access API series" has been long coming -- apologies, and "better late than never")

Now that we have tackled most of the Stax2 Typed Access API (reading and writing simple values, arrays), let's consider the last remaining part: that of reading and writing base64-encoded binary data. For this installment, let's implement a simple web service that can be used for downloading files, as well as client to use that service.

Use of XML for such purpose may seem bit contrived, but there are other valid use cases for binary-in-xml (even if the example wasn't): for example, it may well make sense to embed small images (like icons), digital signatures, encryption keys and other non-textual data within documents. Sometimes convenience of inlining binary content within message is worth the modest overhead (base64 imposes +33% storage overhead, and similar processing overhead).
In our example, we can embed multiple files with associated metadata quite easily without having to split the logical document. But both client and server can still handle files one-by-one with streaming interfaces, meaning that memory usage need not grow without bounds.

Finally, unlike many other xml processing packages, Woodstox does not cut corners when it comes to processing efficiency: base64 processing implementation is a significant improvement over using existing third-party base64 codecs on other processing APIs (regular SAX, Stax or DOM).

So much for the philosophic part of why to use (or not to use) xml. Let's have a look at a simple implementation to show the binary content handling pieces that we need, along with a bit of glue to make the example code work.
(note: source code is also accessible)

1. Message format

Here is the simple xml message format we will be using:

  <files>
    <file name="test.jpg" checksumType="SHA">... base64 encoded content ...</file>
    <checksum value="...base64 encoded hash of content..." />
    <!-- ... and more files, if need be... -->
  </files>

That is, a single message contains one or more files, each with an associated checksum. The checksum is used to verify that contents were passed unmodified (as opposed to being corrupted by transfer). Simple but functional.
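The checksum part on its own needs nothing beyond the JDK. Here is a minimal sketch of computing and verifying a base64-encoded digest; the class and method names are invented for illustration, and "SHA-1" is used as the algorithm name (the standard JDK provider also accepts "SHA" as an alias for it).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper for the 'checksum' element of the message format
class Checksums {
    // Digest of the content, base64-encoded as it would appear in the 'value' attribute
    static String base64Digest(String algorithm, byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        return Base64.getEncoder().encodeToString(md.digest(content));
    }

    // Recomputes the digest and compares it against the transmitted value
    static boolean matches(String algorithm, byte[] content, String expectedBase64)
            throws NoSuchAlgorithmException {
        byte[] expected = Base64.getDecoder().decode(expectedBase64);
        byte[] actual = MessageDigest.getInstance(algorithm).digest(content);
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(base64Digest("SHA-1", content));
    }
}
```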

2. Client-side

So let's start with sample client code; code downloads bunch of files from the service (for now assuming URL determines set of files we'll get with some criteria).

For this example we will just use the regular http client that JDK comes equipped with (which actually works pretty well for many use cases -- for others, Jakarta httpclient is the cat's meow).
Full source code can be found at Woodstox SVN repository (under 'src/samples') but here's the interesting Client method:


public List<File> fetchFiles(URL serviceURL) throws Exception
{
    List<File> files = new ArrayList<File>();
    URLConnection conn = serviceURL.openConnection();
    conn.setDoOutput(false); // only true when POSTing
    conn.connect();
    // note, should check 'if (conn.getResponseCode() != 200) ...'

    // Ok, let's read it then... (note: StaxMate could simplify a lot!)
    InputStream in = conn.getInputStream();
    XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory.newInstance().createXMLStreamReader(in);
    sr.nextTag(); // to "files"
    File dir = new File("/tmp"); // for linux...
    byte[] buffer = new byte[4000];

    while (sr.nextTag() != XMLStreamConstants.END_ELEMENT) { // one more 'file'
        String filename = sr.getAttributeValue("", "name");
        String csumType = sr.getAttributeValue("", "checksumType");
        File outputFile = new File(dir, filename);
        FileOutputStream out = new FileOutputStream(outputFile);
        files.add(outputFile);
        MessageDigest md = MessageDigest.getInstance(csumType);

        int count;
        // Read binary contents of the file, calc checksum and write
        while ((count = sr.readElementAsBinary(buffer, 0, buffer.length)) != -1) {
            md.update(buffer, 0, count);
            out.write(buffer, 0, count);
        }
        out.close();
        // Then verify checksum
        sr.nextTag();
        byte[] expectedCsum = sr.getAttributeAsBinary(sr.getAttributeIndex("", "value"));
        byte[] actualCsum = md.digest();
        if (!Arrays.equals(expectedCsum, actualCsum)) {
            throw new IllegalArgumentException("File '"+filename+"' corrupt: content checksum does not match expected");
        }
        sr.nextTag(); // to match closing "checksum"
    }
    return files;
}

Much of the code deals with connecting to the service; actual access is rather simple; only complexity comes from streamability of API (i.e. you read chunks of binary data, instead of reading the whole thing).

What is left, then, is the server side... which will follow shortly (I swear, won't take months this time)

Thursday, July 09, 2009

Are GAE developers a bunch of

ignorant, incompetent boobs... or what?

Usually I avoid ranting, at least on my blog entries. Thing is, negative output creates negative image: there is little positive in negativity. If you have nothing good to say, say nothing, and so on.

But sometimes enough is enough. This is the case with Google, and their pathetic attempts at Creating Java(-like) platforms.

1. Past failures: Android

In the past I have wondered at the clusterfuck known as Android: API is a mess, concoction of JDK pieces included (and mixed with arbitrary open source APIs and implementation classes) is arbitrary and incoherent. But since I don't really work much in the mobile space, I have just shook my head when observing it -- it's not really my problem. Just an eyesore.

But it is relevant in that it set the precedent for what to expect: despite some potentially clever ideas (regarding the lower level machinery), it all seems like a trainwreck, heading nowhere fast. And the only saving grace is that most mobile development platforms are even worse.

2. Current problems: start with ignorance

After this marvellous learning experience, you might expect that the big G would learn from its mistakes and get more things right second time around. No such luck: Google App Engine was a stillbirth; plagued by very similar problem as Android. Most specifically, significant portion of what SHOULD be available (given their implied goal of supporting all JDK5 pieces applicable to the context) was -- and mostly still is -- missing. And decisions again seem arbitrary and inconsistent; but probably made by different bunch of junior developers.

My specific case in point (or pet peeve) is the lack of Stax API on GAE (it is missing from white-list, which is needed to load anything within "javax." packages). It seems clear that this was mostly due to good old ignorance -- they just didn't have enough expertise in-house to cover all necessary aspects of JDK. Hey, that happens: maybe they have no XML expertise within the team; or whoever had some knowledge was busy farting around doing something else. Who knows? Should be easy to fix, whatever gave.

3. From ignorance to excuses

Ok: omission due to ignorance would be easily solved. Just add "javax.xml.stream" on the white list, and be done with that. After all, what could possibly be problematic with an API package? (we are not talking about bundling an implementation here)

But this is where things get downright comical: almost all "explanations" center around the strawman argument of "there must be some security-related issue here". I may be unfair here -- it is possible that all people peddling this excuse are non-Googlians (if so, my apologies to GAE team). But this is just very ridiculous (dare I say, retarded?) argument, because:

  1. Being but an API package, there is no functionality that could possibly have security implications (yes, I know exactly what is within those few classes -- the only actual code is for implementation discovery, which was copied from SAX), and
  2. If there are problems with implementations of the API (which should be irrelevant, but humor me here), same problems would affect already included and sanctioned packages (SAX, DOM, JAXP, bundled Xerces implementation of the same)

Perhaps even worse, these "explanations" are served by people who seem to have little idea about package in question. I could as well ask about regular expression or image processing packages it seems.

4. Misery loves company

About the only silver lining here (beyond my not having to work on GAE...) is that there are other packages that got similarly hosed (I think JAXB may be one of those; and many open source libraries are affected indirectly, including popular packages like XStream). So hopefully there is little bit more pressure in fixing these flaws within GAE.

But I so hope that other big companies would consider implementing sand-boxed "cloudy" Java environments. Too bad competitors like Microsoft and Amazon tend to focus on other approaches: both doing "their own things", although those are very different from each other (Microsoft with their proprietary technology; Amazon focusing on offering a low-level platform (EC2) and simple services (S3, SQS, SWF -- simple storage, queue, workflow service -- etc), but not a managed runtime execution service).

Saturday, June 27, 2009

Woodstox, high impact factor & being #32 on Top Open Source Java libs list

Another interesting data point, this time from analysing Maven Dependency paths: "Most Referenced" list. Looks like Woodstox is quite widely used by projects that use or at least declare their dependencies using Maven: I assume magic number 1838 (which gives rank #32) could mean number of other projects depending on Woodstox. Not too shabby for an xml parser. Getting on the first result page is quite remarkable; especially considering that Woodstox ranks higher than many other worthy Java open source libraries like XStream, Hibernate, Quartz, Xalan and Velocity. And only slightly (by about 50% :-) ) trailing such ubiquitous thingy as Spring.

Although this is just one way of estimating popularity of various (Java) OS libs, it is still interesting, because it has similarities to how scientific articles are ranked (impact factor; although here weights are uniform). And also since it could lend itself to Google PageRank style extensions as well... let's see.

Wednesday, June 17, 2009

Reading DOM documents using Stax XML parser, StaxMate

One of the new features of StaxMate 2.0 is the ability to read DOM Documents (given a plain old Stax XMLStreamReader), and write DOM documents (using a Stax XMLStreamWriter). This is something no Stax parser (no, not even Woodstox!) provides, since it is in the "reverse" direction of what a Stax implementation could support (reading DOM documents as Stax streams, or directing output of a stream writer into a DOM document).

Functionality for converting to/from DOM is contained in class org.codehaus.staxmate.dom.DOMConverter.

To read DOM documents, you do:

  FileInputStream in = new FileInputStream("input.xml");
  XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(in);
  // ... then do whatever processing (if any), and point to START_ELEMENT
  // (or leave at START_DOCUMENT: that'll work too)
  Document doc = new DOMConverter().buildDocument(sr);
  in.close();

and to write DOM document:

  FileOutputStream out = new FileOutputStream("output.xml");
  XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
  // and output stuff, if need be...
  new DOMConverter().writeDocument(doc, sw);
  sw.close();
  out.close();

Ok, so you can do it but why would you? Most commonly this is useful when there is a need to use tree-based processing tools like XSL transformers, or to access content using XPath. The ability to build smaller documents from sub-trees is crucial to limit memory usage and thereby improve performance (or make such usage possible at all).
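To illustrate the XPath use case without any extra libraries, here is a sketch that builds a DOM tree from a stream reader and then runs an XPath expression on it. To be clear, this does not use StaxMate's DOMConverter: as a stand-in it uses the JDK's identity Transformer with a StAXSource, and the class and method names are invented.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// JDK-only stand-in for "build DOM from a Stax reader, then use tree-based tools"
class StaxToDom {
    // Reader is expected to point to START_DOCUMENT (or START_ELEMENT)
    static Document build(XMLStreamReader sr) throws Exception {
        DOMResult result = new DOMResult();
        // An identity transform copies the stream's events into a new DOM tree
        TransformerFactory.newInstance().newTransformer().transform(new StAXSource(sr), result);
        return (Document) result.getNode();
    }

    static String xpath(String xml, String expression) throws Exception {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            Document doc = build(sr);
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } finally {
            sr.close();
        }
    }
}
```

With either approach, the stream reader gets you to the interesting part of a large document cheaply, and only that part needs to be held in memory as a tree.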

So far this interoperability support is still quite limited; but with little bit of encouragement, following future features could be implemented:

  • Similar functionality for building JDOM trees (code actually exists, in old Woodstox "stax-utils" package, and just needs cleaning up), and perhaps XOM, DOM4j. (for XOM, there is already NUX, however, that covers the use case)
  • Ability to directly bind things straight via StaxMate input cursors and output objects. This is an obvious improvement -- the main reason current functionality operates on "raw" Stax objects is just that code to do so existed; to use StaxMate objects, little bit more work is needed to ensure proper synchronization. One nicety from doing this would be ability to filter out non-text/non-element nodes (comments).

As usual, feel free to comment on this functionality, or join StaxMate mailing lists. I will also incorporate these code samples in StaxMate documentation page(s).

Tuesday, June 09, 2009

Faster, XML, Faster!

It appears that FasterXML (the commercial support organization behind Jackson, Woodstox, StaxMate and Aalto) is debuting on the Seattle Startup Scene: according to this survey, it is close to breaking into the hotly contested Northwest Startups Top-300 list. :-)
In fact, one of our fellow up-and-comers, MarketOutsider (hi Bryce!) is within our sight with ranking north of 300 limit.

One of the important next steps will be figuring out exact details of licensing for Aalto -- it is something that actually has lots of potential, even if it is a bit of an uncut diamond right now. Its asynchronous (non-blocking) parsing specifically should be very useful for high-concurrency (thousands of concurrent connections) use cases. And being 2x as fast as Woodstox (essentially, as fast as fast C XML parsers!) is nice as well. Shaving off CPU cycles pays off if you pay by cycle (think EC2).

And beyond that, it would be good to get to build some of actual new products, from Hadoop-on-S3 processing systems to plug-n-play database front-end web services. And of course all the momentum Jackson has: maybe it'll work nicely with GWT in near future.
But more on these things when plans inch forward.

Sunday, April 12, 2009

JSON vs XML: confessions of a JSON advocate

First of all: before getting into the issue here, let me just say that this hurts me more than you (not to worry, there's no spanking for anyone). This is because I am about to confess some misgivings I am having with my favorite data format, JSON.

So, here I am: a fan of JSON as a data format. The problem is not that I don't like what JSON is and has. The problem is with the things that it does not have. And I am almost ashamed to admit it, but many of them are found in -- GASP! -- xml. Yes, it is a bit sacrilegious to admit this. But it is true: while nothing that is in JSON is bad per se, there are things omitted that should not be.

So here's my brief post-Festivus Airing of Grievances regarding JSON.

1. Comments are not really optional for a textual data format!

Enough said. Every textual data format should have a way to embed human-readable unstructured notes, injectable by humans as well as systems transforming or generating content. I often use XML comments to include information about time when a document was generated, or to contain simple debug information. This is very handy, and harmless for automated processing as it can (and should) just ignore such comments. And for actual processing hints there are also XML processing instructions to use, likewise ignorable by processors who don't care about them.

JSON format almost had its comments, too: an earlier (pre-RFC) version actually did include comments: C and C++ styles I think (i.e. ones that Javascript uses).

Why they were removed is beyond me, and is in my opinion the biggest mistake made in specification. Comments just should be available.

2. Sometimes redundancy is Useful: case of Elements vs Attributes

(or, "Data is Lonely without Metadata")

It may be confusing to have 2 somewhat overlapping dimensions in XML: that of structured (nested) child elements, and unstructured element attributes. But there is one practical and useful way to separate the two: think of elements and their textual content as actual data, and attributes as metadata (for element data). This simple separation works surprisingly well; and is a useful distinction for use cases like data binding.

For example: type of an object can be stored in a type attribute (like, say, "xsi:type"), and field values commonly as child elements. Or store all identifiers as id attributes (like generic "xml:id" as per Xml:id specification), separate from data contained as elements and textual values stored in elements. But useful for adding references to the element sub-trees.

JSON has no such facility, so any metadata has to be either in-line mixed with data, or structured as siblings. Initially this may not seem like a big deal, but it gets confusing pretty quickly in practice.

So why doesn't this matter with actual (Java) Objects? Isn't JSON more "object oriented", being an object notation, not a markup language? Well, Java Objects DO have metadata that is orthogonal to data (object state, i.e. its member fields)! What else is class information than metadata, separate from actual data? All that typing -- both class declarations, and runtime Object types -- is metadata, not data; similarly for all method information. And most obviously the latest addition to class metadata, Java annotations, is pure orthogonal metadata. It is not a perfect analogy (class info is per-class, like static members and methods; whereas actual data is per-instance), but it indicates the need for a place for both data and metadata.

3. As Simple as Possible, but No Simpler

Although both of above paragraphs could be repeated here -- as in JSON being simplified beyond reasonable, by omitting comments -- there is more.

For example: unquoted linefeeds are not allowed within JSON String values; linefeeds must be quoted just like other control characters. This is Bad. Why are they not allowed to be included as is, given how common they are in text? I suspect it was done in an effort to make it easier to "parse" JSON, by allowing single-line regexps to work. But I don't care -- if I parse something, I do it properly. Regexps alone do not a parser make (they make a lexer, useful and used by parsers, but not a parser). Linefeeds are displayable characters just like anything else. It's quite ok to let them be used within String values: after all, they are often needed there. So why force quoting them, even though they are not used as separators?
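In practice this means that anything that writes JSON String values has to do something like the following. This is a hand-rolled sketch for illustration only (names are made up; a real JSON generator handles this for you):

```java
// Minimal sketch of quoting a Java String as a JSON String value
class JsonStrings {
    static String quote(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break; // linefeed may not appear as is
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // all other control characters need the 4-digit hex form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("line1\nline2"));
    }
}
```

So a two-line text value always ends up on a single physical line in the JSON document, which is exactly what makes hand-editing longer text values painful.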

There are also things that I think are good or at least acceptable riddances: for example, while it is often useful to have choice of quotes in xml (single or double quotes), I'm not crying after loss of apostrophes. I could write a parser that handles multiple kinds of String value markers; but I can also generate content using just one kind. But it does complicate hand-writing and modifying content.

4. Is Ordering really irrelevant?

In XML content order is mostly significant; the only exception being attributes, which are unordered. This makes some parts of data binding more challenging, because objects usually have no concept of ordering for properties. Because of this there are many legal, easily definable XML structures that can not easily be mapped to (Java) objects.

But while sometimes problematic, ordering can also be valuable. For example, it is great that it is possible to guarantee that certain elements (like, say, "header") come before others (like, say, "footer"). The only conceptually correct way to do this in JSON is to use Lists (aka Arrays). But their values are anonymous, unlike those of Maps. Alternatively it is possible for JSON processors to preserve actual physical ordering; but the problem is that not all processors will do this; not least because the specification discourages this.

And the most obviously useful ordering is that the metadata (attributes) always precedes data (elements). That is something you can count on; and for common types of metadata (those class types and identifiers, see above), this is pretty optimal arrangement.
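Here is a small sketch of why that ordering is convenient for streaming processing: with Stax, attributes are available as soon as the reader is at START_ELEMENT, so the code can decide how to handle the content before reading any of it. The "type" attribute and its values are made up for this example (a stand-in for something like "xsi:type").

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Metadata (attribute) is seen first, and decides how data (text) is handled
class TypedValueReader {
    static Object readValue(String xml) throws XMLStreamException {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            sr.nextTag(); // to START_ELEMENT: attributes are accessible here
            String type = sr.getAttributeValue(null, "type"); // hypothetical type marker
            String text = sr.getElementText(); // only now is the actual data read
            if ("int".equals(type)) {
                return Integer.valueOf(text.trim());
            }
            if ("boolean".equals(type)) {
                return Boolean.valueOf(text.trim());
            }
            return text; // no (known) type information: plain String
        } finally {
            sr.close();
        }
    }
}
```

With JSON, a streaming reader gets no such guarantee: the field that carries type information may come after the data it describes, in which case the data has to be buffered first.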

5. Other problems?

One thing of interest regarding the list above is that none of the items is a commonly stated reason by those who advocate using XML over JSON.

Conversely, I think that most commonly used reasons are very poor excuses of arguments; usually based on fundamental misunderstanding of actual benefits of XML, or good use cases for either XML or JSON. Perhaps I should collect list of such claims to shoot them down next. :-)

About me

  • I am known as Cowtowncoder