Wednesday, October 28, 2009

Data Format anti-patterns: converting between secondary artifacts (like xml to json)

One commonly asked but fundamentally flawed question is "how do I convert xml to json" (or vice versa).
Given the frequency with which I have encountered it, it probably ranks high on the list of data format anti-patterns.

And just to be clear: I don't mean that there is any problem in having (or wanting to have) systems that produce data using multiple alternative data formats (views, representations). Quite the contrary: the ability to do so is at the core of REST(-like) web services, which are one useful form of web services. Rather, I think it is wrong to convert between such representations.

1. Why is it an Anti-pattern?

Simply put: you should never convert from one secondary (non-authoritative) representation into another such representation. Rather, you should render your source data (which usually lives in a relational model, or in objects) into such secondary formats. So: if you need XML, map your objects to XML (using JAXB or XStream or whatever you have); if you need JSON, map them using Jackson. And ditto for the reverse direction: read XML or JSON back into your objects, not into the other format.
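To make that concrete, here is a minimal sketch of rendering one authoritative object into both formats. It uses Python's standard library only so that it runs on its own; in Java the tools named above would play the same role. The Customer type and the helper names (to_json, to_xml, from_json) are made up for illustration, and the XML layout (one child element per field) is just one possible choice.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class Customer:
    """Authoritative in-memory representation (hypothetical example type)."""
    id: int
    name: str
    active: bool


def to_json(customer: Customer) -> str:
    # Render JSON straight from the object, not from another serialized form.
    return json.dumps(asdict(customer))


def to_xml(customer: Customer) -> str:
    # Render XML straight from the same object. The layout (one child
    # element per field) is an assumption made for this sketch.
    root = ET.Element("customer")
    for field_name, value in asdict(customer).items():
        child = ET.SubElement(root, field_name)
        # Write booleans in lowercase, as is conventional in XML.
        child.text = str(value).lower() if isinstance(value, bool) else str(value)
    return ET.tostring(root, encoding="unicode")


def from_json(text: str) -> Customer:
    # Reverse direction: JSON goes back into the object, never into XML.
    return Customer(**json.loads(text))


customer = Customer(id=7, name="Ann", active=True)
json_view = to_json(customer)
xml_view = to_xml(customer)
```

Note that neither view is ever derived from the other: both are produced from the Customer object, so a quirk of one format (say, XML attributes versus elements) can never leak into the other.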

This of course implies that there are cases where such a transformation might make sense: namely, when your data storage format is itself XML (native XML DBs) or JSON (CouchDB), because then that format is the authoritative source rather than a secondary representation. In those cases you just have to worry about the practical problem of model/format impedance mismatch, similar to what happens when doing Object-Relational Mapping (ORM).

2. OK: the simple case is simple, but how about multiple mappings?

Sometimes you do need multi-step processing; for example, if your data lives in a database. Following my earlier suggestion, it would seem like you should convert directly from the relational model (storage format) into the resulting transfer format (JSON or XML). Ideally, yes: if such converters exist. But in practice it is more likely that a two-phase mapping (ORM from database to objects, and then from objects to XML or JSON) works better: mostly because there are good tools for the separate phases, but fewer that would do the end-to-end rendition.

Is this wrong? No. To understand why, it is necessary to understand the three classes of formats that we are talking about:

  • Persistence (storage) format, used for storing your data: usually the relational model, but can be something else as well (objects for object DBs; XML for native XML databases)
  • Processing format: objects or structs of your processing language (POJOs for Java) that you use for actual processing. Occasionally this can also be something more exotic, like XML when using XSLT (or relational data for complicated reporting queries)
  • Transfer format: serialization format used to transfer data between end points (or sometimes for time-shifting, such as saving state over a restart); may be closely bound to the processing format (as is the case for Java serialization)

So what I am really saying is that you should not transform within a class of formats; in this case, between two alternative transfer formats. It is acceptable (and often sensible) to do conversions between classes of formats; and sometimes doing two transforms is simpler than trying to do one bigger one. Just not within a class.
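The two-phase mapping can be sketched roughly as follows, again with Python's standard library only. An in-memory SQLite database stands in for the persistence format, a plain dataclass for the processing format, and JSON for the transfer format. The orders table, the Order type and the function names are all invented for this example; a real system would more likely use an ORM for the first phase.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Order:
    """Processing format: a plain object (hypothetical example type)."""
    id: int
    item: str
    quantity: int


def load_orders(conn: sqlite3.Connection) -> List[Order]:
    # Phase 1: persistence format (relational rows) -> processing format (objects).
    rows = conn.execute("SELECT id, item, quantity FROM orders ORDER BY id")
    return [Order(*row) for row in rows]


def orders_to_json(orders: List[Order]) -> str:
    # Phase 2: processing format (objects) -> transfer format (JSON).
    return json.dumps([asdict(order) for order in orders])


# In-memory database standing in for real storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders (id, item, quantity) VALUES (?, ?, ?)",
    [(1, "widget", 3), (2, "gadget", 1)],
)

orders = load_orders(conn)
payload = orders_to_json(orders)
conn.close()
```

Each step crosses a boundary between classes of formats (storage to processing, processing to transfer). At no point is one transfer format turned into another.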

3. Three Formats may be simpler than Just One

One more thing about the above-mentioned three formats: there is also a related fallacy of thinking that there is a problem if you are using multiple formats/models (like the relational model for storage, objects for processing and XML or JSON for transfer). The assumption is that the additional transformations needed to convert between representations are wasteful enough to be a problem in and of themselves. But it should be rather obvious why there are often distinct models and formats in use: because each is optimal for a specific use case. The storage format is good for, gee, storing data; the processing model is good for efficiently massaging data; and the transfer format is good for piping it through the wire. As long as you don't add gratuitous conversions in-between, transforming on the boundary is completely sensible; especially considering the alternative of trying to find a single model that works for all cases. One only needs to consider the case of the "XML for everything" cluster (esp. XML for processing, aka XSLT) to see why this is an approach that should be avoided (or Java serialization as a transfer format, which is another anti-pattern in and of itself).

About me

  • I am known as Cowtowncoder