Thursday, March 29, 2012

Jackson 2.0: CSV-compatible as well

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released"; or, for XML support, see "Not just for JSON any more -- also in XML")

Now that I have talked about XML, it is good to follow up with another commonly used, if somewhat humble, data format: Comma-Separated Values ("CSV" for friends and foes).

As you may have guessed... Jackson 2.0 supports CSV as well, via the jackson-dataformat-csv project, hosted at GitHub.

For attention-span-challenged individuals, check out the project page: it contains a tutorial that can get you started right away.
For others, let's take a slight detour through the design, so that the additional components involved make some sense.

1. In the beginning there was a prototype

After completing Jackson 1.8, I got to one of my wishlist projects: that of being able to process CSV using Jackson. The reason for this is simple: while simplistic and under-specified, CSV is very commonly used for exchanging tabular datasets.
In fact, it (in variant forms, "pipe-delimited", "tab-delimited" etc) may well be the most widely used data format for things like Map/Reduce (Hadoop) jobs, analytics processing pipelines, and all kinds of scripting systems running on Unix.

2. Problem: not "self-describing"

One immediate challenge is the lack of information on the meaning of the data, beyond the basic division into rows and columns. Compared to JSON, for example, one does not necessarily know which "property" a value belongs to, nor the expected type of the value. All you might know is that row 6 has 12 values, expressed as Strings that look vaguely like numbers or booleans.

But then again, sometimes you do have a name mapping as the first row of the document: if so, it contains the column names. You still don't have datatype declarations, but at least it is a start.
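
For example, a simple document with such a header line might look like this (illustrative data):

  name,age,verified
  Bob,28,true
  Alice,27,false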

Ideally, any library that supports CSV reading and writing should support the commonly used variations: the optional header line (mentioned above), different separators (while the name implies just commas, other characters such as tabs and the pipe symbol are commonly used), and possibly different quoting/escaping mechanisms (some variants allow backslash escaping); we will see how these are configured a bit later.
And finally, it would be nice to expose both "raw" sequence and high-level data-binding to/from POJOs, similar to how Jackson works with JSON.

3. So expose basic "Schema" abstraction

To unify the different ways of defining a mapping between property names and columns, Jackson now supports a general concept of a Schema. While the interface itself is little more than a tag interface (to make it possible to pass an opaque, type-specific Schema instance through factories), data-format-specific subtypes can and do extend functionality as appropriate.

In the case of CSV, the Schema (use of which is optional -- more on "raw" access later on) defines:

  1. Names of columns, in order -- this is mandatory
  2. Scalar datatypes of the columns: these are coarse types, and this information is optional

Note that the reason type information is strictly optional is that when it is missing, all data is exposed as Strings; and Jackson databinding has an extensive set of standard coercions, meaning that things like numbers are conveniently converted as necessary. Specifying type information, then, can help in validating contents and possibly improve performance.
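
To see why missing type information is not a big problem, consider a hypothetical Row class: even with a schema that declares no types, non-String fields bind fine, since values like "28" and "true" are coerced by databind:

  // hypothetical POJO: no column types declared in the schema, yet
  // non-String fields still work via standard String coercions
  public class Row {
    public String name;
    public int age;        // CSV value "28" becomes the int 28
    public boolean active; // CSV value "true" becomes true
  }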

4. Constructing "CSV Schema" objects

How does one get access to these Schema objects? Two ways: build manually, or construct from a type (Class).

Let's start with the latter, using the same POJO type as in the earlier XML example:


  public enum Gender { MALE, FEMALE };

  // Note: MUST ensure a stable ordering; either alphabetic, or explicit
  // (JDK does not guarantee order of properties)
  @JsonPropertyOrder({ "name", "gender", "verified", "image" })
  public class User {
    public Gender gender;
    public String name;
    public boolean verified;
    public byte[] image;
  }

  // note: we could use std ObjectMapper; but CsvMapper has convenience methods
  CsvMapper mapper = new CsvMapper();
  CsvSchema schema = mapper.schemaFor(User.class);

or, if we wanted to do this manually, we would do (omitting types, for now):


  CsvSchema schema = CsvSchema.builder()
      .addColumn("name")
      .addColumn("gender")
      .addColumn("verified")
      .addColumn("image")
      .build();
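
If we did want to include the optional type information, the builder also accepts a column type; a sketch, assuming the coarse CsvSchema.ColumnType values shown on the project page:

  // same columns, now with (coarse) type information attached
  CsvSchema typedSchema = CsvSchema.builder()
      .addColumn("name", CsvSchema.ColumnType.STRING)
      .addColumn("gender", CsvSchema.ColumnType.STRING)
      .addColumn("verified", CsvSchema.ColumnType.BOOLEAN)
      .addColumn("image", CsvSchema.ColumnType.STRING)
      .build();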

And there is, in fact, a third source: reading the schema from the header line. I will leave the details as an exercise for readers (check the project home page).
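
For the impatient, a minimal sketch of the idea, per the project page: start with an empty schema and enable header handling (the file name here is hypothetical):

  // column names are taken from the first line of the document itself
  CsvSchema headerSchema = CsvSchema.emptySchema().withHeader();
  MappingIterator<User> it = mapper
      .reader(User.class)
      .with(headerSchema)
      .readValues(new File("UsersWithHeader.csv"));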

Usage is identical regardless of the source. Schemas can be used for both reading and writing; for writing, a schema is only mandatory if writing of the header line is requested.
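
As an aside, the format variations mentioned in section 2 are also configured on the schema; for example (a sketch, using the with-methods documented on the project page):

  // tab-separated variant of the same schema, quoting with apostrophes
  CsvSchema tabSchema = schema
      .withColumnSeparator('\t')
      .withQuoteChar('\'');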

5. And databinding we go!

Let's consider the case of reading CSV data from a file called "Users.csv", entry by entry. Further, we assume there is no header row to use or skip (if there were, the first entry would be bound from it -- there is no way for the parser to auto-detect a header row, since its structure is no different from the rest of the data).

One way to do this would be:


  MappingIterator<User> it = mapper
      .reader(User.class)
      .with(schema)
      .readValues(new File("Users.csv"));
  List<User> users = new ArrayList<User>();
  while (it.hasNextValue()) {
    User user = it.nextValue();
    // do something?
    users.add(user);
  }
  // done! (the underlying FileReader gets closed when we hit the end etc)

Assuming we wanted instead to write CSV, we would use something like the following. Note that here we DO want to add the explicit header line, for fun:


  // let's request the header line, and force use of Unix linefeeds:
  ObjectWriter writer = mapper
      .writer(schema.withHeader().withLineSeparator("\n"));
  writer.writeValue(new File("ModifiedUsers.csv"), users);

One feature that we took advantage of here is that the CSV generator basically ignores any and all array markers; meaning that there is no difference whether we write an array, a List, or just a basic sequence of objects.
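
For a single user, the resulting output might look something like this (illustrative values; the byte[] property gets written out in Base64, here the four bytes 1 through 4):

  name,gender,verified,image
  Bob,MALE,true,AQIDBA==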

6. Data-binding (POJOs) vs "Raw" access

Although full data binding is convenient, sometimes we might just want to deal with a sequence of arrays of String values. You can think of this as an alternative to the "JSON Tree Model": an untyped, primitive, but very flexible data structure.

All you really have to do is omit the schema (which changes the token sequence the parser exposes), and make sure not to enable handling of the header line.
For this, the code to use (for reading) looks something like:


  CsvMapper mapper = new CsvMapper();
  MappingIterator<Object[]> it = mapper
      .reader(Object[].class)
      .readValues("1,null\nfoobar\n7,true\n");
  Object[] data = it.nextValue();
  assertEquals(2, data.length);
  // since we have no schema, everything is exposed as Strings, really
  assertEquals("1", data[0]);
  assertEquals("null", data[1]);

Finally, note that use of raw entries is the only way to deal with data that has an arbitrary number of columns (unless you just want to add a maximum number of bogus columns -- it is ok to have fewer values than columns).

7. Sequences vs Arrays

One potential inconvenience is that by default CSV content is exposed as a sequence of "JSON" Objects. This works well if you want to read entries one by one.

But you can also configure the parser to expose the data as an Array of Objects, to make it convenient to read all the data as a Java array or Collection (as mentioned earlier, this is NOT required when writing, as array markers have no effect on generation).

I will not go into details, beyond pointing out that the configuration to enable the additional "virtual array wrapper" is:


  mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);

and after this you can bind entries as if they came in as an array: both "raw" ones (Object[][]) and typed ones (List<User> and so on).
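
A minimal sketch of the "raw" variant, using inline input for illustration:

  CsvMapper mapper = new CsvMapper();
  mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
  // with the virtual array wrapper, whole content binds in one call;
  // each row becomes one Object[] within the outer array
  Object[][] rows = mapper.readValue("1,2\n3,4\n", Object[][].class);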

8. Limitations

Compared to JSON, CSV is a more limited data format. So does this limit the usage of the Jackson CSV reader?

Yes. The main limitation is that column values need to be essentially scalar values (strings, numbers, booleans). If you do need more structured types, you will need to work around this, usually by adding custom serializers and deserializers: these can convert structured types into scalar values and back. However, if you end up doing lots of this kind of work, you may want to consider whether CSV is the right format for you.
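
As an illustration of the workaround, here is a sketch of a custom serializer for a hypothetical Point type, flattening it into a single "x;y" String column (the matching deserializer would split the String back into a Point):

  // hypothetical structured type, flattened to a single scalar column
  public class Point {
    public int x, y;
  }

  public class PointSerializer extends JsonSerializer<Point> {
    @Override
    public void serialize(Point value, JsonGenerator jgen, SerializerProvider provider)
        throws IOException {
      // write the whole Point as one scalar column value, "x;y"
      jgen.writeString(value.x + ";" + value.y);
    }
  }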

9. Test Drive!

As with all the other JSON alternatives, the CSV extension is really looking forward to more users! Let us know how things work.
