Tuesday, January 16, 2007

Woodstox 3.2 released

Ok, so this is not exactly fresh news, as the release happened last year (on December 28th). But better late than never.

1. What's new with Woodstox 3.2?

Since this is an incremental ("minor") release, it is fully functionally compatible (minus bug fixes) with the earlier 3.0 and 3.1 releases. The only major new feature is the implementation of the SAX2 interface. But that is a significant addition, due to its potential for legacy integration: Woodstox can now serve your SAX as well as Stax needs. The SAX implementation is expected to be highly compatible, thanks to extensive testing: it passes both the full Nux regression test suite and SAXTest, the latter with an over 99% success rate.
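Since the SAX2 support plugs into the standard JAXP machinery, existing SAX code should not need changes to use it. Here is a minimal sketch using only the standard `javax.xml.parsers` API; whichever SAX2 implementation JAXP finds gets used, which with the Woodstox 3.2 jar present and registered can be Woodstox (the class and element names below are just illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parses the given XML and returns the element names seen, in document order.
    public static String elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder names = new StringBuilder();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (names.length() > 0) names.append(' ');
                names.append(qName);
            }
        });
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<root><child/></root>")); // prints "root child"
    }
}
```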

The only other piece of new functionality is the addition of the property WstxOutputProperties.P_OUTPUT_ESCAPE_CR (defaults to true), which can be used to enable or disable escaping of \r in output.
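A minimal sketch of toggling the property follows. The property-name string used here is an assumption (normally you would reference the `WstxOutputProperties.P_OUTPUT_ESCAPE_CR` constant from the Woodstox jar directly), and the code degrades gracefully when the factory at hand is not Woodstox:

```java
import javax.xml.stream.XMLOutputFactory;

public class EscapeCrDemo {
    // Assumed string value of WstxOutputProperties.P_OUTPUT_ESCAPE_CR;
    // prefer referencing the constant itself when the Woodstox jar is available.
    static final String P_OUTPUT_ESCAPE_CR = "com.ctc.wstx.outputEscapeCr";

    // Returns true if the factory recognized (and accepted) the property.
    public static boolean disableCrEscaping(XMLOutputFactory f) {
        if (f.isPropertySupported(P_OUTPUT_ESCAPE_CR)) {
            f.setProperty(P_OUTPUT_ESCAPE_CR, Boolean.FALSE);
            return true;
        }
        return false; // not a Woodstox factory; it will apply its own escaping rules
    }

    public static void main(String[] args) {
        XMLOutputFactory f = XMLOutputFactory.newInstance();
        System.out.println("escape-CR disabled: " + disableCrEscaping(f));
    }
}
```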

2. Faster Output

In addition to improved interoperability via the SAX interface, the other major improvement is on the output side. Not only were all outstanding output-side bugs fixed (most of the fixes were backported to the 3.1 and 3.0 maintenance branches), but there were significant performance improvements as well. Big thanks to the Axis2 folks for their great suggestions!

3. Other fixes

And last but not least, most outstanding issues were also resolved: from incorrect handling of the base document reference for external entity expansion, to minor problems in Location (line number, character offset) updates occurring with larger documents. For the full list, check out the Woodstox Jira.

4. Next Steps

So what's in store next for Woodstox? At this point, some API changes are needed to move things forward, as well as to allow some obsolete features to be removed. So there will probably not be a 3.3 release before 4.0.

Another short-term task is to spin off the StaxMate project: since it has been approved as a new top-level Codehaus project, it can "leave the nest" and move to its own Subversion repository. More news regarding this will be forthcoming in the near future (in a few weeks?). Stay tuned!

Monday, January 01, 2007

StaxMate based web service, Results, Part 2

(continuing the story of StaxMate based simple web service, appendix B)

Ok, so it took a bit longer than just a couple of days to test out a more realistic scenario. This was not due to the complexity of the task itself; the results were in more than a month ago... but editorial time was limited during the latter part of last year. ;-)

But back to business. The setup used for the second set of tests is simple: the server is still the same oldish PC, and the client is a somewhat newer 3 GHz Windows XP (Home Edition; a bad choice, but more on this later) box. The connection is 100 Mbps switched Ethernet, through a cheap 4-port Netgear switch. On the Jetty side, traditional (blocking) I/O is used, since it seems to have higher throughput for this use case.

1. Baseline

To keep the results comparable with the earlier ones, let's start by using the random-number-based UUID generation method, with 10 client request threads (i.e. up to 10 parallel requests at any given point in time).

This gives us a baseline of about 2380 requests per second. So far so good: this is almost 3 times as much as in the local test case, which makes sense considering that the client- and server-side tasks are both reasonably CPU intensive.

2. Generation method, Random vs. Time

Since time-based generation had lower overhead in the local test case, it would seem reasonable to expect some increase in throughput. And this is in fact the case: throughput rises to 2680 rps. It is worth noting that this is a relatively larger increase; and once again it makes sense, considering that the whole test is now dominated by server-side processing, whereas the earlier test also included client-side processing, which is equally fast for all generation methods.
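For illustration, here is a hypothetical micro-comparison of why time-based generation tends to be cheaper: `UUID.randomUUID()` goes through a SecureRandom, whereas a time-based scheme mostly reads a clock and bumps a counter. The `timeBased()` method below is a naive stand-in, not a proper RFC 4122 version-1 generator (it omits the node address and clock-sequence handling a real generator, such as JUG, implements):

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class UuidCost {
    private static final AtomicInteger SEQ = new AtomicInteger();

    // Naive time-based sketch (NOT RFC 4122 version 1): millisecond timestamp
    // in the high bits, a per-process sequence counter in the low bits.
    public static UUID timeBased() {
        long hi = System.currentTimeMillis();
        long lo = SEQ.incrementAndGet();
        return new UUID(hi, lo);
    }

    public static void main(String[] args) {
        int n = 100_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) UUID.randomUUID();  // SecureRandom-backed
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) timeBased();        // clock read + counter
        long t2 = System.nanoTime();
        System.out.printf("random: %d ms, time-based: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```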

3. Number of UUIDs generated per request

Using time-based generation as the new baseline (as with the earlier test), let's see how batch size (the number of UUIDs generated per request) affects results. As before, generating more UUIDs takes longer, both due to the actual generation and due to increased network overhead for longer response messages:

  • For 10 UUIDs, throughput is a bit lower, 2250 rps (for a response length of 598 bytes)
  • For 100 UUIDs, throughput drops much more, down to 800 rps (response length of 5278 bytes).

Results are once again pretty much what we expected, although longer response messages have an even more profound effect here. This would probably be even more significant over wide-area networks.

One thing that seemed worth investigating at this point was whether the increased request latency of the second test case could be alleviated by using more parallel requests. Testing with 20 concurrent request threads showed this not to be the case: throughput did not increase at all, even in this case. Perhaps the latencies involved are still negligible (it is a high-speed LAN connection, and even 6 kB messages only need 4 - 5 segments to transmit).

4. Parallelism (thread count)

Now that we have the basics covered, let's focus on the parallelism side. Unlike the results so far, which should reflect the results from the local test case, results with different thread counts could display different patterns. This is because there is somewhat less threading overhead (no client-side processing running on the server machine), and because there is a bit more latency due to actual network connections. So, let's see the actual measured throughputs (in requests per second) for various client thread counts (using time-based UUID generation, 1 UUID per request, BIO):

  • 1 thread: 1750
  • 2 threads: 2580
  • 4 threads: 2950
  • 8 threads: 2750
  • 20 threads: 2300 (but with very high fluctuation)
  • 50 threads: 1500 (even heavier fluctuation, 1000 - 2000)
  • 100 threads: 1400 (heavy fluctuation, 1000 - 1900)

So, with this setup, a little bit of parallelism does help: about 4 client threads achieves maximum throughput. Throughput remains quite high for a few more threads, but after 20 threads or so, the overhead starts to take its toll. Interestingly, the fluctuation in throughput also increases, making it harder to really pinpoint the throughput (which could be resolved by a more thorough statistical approach). This fluctuation is not entirely intuitive: while lower throughput makes some sense, its instability does not. Perhaps it is just a symptom of the problems that the statistics-gathering thread has, due to less reliable timing.
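The measurement loop itself can be sketched roughly as follows. This is a generic stand-in harness (the names and the in-process `request` callback are hypothetical, not the actual test client used for the numbers above): a fixed pool of client threads hammers the service for a fixed wall-clock interval, and throughput is the completed-request count divided by elapsed time:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputHarness {
    // Measures completed requests per second for a given number of client
    // threads; 'request' stands in for one HTTP round trip to the service.
    public static double measure(int threads, long millis, Runnable request)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong done = new AtomicLong();
        final long end = System.currentTimeMillis() + millis;
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (System.currentTimeMillis() < end) {
                    request.run();
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(millis + 1000, TimeUnit.MILLISECONDS);
        return done.get() * 1000.0 / millis;
    }

    public static void main(String[] args) throws InterruptedException {
        // In-process stand-in workload: generate one UUID per "request".
        for (int t : new int[] {1, 2, 4, 8}) {
            double rps = measure(t, 500, () -> java.util.UUID.randomUUID());
            System.out.printf("%d threads: %.0f ops/sec%n", t, rps);
        }
    }
}
```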

5. GET vs. POST

But what about the effect of using POST requests instead of GET? It turns out that (as guessed in the previous blog entry) the effect is minor on the server side: throughput with 4 client threads drops only by 100 requests per second, to 2850 rps.

6. HTTP 1.1, persistent connections

Another important HTTP property is the ability to use persistent connections (which in the earlier article was incorrectly referred to as pipelining; pipelining is unfortunately not implemented by most Java HTTP client packages, whereas basic connection reuse, aka persistence, is). It had a very significant effect on the local test case, so it seemed likely to have an effect on this test case as well.

And an effect it has: for the baseline 4 client connections, disabling persistent connections lowers throughput by about two thirds, down to 920 (!) rps. And just to see whether the effect was mostly due to increased latency (from having to keep re-opening connections), another test was done with 8 client threads, with similar results (down to 850 rps, from the earlier 2750). So this appears to be mostly pure processing overhead, even before considering network effects.
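With the standard `java.net.HttpURLConnection` client, connection reuse can be toggled as sketched below; the `http.keepAlive` system property is the documented global switch, and a per-request `Connection: close` header achieves the same for a single call:

```java
import java.net.HttpURLConnection;

public class KeepAliveToggle {
    // Globally disable connection reuse for java.net.HttpURLConnection;
    // must be set before the first connection is opened.
    public static void disableKeepAlive() {
        System.setProperty("http.keepAlive", "false");
    }

    // Per-request alternative: ask the server to close after this response.
    public static void closeAfterResponse(HttpURLConnection conn) {
        conn.setRequestProperty("Connection", "close");
    }

    public static void main(String[] args) {
        disableKeepAlive();
        System.out.println("http.keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```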

But there was one strange thing: after running for about a minute, connections started failing, with error messages indicating something about being unable to establish connections. After investigating the problem for a while, I learned that it was most likely caused by the default Windows XP (non-Professional edition) setting for the number of available so-called ephemeral ports. The default setting is ridiculously low, but can fortunately be changed. However, documentation indicates that although this setting can be changed, the Home edition does suffer from a few networking-related limitations, which may well render it less than useful for testing high-throughput web applications and services.
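For reference, the limit in question is controlled by the MaxUserPort registry value (the upper bound of the ephemeral port range, which defaults to 5000), together with the related TcpTimedWaitDelay (how long closed sockets linger in TIME_WAIT and keep their ports reserved). A sketch of the relevant registry entries; the specific values here are illustrative, not a recommendation:

```
; HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
; raise the upper bound of the ephemeral port range (default 5000, max 65534)
"MaxUserPort"=dword:0000fffe
; shorten TIME_WAIT to 30 seconds (default 240), illustrative value
"TcpTimedWaitDelay"=dword:0000001e
```

A reboot is needed for these TCP/IP parameters to take effect.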

Now, it is also possible that the above-mentioned problem affected the results for this test case. But it seems likely that even if the absolute throughput numbers might be higher on an OS with more robust networking (like Linux, the BSDs, or Mac OS X), disabling persistent connections would still have a significant negative effect on throughput.

7. NIO

At this point, the last significant performance trade-off involves choosing between Jetty's traditional blocking I/O (BIO) listener and the newer asynchronous, NIO-based one (many other containers offer similar choices).

But it turned out to be difficult to test the NIO-based listener with this combination. For some reason, even after resolving the Windows-related problems (which reappeared with this test), the baseline throughput was much lower than expected. So much so that I went ahead and re-ran some of the earlier tests, only to find that those baselines were reduced as well. This was peculiar: almost as if the network switch was having problems (as if it had negotiated a 10 Mbps link speed instead of 100 Mbps, or some such).

So, unfortunately, I couldn't get a fair comparison between NIO and BIO for this setup. I would like to revisit this test some time in the future, but for now I assume that the performance differences would be along similar lines as in the local test case. That is, BIO would slightly outperform NIO, but the difference would not be huge.

8. Next Steps?

At this point, I think I have done enough testing of this little web service. While it would be nice to figure out what the absolute highest throughput would be (after all, in an earlier test it took more than one dedicated client to saturate the server; check the blog history for "Maximum TPS 10k" or such), it's time to move on to other things related to efficient message and XML processing.

Since it has been a while since I last wrote about Woodstox and StaxMate, and especially since there have been interesting new developments, I will probably blog a bit about these two things next.

About me

  • I am known as Cowtowncoder