Monday, January 01, 2007

StaxMate based web service, Results, Part 2

(continuing the story of StaxMate based simple web service, appendix B)

Ok, so it took a bit longer than just a couple of days to test out a more realistic scenario. This was not due to the complexity of the task itself -- results were in more than a month ago -- but editorial time was limited during the latter part of last year. ;-)

But back to business. The setup used for the second set of tests is simple: the server is still the same oldish PC, and the client is a somewhat newer 3 GHz Windows XP box (home edition -- a bad choice, but more on that later on). The connection is 100 Mbps switched Ethernet, through a cheap 4-port Netgear switch. And on the Jetty side, traditional (blocking) I/O is used, since it seems to have higher throughput for this use case.

1. Baseline

To keep the results comparable with the earlier ones, let's start by using the random-number-based UUID generation method, with 10 client threads (i.e. up to 10 parallel requests at any given point in time).

This gives us a baseline of about 2380 requests per second. So far so good -- this is almost 3 times as much as the local test case, which makes sense considering that the client- and server-side tasks are both reasonably CPU intensive, and no longer compete for the same CPU.
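For illustration, here is roughly what the server side does for each request -- a minimal plain-servlet sketch, with made-up class and element names; the real service uses StaxMate for XML output:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.UUID;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical servlet: generates one random-based UUID per GET request.
    // The real service uses StaxMate for XML output; element names are guesses.
    public class UuidServlet extends HttpServlet
    {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException
        {
            resp.setContentType("text/xml");
            PrintWriter pw = resp.getWriter();
            pw.print("<response><uuid>");
            pw.print(UUID.randomUUID().toString()); // random-based (version 4) UUID
            pw.print("</uuid></response>");
            pw.flush();
        }
    }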

2. Generation method, Random vs. Time

Since time-based generation had lower overhead for the local test case, it seems reasonable to expect some increase in throughput. And this is in fact the case: throughput rises to 2680 rps. It is worth noting that this is a relatively larger increase than before; once again this makes sense, considering that the whole test is now dominated by server-side processing, whereas the earlier test also included client-side processing, which is equally fast for all generation methods.
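To see why time-based generation is cheaper, consider what each method has to do: a random-based (version 4) UUID needs 128 bits from a SecureRandom, whereas a time-based (version 1) UUID is mostly just bit-shuffling of a timestamp. Here is an illustrative sketch using plain java.util.UUID -- not the actual generator code the service uses:

    import java.util.UUID;

    public class UuidCost
    {
        // Offset from the Unix epoch to the UUID epoch (1582-10-15), in milliseconds
        private static final long UUID_EPOCH_OFFSET_MSECS = 12219292800000L;

        // Illustrative version-1 style construction: a real generator also has
        // to guarantee uniqueness within one clock tick and pick proper node bits
        // (MAC address or random), but the basic work is just shifting bits around.
        public static UUID timeBased(long clockSeq, long node)
        {
            long ts = (System.currentTimeMillis() + UUID_EPOCH_OFFSET_MSECS) * 10000L; // 100 ns units
            long msb = ((ts & 0xFFFFFFFFL) << 32)       // time_low
                     | (((ts >>> 32) & 0xFFFFL) << 16)  // time_mid
                     | 0x1000L                          // version 1
                     | ((ts >>> 48) & 0x0FFFL);         // time_hi
            long lsb = 0x8000000000000000L              // IETF variant
                     | ((clockSeq & 0x3FFFL) << 48)
                     | (node & 0xFFFFFFFFFFFFL);
            return new UUID(msb, lsb);
        }

        public static void main(String[] args)
        {
            // Random-based: backed by SecureRandom, hence the higher per-call cost
            System.out.println("random: " + UUID.randomUUID());
            System.out.println("time:   " + timeBased(0x1234L, 0xC0FFEE123456L));
        }
    }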

3. Number of UUIDs generated per request

Using time-based generation as the new baseline (as with the earlier test), let's see how batch size (number of UUIDs generated per request) affects results. As before, generating more UUIDs takes longer, both due to the actual generation and due to increased network overhead for longer response messages:

  • For 10 UUIDs, throughput is a bit lower, 2250 rps (for a response length of 598 bytes)
  • For 100 UUIDs, throughput drops much more, down to 800 rps (response length of 5278 bytes).

Results are once again pretty much what we expected, although longer response messages have an even more profound effect than before. This would probably be even more significant over wide-area networks.
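For reference, the way such batched responses get written can be sketched with plain StAX (the actual service uses StaxMate, and the element names here are guesses). Response size grows more or less linearly with the number of UUIDs, which matches the 598 vs. 5278 byte figures above:

    import java.io.OutputStream;
    import java.util.UUID;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;

    public class UuidResponseWriter
    {
        // Writes a response containing 'count' UUIDs; each additional UUID
        // element adds roughly 50 bytes to the response.
        public static void writeResponse(OutputStream out, int count)
            throws XMLStreamException
        {
            XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            sw.writeStartDocument();
            sw.writeStartElement("response");
            for (int i = 0; i < count; ++i) {
                sw.writeStartElement("uuid");
                sw.writeCharacters(UUID.randomUUID().toString());
                sw.writeEndElement();
            }
            sw.writeEndElement();
            sw.writeEndDocument();
            sw.close();
        }
    }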

One thing that seemed worth investigating at this point was whether the increased request latency of the second test case could be alleviated by using more parallel requests. Testing with 20 concurrent request threads showed this not to be the case: throughput did not increase at all. Perhaps the latencies involved are still negligible (it is a high-speed LAN connection, and even the larger (5278-byte) messages only need 4 - 5 TCP segments to transmit).

4. Parallelism (thread count)

Now that we have the basics covered, let's focus more on the parallelism side. Unlike the results so far, which should mirror those from the local test case, results with different thread counts could display different patterns. This is because there is somewhat less threading overhead (the client processes no longer run on the server machine), and because there is a bit more latency due to actual network connections. So, let's see the measured throughputs for various client thread counts (using time-based UUID generation, 1 UUID per request, BIO):

  • 1 thread: 1750 rps
  • 2 threads: 2580 rps
  • 4 threads: 2950 rps
  • 8 threads: 2750 rps
  • 20 threads: 2300 rps (but with very high fluctuation)
  • 50 threads: 1500 rps (even heavier fluctuation, 1000 - 2000)
  • 100 threads: 1400 rps (heavy fluctuation, 1000 - 1900)

So, with this setup, a little bit of parallelism does help out: about 4 client threads achieves maximum throughput. Throughput remains quite high for a few more threads, but after 20 threads or so, overhead starts to take its toll. Interestingly, the fluctuation in throughput also increases, making it harder to really pinpoint throughput (which could be resolved by a more thorough statistical approach). This fluctuation is not entirely intuitive: while lower throughput makes some sense, its instability does not. Perhaps it is just a symptom of problems the statistics-gathering thread has due to less reliable timing under heavy load.
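To make the thread-count variable concrete, here is the general shape of the kind of client load driver used for these numbers -- a rough sketch only, with a hypothetical endpoint URL and a fixed measurement window, not the actual test client:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // N threads issue GET requests in a loop; a shared counter is used
    // to compute requests per second over the measurement window.
    public class LoadDriver
    {
        public static void main(String[] args) throws Exception
        {
            final int threads = Integer.parseInt(args[0]);
            final URL url = new URL("http://testserver:8080/uuid?count=1"); // hypothetical endpoint
            final AtomicLong requests = new AtomicLong();

            ExecutorService exec = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; ++i) {
                exec.execute(new Runnable() {
                    public void run() {
                        byte[] buf = new byte[4000];
                        try {
                            while (!Thread.currentThread().isInterrupted()) {
                                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                                InputStream in = conn.getInputStream();
                                while (in.read(buf) >= 0) { } // drain the response
                                in.close();
                                requests.incrementAndGet();
                            }
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            long start = System.currentTimeMillis();
            Thread.sleep(30000L); // measure for 30 seconds
            exec.shutdownNow();
            exec.awaitTermination(5, TimeUnit.SECONDS);
            double secs = (System.currentTimeMillis() - start) / 1000.0;
            System.out.println("Throughput: " + (requests.get() / secs) + " requests/sec");
        }
    }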

5. GET vs. POST

But what about the effect of using POST requests instead of GET? It turns out that (as guessed in the previous blog entry) the effect is minor on the server side: throughput with 4 client threads drops by only about 100 rps, to 2850 rps.
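For completeness, switching the client from GET to POST with the JDK's HttpURLConnection looks roughly like this (the request body shown is a made-up form parameter, not necessarily what the service expects):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PostExample
    {
        // Opens a POST request with a small form-encoded body
        public static HttpURLConnection openPost(URL url) throws IOException
        {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            OutputStream out = conn.getOutputStream();
            out.write("count=1".getBytes("UTF-8")); // hypothetical form parameter
            out.close();
            return conn;
        }
    }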

6. HTTP 1.1, persistent connections

Another important HTTP feature is the ability to use persistent connections (which the earlier article incorrectly referred to as pipelining -- pipelining is unfortunately not implemented by most Java HTTP client packages, whereas basic connection reuse, a.k.a. persistence, is). It had a very significant effect on the local test case, so it seems likely to have an effect on this test case as well.

And an effect it has: for the baseline of 4 client threads, disabling persistent connections lowers throughput by about two thirds, down to 920 (!) rps. And just to see whether the effect was mostly due to increased latency (of having to keep re-opening connections), another test was run with 8 client threads, with similar results (down to 850 rps, from the earlier 2750). So the drop appears to be mostly due to pure processing overhead, even before considering network effects.
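How persistence gets disabled depends on the client; with the JDK's HttpURLConnection the two standard knobs are the global http.keepAlive property and a per-request Connection header (the actual test client may have used a different mechanism):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class NoKeepAlive
    {
        // Opens a connection that will not be reused for later requests
        public static HttpURLConnection openWithoutReuse(URL url) throws IOException
        {
            // Global switch: must be set before the first connection is opened
            System.setProperty("http.keepAlive", "false");

            // Per-request alternative: ask for the connection to be closed after the response
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Connection", "close");
            return conn;
        }
    }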

But there was one strange thing: after running for about a minute, connections started failing, with error messages indicating that new connections could not be established. After investigating the problem for a while, I learned that it was most likely caused by the default Windows XP (non-professional edition) setting for the number of available so-called ephemeral port numbers. The default is ridiculously low, but can fortunately be changed. However, documentation indicates that even though this setting can be changed, the home edition does suffer from a few networking-related limitations, which may well render it less than useful for testing high-throughput web applications and services.

Now, it is also possible that the above-mentioned problem affected the results for this test case. But even if the absolute throughput numbers might be higher on OSes with more capable networking stacks (like Linux, the BSDs, or Mac OS X), disabling persistent connections is still likely to have a significant negative effect on throughput.

7. NIO

At this point, the last significant performance trade-off involves choosing between Jetty's traditional blocking I/O (BIO) and newer asynchronous, NIO-based listeners (many other containers also offer similar choices).
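For reference, with Jetty 6 the BIO vs. NIO choice comes down to which connector implementation the server is configured with. Here is a minimal embedded-server sketch (class names as I recall them for Jetty 6 -- they differ in later versions, and the handler wiring is omitted):

    import org.mortbay.jetty.Connector;
    import org.mortbay.jetty.Server;
    import org.mortbay.jetty.bio.SocketConnector;
    import org.mortbay.jetty.nio.SelectChannelConnector;

    public class ServerSetup
    {
        public static Server createServer(boolean useNio) throws Exception
        {
            Server server = new Server();
            // BIO: one thread per connection; NIO: selector-based listener
            Connector connector = useNio ? new SelectChannelConnector() : new SocketConnector();
            connector.setPort(8080);
            server.addConnector(connector);
            // server.setHandler(...) -- the StaxMate-based UUID handler would go here
            return server;
        }
    }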

But it turned out to be difficult to test the NIO-based listener with this setup. For some reason, even after resolving the Windows-related problems (which reappeared with this test), the baseline throughput was much lower than expected. So much so that I went ahead and re-ran some of the earlier tests -- only to find that those baselines were reduced as well. This was peculiar: almost as if the network switch was having some problems (like having negotiated a 10 Mbps link speed instead of 100 Mbps).

So, unfortunately, I couldn't get a fair comparison between NIO and BIO for this setup. I would like to revisit this test some time in the future, but for now I assume that the performance differences are along similar lines as in the local test case: BIO slightly outperforming NIO, but not by a huge margin.

8. Next Steps?

At this point, I think I have done enough testing for this little web service. While it would be nice to figure out what the absolute highest throughput would be -- after all, with an earlier test (check the blog history for "Maximum TPS 10k" or such), it took more than one dedicated client to saturate the server -- it's time to move on to other things related to efficient message and XML processing.

Since it has been a while since I have written about Woodstox and StaxMate, and especially since there have been interesting new developments, I will probably blog a little bit about those two next.
