Tuesday, January 16, 2007

Woodstox 3.2 released

Ok, so this is not exactly fresh news, as the release happened last year (on December 28th). But better late than never.

1. What's new with Woodstox 3.2?

Since this is an incremental ("minor") release, it is fully functionally compatible (minus bug fixes) with the earlier 3.0 and 3.1 releases. The only major new feature is the implementation of the SAX2 interface. But that is a significant addition, due to its potential for legacy integration: Woodstox can now serve your SAX as well as Stax needs. The SAX implementation is expected to be highly compatible, thanks to extensive testing: it passes both the full Nux regression test suite and SAXTest, the latter with an over 99% success rate.
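Since the SAX2 support plugs into the standard JAXP machinery, existing SAX code should not need changes to use it. Here is a minimal sketch using only the standard `javax.xml.parsers` API; whichever SAX2 implementation JAXP finds gets used, which with the Woodstox 3.2 jar present and registered can be Woodstox (the class and element names below are just illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parses the given XML and returns the element names seen, in document order.
    public static String elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder names = new StringBuilder();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (names.length() > 0) names.append(' ');
                names.append(qName);
            }
        });
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<root><child/></root>")); // prints "root child"
    }
}
```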

The only other piece of new functionality is the addition of the property WstxOutputProperties.P_OUTPUT_ESCAPE_CR (defaults to true), which can be used to enable or disable escaping of \r in output.
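A minimal sketch of toggling the property follows. The property-name string used here is an assumption (normally you would reference the `WstxOutputProperties.P_OUTPUT_ESCAPE_CR` constant from the Woodstox jar directly), and the code degrades gracefully when the factory at hand is not Woodstox:

```java
import javax.xml.stream.XMLOutputFactory;

public class EscapeCrDemo {
    // Assumed string value of WstxOutputProperties.P_OUTPUT_ESCAPE_CR;
    // prefer referencing the constant itself when the Woodstox jar is available.
    static final String P_OUTPUT_ESCAPE_CR = "com.ctc.wstx.outputEscapeCr";

    // Returns true if the factory recognized (and accepted) the property.
    public static boolean disableCrEscaping(XMLOutputFactory f) {
        if (f.isPropertySupported(P_OUTPUT_ESCAPE_CR)) {
            f.setProperty(P_OUTPUT_ESCAPE_CR, Boolean.FALSE);
            return true;
        }
        return false; // not a Woodstox factory; it will apply its own escaping rules
    }

    public static void main(String[] args) {
        XMLOutputFactory f = XMLOutputFactory.newInstance();
        System.out.println("escape-CR disabled: " + disableCrEscaping(f));
    }
}
```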

2. Faster Output

In addition to improved interoperability via the SAX interface, the other major improvement is on the output side. Not only were all outstanding output-side bugs fixed (most of the fixes were backported to the 3.1 and 3.0 maintenance branches), but there were significant performance improvements as well. Big thanks to the Axis2 folks for their great suggestions!

3. Other fixes

And last but not least, most outstanding issues were also resolved: from incorrect handling of the base document reference for external entity expansion, to minor problems in Location (line number, character offset) updates occurring with larger documents. For the full list, check out the Woodstox Jira.

4. Next Steps

So what's in store next for Woodstox? At this point, some API changes are needed to move things forward, as well as to allow some obsolete features to be removed. So there will probably not be a 3.3 release before 4.0.

Another short-term task is to spin off the StaxMate project: since it has been approved as a new top-level Codehaus project, it can "leave the nest" and move to its own Subversion repository. More news regarding this will be forthcoming in the near future (in a few weeks?). Stay tuned!

Monday, January 01, 2007

StaxMate based web service, Results, Part 2

(continuing the story of StaxMate based simple web service, appendix B)

Ok, so it took a bit longer than just a couple of days to test out a more realistic scenario. This was not due to the complexity of the task itself; the results were in more than a month ago... but editorial time was limited during the latter part of last year. ;-)

But back to business. The setup used for the second set of tests is simple: the server is still the same oldish PC, and the client is a somewhat newer 3 GHz Windows XP (Home Edition; a bad choice, but more on this later) box. The connection is 100 Mbps switched Ethernet, through a cheap 4-port Netgear switch. On the Jetty side, traditional (blocking) I/O is used, since it seems to have higher throughput for this use case.

1. Baseline

To keep the results comparable with the earlier ones, let's start by using the random-number-based UUID generation method, with 10 client request threads (i.e. up to 10 parallel requests at any given point in time).

This gives us a baseline of about 2380 requests per second. So far so good: this is almost 3 times as much as in the local test case, which makes sense considering that the client- and server-side tasks are both reasonably CPU intensive.

2. Generation method, Random vs. Time

Since time-based generation had lower overhead in the local test case, it would seem reasonable to expect some increase in throughput. And this is in fact the case: throughput rises to 2680 rps. It is worth noting that this is a relatively larger increase; and once again it makes sense, considering that the whole test is now dominated by server-side processing, whereas the earlier test also included client-side processing, which is equally fast for all generation methods.
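For illustration, here is a hypothetical micro-comparison of why time-based generation tends to be cheaper: `UUID.randomUUID()` goes through a SecureRandom, whereas a time-based scheme mostly reads a clock and bumps a counter. The `timeBased()` method below is a naive stand-in, not a proper RFC 4122 version-1 generator (it omits the node address and clock-sequence handling a real generator, such as JUG, implements):

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class UuidCost {
    private static final AtomicInteger SEQ = new AtomicInteger();

    // Naive time-based sketch (NOT RFC 4122 version 1): millisecond timestamp
    // in the high bits, a per-process sequence counter in the low bits.
    public static UUID timeBased() {
        long hi = System.currentTimeMillis();
        long lo = SEQ.incrementAndGet();
        return new UUID(hi, lo);
    }

    public static void main(String[] args) {
        int n = 100_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) UUID.randomUUID();  // SecureRandom-backed
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) timeBased();        // clock read + counter
        long t2 = System.nanoTime();
        System.out.printf("random: %d ms, time-based: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```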

3. Number of UUIDs generated per request

Using time-based generation as the new baseline (as with the earlier test), let's see how batch size (the number of UUIDs generated per request) affects results. As before, generating more UUIDs takes longer, both due to the actual generation and due to increased network overhead for longer response messages:

  • For 10 UUIDs, throughput is a bit lower, 2250 rps (for a response length of 598 bytes)
  • For 100 UUIDs, throughput drops much more, down to 800 rps (response length of 5278 bytes).

Results are once again pretty much what we expected, although longer response messages have an even more profound effect here. This would probably be even more significant over wide-area networks.

One thing that seemed worth investigating at this point was whether the increased request latency of the second test case could be alleviated by using more parallel requests. Testing with 20 concurrent request threads showed this not to be the case: throughput did not increase at all, even in this case. Perhaps the latencies involved are still negligible (it is a high-speed LAN connection, and even 6 kB messages only need 4 - 5 segments to transmit).

4. Parallelism (thread count)

Now that we have the basics covered, let's focus on the parallelism side. Unlike the results so far, which should reflect the results from the local test case, results with different thread counts could display different patterns. This is because there is somewhat less threading overhead (no client-side processing running on the server machine), and because there is a bit more latency due to actual network connections. So, let's see the actual measured throughputs (in requests per second) for various client thread counts (using time-based UUID generation, 1 UUID per request, BIO):

  • 1 thread: 1750
  • 2 threads: 2580
  • 4 threads: 2950
  • 8 threads: 2750
  • 20 threads: 2300 (but with very high fluctuation)
  • 50 threads: 1500 (even heavier fluctuation, 1000 - 2000)
  • 100 threads: 1400 (heavy fluctuation, 1000 - 1900)

So, with this setup, a little bit of parallelism does help: about 4 client threads achieves maximum throughput. Throughput remains quite high for a few more threads, but after 20 threads or so, the overhead starts to take its toll. Interestingly, the fluctuation in throughput also increases, making it harder to really pinpoint the throughput (which could be resolved by a more thorough statistical approach). This fluctuation is not entirely intuitive: while lower throughput makes some sense, its instability does not. Perhaps it is just a symptom of the problems that the statistics-gathering thread has, due to less reliable timing.
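The measurement loop itself can be sketched roughly as follows. This is a generic stand-in harness (the names and the in-process `request` callback are hypothetical, not the actual test client used for the numbers above): a fixed pool of client threads hammers the service for a fixed wall-clock interval, and throughput is the completed-request count divided by elapsed time:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputHarness {
    // Measures completed requests per second for a given number of client
    // threads; 'request' stands in for one HTTP round trip to the service.
    public static double measure(int threads, long millis, Runnable request)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong done = new AtomicLong();
        final long end = System.currentTimeMillis() + millis;
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (System.currentTimeMillis() < end) {
                    request.run();
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(millis + 1000, TimeUnit.MILLISECONDS);
        return done.get() * 1000.0 / millis;
    }

    public static void main(String[] args) throws InterruptedException {
        // In-process stand-in workload: generate one UUID per "request".
        for (int t : new int[] {1, 2, 4, 8}) {
            double rps = measure(t, 500, () -> java.util.UUID.randomUUID());
            System.out.printf("%d threads: %.0f ops/sec%n", t, rps);
        }
    }
}
```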

5. GET vs. POST

But what about the effect of using POST requests instead of GET? It turns out that (as guessed in the previous blog entry) the effect is minor on the server side: throughput with 4 client threads drops only by 100 requests per second, to 2850 rps.

6. HTTP 1.1, persistent connections

Another important HTTP property is the ability to use persistent connections (which in the earlier article was incorrectly referred to as pipelining; pipelining is unfortunately not implemented by most Java HTTP client packages, whereas basic connection reuse, aka persistence, is). It had a very significant effect on the local test case, so it seemed likely to have an effect on this test case as well.

And an effect it has: for the baseline 4 client connections, disabling persistent connections lowers throughput by about two thirds, down to 920 (!) rps. And just to see whether the effect was mostly due to increased latency (from having to keep re-opening connections), another test was done with 8 client threads, with similar results (down to 850 rps, from the earlier 2750). So this appears to be mostly pure processing overhead, even before considering network effects.
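With the standard `java.net.HttpURLConnection` client, connection reuse can be toggled as sketched below; the `http.keepAlive` system property is the documented global switch, and a per-request `Connection: close` header achieves the same for a single call:

```java
import java.net.HttpURLConnection;

public class KeepAliveToggle {
    // Globally disable connection reuse for java.net.HttpURLConnection;
    // must be set before the first connection is opened.
    public static void disableKeepAlive() {
        System.setProperty("http.keepAlive", "false");
    }

    // Per-request alternative: ask the server to close after this response.
    public static void closeAfterResponse(HttpURLConnection conn) {
        conn.setRequestProperty("Connection", "close");
    }

    public static void main(String[] args) {
        disableKeepAlive();
        System.out.println("http.keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```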

But there was one strange thing: after running for about a minute, connections started failing, with error messages indicating something about being unable to establish connections. After investigating the problem for a while, I learned that it was most likely caused by the default Windows XP (non-Professional edition) setting for the number of available so-called ephemeral ports. The default setting is ridiculously low, but can fortunately be changed. However, documentation indicates that although this setting can be changed, the Home edition does suffer from a few networking-related limitations, which may well render it less than useful for testing high-throughput web applications and services.
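For reference, the limit in question is controlled by the MaxUserPort registry value (the upper bound of the ephemeral port range, which defaults to 5000), together with the related TcpTimedWaitDelay (how long closed sockets linger in TIME_WAIT and keep their ports reserved). A sketch of the relevant registry entries; the specific values here are illustrative, not a recommendation:

```
; HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
; raise the upper bound of the ephemeral port range (default 5000, max 65534)
"MaxUserPort"=dword:0000fffe
; shorten TIME_WAIT to 30 seconds (default 240), illustrative value
"TcpTimedWaitDelay"=dword:0000001e
```

A reboot is needed for these TCP/IP parameters to take effect.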

Now, it is also possible that the above-mentioned problem affected the results for this test case. But it seems likely that even if the absolute throughput numbers might be higher on an OS with more robust networking (like Linux, the BSDs, or Mac OS X), disabling persistent connections would still have a significant negative effect on throughput.

7. NIO

At this point, the last significant performance trade-off involves choosing between Jetty's traditional blocking I/O (BIO) listener and the newer asynchronous, NIO-based one (many other containers offer similar choices).

But it turned out to be difficult to test the NIO-based listener with this combination. For some reason, even after resolving the Windows-related problems (which reappeared with this test), the baseline throughput was much lower than expected. So much so that I went ahead and re-ran some of the earlier tests, only to find that those baselines were reduced as well. This was peculiar: almost as if the network switch was having problems (as if it had negotiated a 10 Mbps link speed instead of 100 Mbps, or some such).

So, unfortunately, I couldn't get a fair comparison between NIO and BIO for this setup. I would like to revisit this test some time in the future, but for now I assume that the performance differences would be along similar lines as in the local test case. That is, BIO would slightly outperform NIO, but the difference would not be huge.

8. Next Steps?

At this point, I think I have done enough testing of this little web service. While it would be nice to figure out what the absolute highest throughput would be (after all, in an earlier test it took more than one dedicated client to saturate the server; check the blog history for "Maximum TPS 10k" or such), it's time to move on to other things related to efficient message and XML processing.

Since it has been a while since I last wrote about Woodstox and StaxMate, and especially since there have been interesting new developments, I will probably blog a bit about these two things next.

About me

  • I am known as Cowtowncoder