Our FE Tim Callaghan recently got around to benchmarking recovery performance with the voter sample application. While it was running he asked me what I thought the throughput was going to be. Based on the performance of single node (all on one host) unit tests that xferred > 200 megabytes/sec I said 80 megabytes/sec. The results came in at 3.6 megabytes/sec. Not good.
The first thing I did was crank up the parallelism to see if that would be enough to saturate gig-E. Recovering 6 partitions at once tooled along at 20 megabytes/sec. Also not good enough.
I inserted some timing statements at the recovering partition and source partition to find out where time was being spent. The source partition spent almost no time serializing data. The recovering partition spent almost all of its time loading data into the execution engine. %8.3 of the time not spent in the execution engine was spent copying data from the Java heap to the native heap. I broke out the stupendously awesome Google profiler and ran it on the recovering partition.
I'll stop here and say that if you worked on the Google profiler, and you are in Boston, you should stop by and I will give you the gift of beer. Same goes for Julian Seward.
The following excerpts show parts of the profile output while recovering a single partition. The full profile is available with 1 partition and 6 partitions.
This chunk shows that quite a bit of time is being spent rehashing the table. A complete waste of time considering that the number of entries is known in advance.
This chunk shows a large amount of time spent traversing tree indexes. It is strange that the default index for a materialized view is a tree. VoltDB opts for hash indexes as the default index type in other areas for performance reasons. I created ENG-738 to track this issue.
Next I tried to see how hard it would be to grab some low hanging fruit and improve things. Optimizing the hash index was as easy as sending over the total tuple count and resizing it once at the start.
On the other hand the std::map/red black tree was a tougher nut to crack. Right now we don't support a wide variety of tree indexes that are specialized for specific data types, so any tree index needs to be templated to support arbitrary key types. A pointer to key storage is not sufficient because it involves extra cache misses. We have tried stx::btree in the past, but didn't see a big performance improvement when running TPC-C. We stopped using it because it doesn't work as a multi-map out of the box.
I brought stx::btree_map back for the unique tree map. In the future I think something like a Judy array would be a good addition for small integer to integer indexes like the ones used in the voter schema.
I also modified the IO so that the source partition and recovering partition have a dedicated TCP socket and use DirectByteBuffers to eliminate copies during messaging.
I ran another benchmark and with the new configuration. 30 megabytes/sec with 6 partitions and 6.3 megabytes/sec with one partition. Oh joy. Going back to the profiler I found that the distribution of time hadn't changed much. More useful work was being done thanks to the more efficient tree structure and less time spent copying/rehashing data, but still not enough to saturate gig-E. Here is the full profile with 1 partition and 6 partitions.
The voter schema has a single large table called "votes" table containing a BIGINT phone number and TINYINT contestant number. A little math reveals that 30 megabytes of this tuple data contains 3,495,253 rows. So that is 6,990,506 tree index operations with 10,485,759 index operations. It doesn't sound so bad when you look at it that way. I added a VARCHAR(32) to the schema and reran with 6 partitions and sure enough it streamed well over 80 megabytes/sec. Testing with TPC-C schema and data resulted in throughput over 100 megabytes/sec.
Apparently just because something is in memory doesn't mean it is arbitrarily fast. IO is IO whether you are going to memory or disk. Short of bending the laws of time and space, I can't see making this operation (with random data) any faster, but I would love to be proved wrong. Are there any sorted indexes that optimize bulk loading?
Statements and opinions presented here are not those of VoltDB Inc. unless specifically presented as such.


No comments:
Post a Comment