3.x introduces a number of performance optimizations that help smooth out the performance curve in small clusters and applications where developers choose to use the database synchronously.
All the performance measurements taken for this post were taken using a single measurement per configuration. I had to run a large matrix and I didn't have time to automate or run the tests multiple times to get an average. The 2.x cluster was intentionally set up with NTP misconfigured to demonstrate typical user experience, and it was set up and torn down several times during the testing process so the clock skew varied. The synchronous tests aren't focused on extracting maximum throughput so concurrency was bounded to 25 client threads per server node.
You can click on the graph images to go to the spreadsheet and interactive version of the graph.
Packet pushing limits
The work I have done on 3.x has made me realize that the primary bottleneck for VoltDB is getting packets in and out of each node. You get a fixed number of packets and that number decreases slightly as packet size goes up, but however you divvy up the messaging workload that is the upper bound on how many discrete transactions you can do.The limit on packets also happens to be disturbingly low, especially in EC2 where getting more than 100-200k packet through a node is optimistic. I hear tell that VMWare is better at this, but that is something I need to test myself.
How you measure is significant, because micro-benchmarks will lie and tell you that you should be able to do much more. Once you put a full application on top and start running traffic from multiple places you will see a very different number. It's tempting to think your app is just not shared nothing enough, but you won't see the CPU utilization or contention metrics to support that.
TCP no delay also makes a huge different because it does a good job of coalescing messages, but makes it tricky to adequately support latency sensitive synchronous clients. Not using TCP no delay will increase the number of messages you can send over a socket from 2-9x. The win is much bigger proportionally when you are using Xen on EC2.
The most interesting finding I made is that no matter how much work I did to make VoltDB more shared nothing it all seemed to have no effect. Most of those things were slower in single threaded terms (but more scalable and non-blocking) and resulted in higher CPU utilization but no change in actual throughput.
Client affinity
The first feature I want to introduce is client affinity. Client affinity is the ability of a client library to route a transaction to the node containing the master of a partition when the transaction is a write, and to load balance across the master and replicas if the transaction is read. Client affinity can greatly reduce the number of packets the cluster has to move to service each transaction.
Client affinity requires the client library to poll the database for the stored procedure schema and topology of partitions across the cluster so that it knows how to hash the parameters to stored procedure invocations and once hashed where to route the transaction so that it arrives at the correct partition.
Currently the only client library to implement client affinity is the Java client library. It's a small amount of work to do it in the C++ client library. Idiomatic client libraries like PHP may be a bit more challenging because the API doesn't map well to a clustered database, but it is doable.
Currently the only client library to implement client affinity is the Java client library. It's a small amount of work to do it in the C++ client library. Idiomatic client libraries like PHP may be a bit more challenging because the API doesn't map well to a clustered database, but it is doable.
The following chart compares the performance of a 1-5 node cc1.4xlarge cluster running a 50/50 read write key value workload with a 30 byte key and 1000 byte payload. Every time a server node is added another client node is added. Synchronous clients are set to have 25 threads and there is a synchronous client per client node.

In the RO only workload you can see that client affinity yields a more than 2x improvement. Part of this comes from always routing requests to the correct place, and the other part comes from doubling the number of hosts serving reads via short circuit reads. The improvement continues to be beneficial for the 95/5 workload.
The next chart shows the performance with and without client affinity for a 50/50 read/write workload. The short circuit read optimization doesn't yield as big a benefit here, but it is still beneficial and routing writes directly to the master is still saving performance for 3.x. 2.x (not shown in this graph) on the other hand is peer to peer multi-master and isn't actually possible to route requests to the right place so 2.x tends to pay a request routing penalty all the time.
It's a small lie to say that you can't route to the right place in 2.x. You can route requests to the right place, but that results in worse latency due to the global ordering mechanism relying on constant traffic to propagate ordering information.


You can see that routing requests to the right place gives you that nice straight red line because there is no special case at 1 and 2 nodes where most requests are being routed to the correct node making the base case faster.
Client affinity + replication
With replication enabled it is possible to see a gain in throughput for reads because reads for a given key are now being served by two nodes. This required changes on the server to allow requests for local partitions to short circuit the master for the partition and be served directly by the replica. The following chart demonstrates the effect of load balancing reads using client affinity over different mixes of read/write load.
The next chart shows the performance with and without client affinity for a 50/50 read/write workload. The short circuit read optimization doesn't yield as big a benefit here, but it is still beneficial and routing writes directly to the master is still saving performance for 3.x. 2.x (not shown in this graph) on the other hand is peer to peer multi-master and isn't actually possible to route requests to the right place so 2.x tends to pay a request routing penalty all the time.
It's a small lie to say that you can't route to the right place in 2.x. You can route requests to the right place, but that results in worse latency due to the global ordering mechanism relying on constant traffic to propagate ordering information.
Synchronous clients and latency
Another focus for 3.x was making it possible for users to write applications that make synchronous calls to the database without totally destroying performance.
The following chart shows the performance of synchronous clients on the same 1-5 node k=0 cluster running the same key value workload.
2.x shows no performance increases because the clock skew is high and there is very little concurrency presented to the database. 3.x continues to add throughput at a consistent rate, but per node performance is much better with client affinity. The additional latency of misrouting requests is too much at the low level of concurrency being presented to the database.
Command logging
It's not demonstrated here because I didn't set up command logging. Setting up command logging on EC2 is labor intensive and time consuming because you have to prep the ephemeral disks and that can take an hour each time I bring up a cluster.
A key improvement for synchronous command logging is that writes that are blocked waiting for the command log to sync to disk will no longer block reads. This means lower latency for reads which is critical for synchronous clients.
3.x is able to perform much better without replication because transactions are routed directly to the node that is going to execute them resulting in no forwarding. 2.x has 4 in 5 chance that it is going to have to forward each transaction at 5 nodes.
The following chart shows performance of asynchronous clients for the same 1-5 node k=1 cluster running the 50/50 read/write workload.
3.x benefits from extra capacity by doing reads at replicas and not having to forward write transactions.
Compared to 2.x
The following chart shows the performance of asynchronous clients for the same 1-5 node k=0 cluster running the 50/50 read/write workload.
3.x is able to perform much better without replication because transactions are routed directly to the node that is going to execute them resulting in no forwarding. 2.x has 4 in 5 chance that it is going to have to forward each transaction at 5 nodes.
The following chart shows performance of asynchronous clients for the same 1-5 node k=1 cluster running the 50/50 read/write workload.
3.x benefits from extra capacity by doing reads at replicas and not having to forward write transactions.
Conclusion
For read-only or 95/5 workloads the performance difference between 2.x and 3.x is going to be dramatic. For the 50/50 workloads it isn't going to be quite as striking, but it is still nothing to sneeze at.
If you are using the database with limited concurrency or are failing to configure NTP correctly you should see a huge improvement in throughput and latency.
All these numbers are on EC2. I imagine the bare metal numbers are going to be different, perhaps strikingly so, but these are more reproducible.
You can see the full set of numbers and graphs in the spreadsheets I created. There is one for the 50/50 RW workload w/o replication, the 50/50 RW workload w/replication, and the write only workload w/replication. The write only sheet also contains the 5 node read-only workload comparison.
Nice stuff Ariel, these are impressive numbers.
ReplyDeleteThanks Tim. We went out on a limb not knowing what kind of improvement we were going to see and it really paid off. Wasn't cheap though. 6-9 months wall clock time most of that with two developers and up to four at the end.
DeleteLooks like some pretty nice performance improvements are coming in 3.X. One thing I was confused by: are reads in 3.X still getting put in the global transaction order? If not, is it possible that I might get a different result for my read depending on which replica it is sent to?
ReplyDeleteWrites are synchronously replicated. If you submit a read after a transaction has committed you will get the latest value no matter where the read is done.
DeleteEven with the global order if you submitted two transactions to read the same row to two different nodes you could get the read in the global order before the write. The only way to ensure the read occurs after the write is to submit both of them to the same node and that still works in V3.