Wednesday, May 25, 2011

At an impasse with ZooKeeper

Three days with ZooKeeper hasn't made me an expert so chime in if I have misunderstood how ZK works.

ZooKeeper would be a fantastic fit for VoltDB if we could ask VoltDB users to provide a working ZK cluster. Unfortunately we can't make VoltDB any more difficult to use and deploy. Any plan to integrate ZooKeeper into VoltDB in a way that is invisible to users will run into one of several issues.

VoltDB allows you to replace any failed node with a new node. ZooKeeper requires that you replace a failed node with a new node with the same address. Theoretically you can mask this with DNS, but it isn't  possible for VoltDB to do those DNS manipulations for the user. See https://issues.apache.org/jira/browse/ZOOKEEPER-107

Operationally it is possible to massage a ZooKeeper cluster into accepting a node with a new address and it is even possible to expand a ZooKeeper cluster, but the process is not well documented and involves bouncing several node several times with several different configs. I plumbed the mailing list and found a description of two ways to do it, but I am not confident it is something I could automate correctly in the face of concurrent failures. After failures we would have to pause the cluster and rebuild ZooKeeper for the new cluster topology (all without the aid of ZooKeeper).

Another issue is that ZooKeeper can't tolerate failures in a two nodes cluster while VoltDB can. If a node fails in a two node cluster then the blessed node will continue (if it wasn't the failed node) or the cluster will shut down. We would need ZooKeeper to match this behavior which entails the rebuild process from the previous paragraph.

The last issue is that ZooKeeper requires a quorum, and not a quorum of the available nodes, but a quorum of the original nodes when the ZooKeeper cluster was started. With a k=2 VoltDB cluster it is possible to lose 2 of the 3 nodes and still continue operating if the nodes don't fail at the same time. When the first node fails the two survivors agree to boot the third, and when the 2nd node fails the blessed one would be the one to continue operating.

I am sure that with enough hacking we could make ZK do our bidding, but it is starting to seem like implementing a simpler ephemeral agreement service might be easier. Right now I am looking into what parts of ZooKeeper can be reused for this. It seems like replacing QuorumPeer with something based on the VoltDB agreement model might work and would provide the same guarantees in the face of failure. I need to finish understanding QuorumPeer before I can decide if I should enhance it to support what we need or replace it. It may be that adding some smarts for new peers and replacing the leader election algorithm is enough.

Interestingly, ZooKeeper is very similar to a single partition many replica VoltDB when it comes to its durability strategy. ZK has in-memory storage that is not designed to swap, a command log that records mutations, and occasional snapshots of the entire in-memory structure to truncate the command log. VoltDB does the exact same thing with the addition of SQL, stored procedures, distributed txns, and a different agreement model. ZK uses a quorum approach to improve latency in the face of inconsistently performing nodes, but that is trading off space for speed which wouldn't work for VoltDB (txns must be fully replicated) where the expectation is that you will be storing more than just metadata.

Improved latency isn't a priority for VoltDB's agreement model because latency inside a single partition stored procedure is measured in microseconds. Instead it optimizes for throughput, cluster size, and synchronous replication with the expectation that the latency of invoking a single stored procedure is equivalent to or better than the latency of the multiple round trips required when not using a stored procedure.

No comments:

Post a Comment