Monday, May 23, 2011

Starting with ZooKeeper

One of the tasks for command logging is adding support for log truncation. To truncate the log a non-blocking snapshot has to be taken and once the snapshot has been successfully completed at every node an event must be generated that notifies the command log at each node that it can truncate log data before the txn of the snapshot.

This is one of several somewhat asynchronous and somewhat background tasks that the system has to support. In the past we have implemented these things as a mix of stored procedures that are capable of running somewhat arbitrary code, or as arbitrary code dumped directly (or indirectly) into org.voltdb.ExecutionSite (online snapshot, rejoin, some stats) , or as standalone classes squirreled away somewhere (snapshot daemon).

The biggest hammer has always been the store procedure because it was a mechanism that allows for distributed coordination of work via the execution of plan fragments. The mechanism is ugly and verbose, but unambiguous (formalized, refactorable, improvable in aggregate) and understandable once you get the hang of reading them. The problem with stored procedures as a mechanism for distributed coordination is that they block the entire cluster for the duration of the procedure even if the thing you want to do isn't transactional, only has a small portion that needs to be transactional, or only involves a few entities in the cluster.

Without the stored procedure hammer you end up in a world of tightly correlated pain doing coordination via  the mailbox system. The mailbox system makes communication with local and remote processes consistent and it decides when a remote mailbox is no longer considered reachable.

Participants in any process always need to keep one eye open to see if another participant fails, and if a failure occurs the participants need to stop and agree on the failure and on the resulting state. The agreement is necessary because they may not have a consistent view of state from the failed participant(s) due to messages from the failed participant(s) only being delivered to a subset of the survivors.

Mailboxes also force a tight coupling between the publishers of state/intent and those interested in that state/intent. This tight coupling makes makes adding monitoring and manipulation by external processes more labor intensive and error prone.

If a process has a coordinator, and the coordinator of that process fails, then you need to elect a new coordinator for that process and resume where the previous one left off. That means acquiring the former coordinators state which means that state must be durable and replicated. If there are many kinds of state this can become verbose.

Mailboxes also have no persistence of any kind, which forces you to develop a separate durability mechanism for every piece of state you would like persisted beyond a restart or power failure.

For these reasons we decided that I should segue into ZooKeeper to see if we can simplify the implementation of new processes, and refactor existing ones. My first familiarization task is to modify VoltDB so that the initial catalog is stored and distributed with ZooKeeper, and to make the discovery and TCP mesh construction process coordinate via ZooKeeper.

Near as I can tell these are the things ZooKeeper gives you for free:
1. Atomic and durable publishing of state/intent, possible to build further isolation on top of that
2. Trivial leader election for any process via ephemeral nodes
3. Pub/sub messaging
4. Test and set, monotonically increasing auto-unique, no versioning though

You don't get much in the way of failure detection from ZooKeeper. Visibility to ZooKeeper is a requirement for any VoltDB node, but visibility to ZooKeeper does not guarantee visibility to all the other nodes in the system. The TCP mesh is something that a tightly coupled system, like a database that supports distributed transactions, require at all times. Any time a link fails the cluster has to stop and agree on membership. This means that ephemeral nodes in ZooKeeper are not totally sufficient for leader election unless you have an external mechanism for booting nodes that can't participate in the mesh.

I can see how ephemeral nodes would be more useful in a more loosely coupled systems where things can fail independently.

ZooKeeper has throughput and latency limitations so you still need separate links for messaging, and those won't guarantee atomic delivery across multiple links. This means processes that can't coordinate through ZooKeeper due to volume/latency still need to stop and agree on membership and transaction ordering in the event of a failure.

I am really enjoying working with ZooKeeper. It took 40 lines of code to wrap ZooKeeper for unit tests.

No comments:

Post a Comment