Despite the intentions of the sharded shared nothing design there was shared state in several places and deep rooted assumptions that a single instance of some piece of state existed on each node. The singleton aspect of these assumptions was core to the global ordering process, replication, and failure detection/resolution which made it very hard to correct incrementally. Additionally the global ordering process from H-Store required n2 heartbeating between all nodes in the cluster and that placed hard limits on latency and cluster size.
Part of this rewrite is a re-factoring of the cluster management, messaging fabric, and NIO libraries into an independent package we have been calling Volt Core.
The other half of the re-factoring is a rewrite of how single and multi-partition transactions are coordinated and replicated. We are moving from the global ordering mechanism used in H-STORE to a simpler sharded master/slave model with leader election. The initiator works is being done on the iv2-wip branch and it includes the work done for Volt Core however Volt Core will ship first. The branch has been deleted since this blog post was first written.
The goals of this rewrite are:
- Make new features easier to develop and make existing functionality easier to maintain
- Scale up to an arbitrary number of sockets and cores and better saturate 10-gig E
- Scale out to hundreds of nodes instead of tens of nodes
- Reduce latency to the hundreds of microseconds range and eliminate NTP as a dependency
Volt Core
Volt core is a re-factoring of our homegrown messaging fabric, NIO library, and embedded ZooKeeper into a standalone functionality that operates independently from all the things that make VoltDB a sharded SQL database.
The NIO library was rewritten to use a shared nothing selector per thread model instead of ye olden circa 2001 reactor model. Sockets are assigned to a single thread + selector so that network threads do not have to share state, and state associated with a specific socket does not have to traverse threads. Statistics are collected locally in each thread and gathered by sending an event requesting the stats be collected.
Message serialization was rewritten to correctly calculate the size of each message before allocating memory instead of using the doubling algorithm that FastSerializer uses.
Previously the embedded ZooKeeper was not available at startup because it require that the messaging fabric and the majority of the cluster topology to be set up. The problem was that we special cased startup and there was a lot of voodoo in SocketJoiner/HostMessenger to set up the TCP mesh that was also responsible for syncing a bunch of arbitrary state using hand coded binary serialization.
Another issue with ZooKeeper was that it was bolted on after failure detection had been implemented and failure detection was never re-written on top of ZooKeeper. In fact ZooKeeper required the existing failure detection mechanism to be notified of failures so it could update the state of ephemeral nodes. This made it impossible to use ZooKeeper for failure detection and to a certain extent discovery.
HostMessenger is a clearing house for most of the functionality in Volt Core. It contains a pool of network threads, the embedded ZooKeeper server and client, and SocketJoiner. SocketJoiner is always running and ready to accept new nodes into the cluster.
The leader node specified by IP address bootstraps it's embedded ZooKeeper instance and makes no attempt to connect to itself using SocketJoiner. The rest of the nodes use their SocketJoiner to connect to the leader which advertises the address of all cluster members and waits for the new node to finish setting up the mesh, and then kicks off a ZooKeeper transaction to bring the new embedded ZooKeeper server into the cluster.
Nodes that are part of the cluster have access to the mailbox system and ZooKeeper. Mailboxes can be constructed without any distributed coordination. One of the more painful changes was making site ids 8-byte longs that contain a 4-byte host id and a 4-byte site id. The host id can be used to route a message to the correct host without any knowledge of whether the site exists on that host. The advantage of adding the extra bytes is that IDs for sites and hosts are no longer reused which simplifies failure and rejoin as well as persistence mechanism that log failure information to disk.
Now that ZooKeeper is available at startup we use it for discovery and failure detection. MailboxPublisher is used to publish all mailbox information to a single ephemeral ZK node. Publishing to a single ZK node causes nodes to "fail" atomically so it isn't possible to see a some of the entities on a node fail while others are still live.
MailboxTracker does the work of monitoring the nodes and their contents. The process is complicated by the fact that ZK watches need to be reset carefully. If you aren't careful you end up with multiple watches set on a single node and each watch will be fire for a single event. Additionally, node deletion results in the data watch of the deleted node firing as well as the child watch of the parent node resulting in two events for what is really one event from the perspective of the application.
MailboxTracker passes updates to a callback in RealVoltDB that constructs a SiteTracker as well as generating failure/rejoin events when nodes come/go. SiteTracker is an immutable POJO representation of all the publicly reachable mailboxes in the cluster. Most things that use SiteTracker use a cached reference so that the topology of the cluster remains consistent until failure/rejoin processing is explicitly reached.
MailboxTracker does the work of monitoring the nodes and their contents. The process is complicated by the fact that ZK watches need to be reset carefully. If you aren't careful you end up with multiple watches set on a single node and each watch will be fire for a single event. Additionally, node deletion results in the data watch of the deleted node firing as well as the child watch of the parent node resulting in two events for what is really one event from the perspective of the application.
MailboxTracker passes updates to a callback in RealVoltDB that constructs a SiteTracker as well as generating failure/rejoin events when nodes come/go. SiteTracker is an immutable POJO representation of all the publicly reachable mailboxes in the cluster. Most things that use SiteTracker use a cached reference so that the topology of the cluster remains consistent until failure/rejoin processing is explicitly reached.
Initation V2
If you aren't already familiar with how Volt initiates and replicates stored procedures you should read http://voltdb.com/company/blog/transaction-ordering-and-replication. The problem with maintaining a global order is that transactions can't be executed until the global order is known, and the way this is done is direct messaging. In an idle cluster heartbeats drive the global ordering process and the heartbeat interval places a lower bound on latency. The global ordering mechanism also required full replication before a transaction could be executed further increasing the latency penalty as well as requiring additional messaging because heartbeats required responses.
With enough load latency wasn't terrible, but many workloads don't present enough load and as you scale up cluster size the frequency with which nodes communicate starts to approach the worst case of the heartbeat interval.
The V2 version of transaction initiation is basic master slave replication for single partition transactions. Every partition has a master where transactions for that partition are initiated. Transaction ordering for the partition is established by virtue of having a single node initiating the transactions for that partition. Because ordering is established unilaterally the transaction can be executed immediately at the master and its slaves and the response can be forwarded immediately to the client.
If a partition master fails a new master is elected and the new master + surviving replicas must agree on the last transaction from the old master, and any transactions not present at all replicas must be re-replicated before execution continues. One key difference here is that store procedure invocations must be retained by slaves until the master tells them that the transaction has been fully replicated.
Another option is to retain undo information until the master reports the transaction as fully replicated and roll back in the event of partial replication, but that is slightly messier because the client may still require a response even if the response is failure. It also isn't much simpler to retain the undo log vs. retaining the invocation.
Multi-partition transactions are now initiated by a single multi-partition initiator. Notice a pattern here? Single master, the only way to fly. Because there is a single multi-partition initiator it doesn't matter where the multi-partition transaction is scheduled at each partition master because only one multi-partition transaction can run at a time and it is commutative WRT to single partition transactions.
If a participant in a multi-part transaction fails it is up to the multi-part initiator to resolve the failure by ceasing to expect a fragment response from the failed participant. If the multi-part initiator fails the cluster must agree on the fate of the most recent multi-partition transaction and a replacement multi-part initiator must be elected.
Hi Ariel!
ReplyDeleteThanks a lot for the write up! Looks like Volt is shedding one of the magic ingredients with the shared-nothing parallel transaction ordering: why is it being replaced, from your post I am understanding it is to reduce latency?
Best,
Henning
Hi,
DeleteWe are moving towards something that is more shared nothing and parallel than what we have now. Right now multi-part and single part transactions are ordered together via a peer to peer agreement protocol. This protocol requries many to many communication thousands of times a second and yields high latency considering that all the data is in memory.
There is no reaso to order them together since single part transactions are commutative with multi-part transactions.
With iV2 single partition transations no longer pay a distributed coordination penalty and multi-partition transactions don't pay a coordination penalty either because there is a single coordinator. It turns out that making a single master redundant is much easier then attaining redudancy through convoluted peer to peer protocols. Single masters can also make unilateral decisions without agreement which yields much lower latency.
Ariel
Thanks for clarifying!
ReplyDeleteCan you help me getting my head around how single- and multi-part can be commutative? Is there an implication that they will only ever affect disparate parts of the data?
What makes the master the master, that it is the one node that can authoritatively confirm a write? And supply a read without asking other nodes?
Henning