Time to come up for air. I finally got all the ZooKeeper integration done. There was a disaster involving tabs in the ZooKeeper code that required some rebasing, and during that process revision 1880 got merged with the revision that was supposed to follow it, so the commit message isn't right and only mentions the compression service. I definitely wasn't squashing anything, so I am not sure what is wrong with my git-fu. Revision 1879 contains the initial pass, which has issues with concurrent failures during recovery. AgreementSite changes quite a bit between 1879 and 1880.
If the mailing list is any indication, they have some cool scripts for validating patches submitted to the project, so it is strange that they don't have a commit hook that cracks down on tabs. Maybe I am projecting my loathing of tabs too much.
I really wanted to do a better job on this and release something that functions independently, so it would have utility outside of VoltDB, but there is a lot of time pressure to finish command logging. It also feels pretty horrific not to have brought along all the ZooKeeper tests, and I am concerned about bringing in changes from future versions of ZooKeeper. There are some really nice enhancements in the pipeline, like ephemeral parent znodes and multi-znode updates.
I was able to get a ton of reuse out of RestrictedPriorityQueue, ExecutorTxnIdSafetyState, and TransactionIdManager. I think that validates the simplicity of the global ordering approach. That stuff dropped right in without any issues. The failure handling was still hard to get right, but that was mostly my fault for not following the established pattern. All told, it still went faster than doing failure and recovery the first time with ExecutionSite.
Modifications to ZooKeeper
I ended up removing the quorum package and replacing it with org.voltdb.agreement.AgreementSite, which is similar to an execution site except that it is also an initiator. The agreement sites are all replicas of a single partition, and that partition contains the globally agreed upon state in ZooKeeper format. I kept org.apache.zookeeper.server.ZooKeeperServer and org.apache.zookeeper.server.NIOServerCnxn, so as far as clients are concerned we are still speaking the ZooKeeper wire protocol.
I modified the ZooKeeper client so it no longer does any ping and timeout work, because the connections are always local. The ping stuff would be nice to keep because it ensures failure detection in the case of liveness issues and would allow us to fail the process instead of halting, but that is another thing I will have to come back to.
The ZooKeeper server-to-server stuff is all hijacked via a callout interface passed to ZooKeeperServer at instantiation time. The callout passes all requests to the AgreementSite, which agrees on global ordering and replication with all the other AgreementSites before submitting each request to ZooKeeperServer for execution. The callout is also used to get session ids (derived from txn ids) to the NIOServerCnxn.Factory thread.
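Conceptually, the callout is just a narrow interface: requests flow out to the AgreementSite, and session ids flow back to the connection layer. A rough sketch of the shape, with made-up names (the actual VoltDB interface may look different):

package org.voltdb.agreement;

import org.apache.zookeeper.server.Request;

// Illustrative sketch only; the names are invented, not the real
// VoltDB interface.
public interface AgreementCallout {
    // Hand an incoming request to the AgreementSite, which agrees on a
    // global order and replicates it with its peers before the request
    // is actually executed by ZooKeeperServer.
    void processRequest(Request request);

    // Session ids are minted from txn ids so they are globally unique;
    // the NIOServerCnxn.Factory thread asks for one when a client connects.
    long createSession(int sessionTimeout);
}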
I collapsed the code in PrepRequestProcessor and FinalRequestProcessor into ZooKeeperServer and made the entire thing single threaded. The fine-grained locking necessitated by the whole RequestProcessor chain made me nervous, and it was not something I wanted to deal with.
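To make the single-threaded shape concrete, here is a purely illustrative sketch (not the actual code): one thread drains the globally ordered request stream and does the prep and final work inline, so no locking is needed between stages.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.zookeeper.server.Request;

// Illustrative sketch only; names are made up. One thread does what the
// PrepRequestProcessor -> FinalRequestProcessor chain used to do across
// multiple threads.
final class SingleThreadedServerLoop implements Runnable {
    private final BlockingQueue<Request> ordered = new LinkedBlockingQueue<>();

    public void run() {
        try {
            while (true) {
                Request r = ordered.take(); // arrives in agreed global order
                prep(r);  // roughly PrepRequestProcessor: validate, build the txn
                apply(r); // roughly FinalRequestProcessor: mutate the DataTree, reply
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void prep(Request r)  { /* validation elided */ }
    private void apply(Request r) { /* execution elided */ }
}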
I also removed disk-based durability, since we are only using this as a coordination mechanism. That also helped get rid of threads, since file IO in Java is blocking. If we want to add durability back we can do it with our existing snapshot and command log infrastructure. We don't want a completely separate process ruining our carefully orchestrated sequential IO anyway. Kudos to ZooKeeper for pre-allocating those log files.
No resuming sessions
I also gave sessions the axe, at least as far as resuming a session at another node goes. As far as we are concerned clients will always be local, so we are not worried about retaining ephemeral nodes across a network partition. Ephemeral nodes still work in that closing a client connection will close the session, and failing (or network partitioning) a VoltDB node will cause all sessions hosted at that node to be closed globally.
I could do the work to keep sessions, but the value proposition isn't there. If you avoid ephemeral nodes when connecting from remote hosts you should still get the right behavior.
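Since we kept the wire protocol, client code using ephemeral nodes looks exactly like it does against stock ZooKeeper. A minimal sketch with the standard client API (the connect string and paths are made up, and the parent znode is assumed to exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // The znode lives only as long as this session. Closing the
        // connection (or failure of the hosting VoltDB node) removes it
        // globally.
        zk.create("/liveness/node1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close(); // session ends and /liveness/node1 disappears
    }
}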
What I ended up with is nowhere near as generally applicable as ZooKeeper, but it is sufficient for a tightly coupled system like VoltDB, and certainly easier to embed. It is basically a transactional peer-to-peer agreement service. Now that the agreement and recovery mechanism match what we would normally do, it will be easier to support the same set of topology changes. The stronger guarantees we can make to clients (since clients are local and in-process) with respect to failure conditions are also nice.
Thoughts on ZooKeeper
I liked ZooKeeper enough to invest a month adapting it for VoltDB, so understand that I like the project even though some things were not to my taste.
I found the ZooKeeper code to be very DRY, to the point of being difficult to follow. The relationship between ZooKeeperServer and its descendants, and how they reconfigure themselves (or override each other) for use in different roles, is pretty complex. The relationships between the various Quorum classes are complicated in the same way. Determining exactly how each one is configured is a challenge: multiple components are configured differently from multiple places, and method overriding means configuration happens in the derived classes as well as the base classes.
It's not a problem once you get it, though. It took about a week to be confident enough with the innards to start extricating everything from the quorum way of doing things. I still don't think I can trace the life of a request through the request processors and sync/ack dependencies, though.
I wasn't a huge fan of the shared state requiring synchronization, such as ZooKeeperServer, which is referenced by the request processors and by a callout from SessionTracker. As far as I can tell the locks only go one level deep, so out-of-order lock acquisition is probably not a huge issue, but reasoning about the possible interleavings is pretty hard since they don't follow any general pattern. The request processor chains seem like something that could be simpler and single threaded. If the goal is parallelism, I have to wonder if there is a more explicit way to discover it in the workload.
The NIO implementation is impressively compact (single-threaded NIO is the only way to roll), but it looks like there is some duplication in the client and server code. They are actually coded pretty differently, which makes me wonder why the network code isn't factored out more. I haven't looked at that stuff too closely since I didn't make many changes to it.
Up next, log truncation
I can finally get back to log truncation, which is the thing that got me started on ZooKeeper. The process is asynchronous in nature and requires distributed coordination and leader election. Rather than accrue more technical debt, we decided to bite the bullet and get ZooKeeper going.
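For the leader election part, the usual ZooKeeper recipe is ephemeral sequential znodes: every participant creates one under an election parent, and whoever holds the lowest sequence number leads. A minimal sketch using the standard client API (not VoltDB code; the connect string and paths are made up, the parent znode is assumed to exist, and watching the predecessor for failover is elided):

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TruncationLeaderElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Each participant creates an ephemeral sequential znode; ZooKeeper
        // appends a monotonically increasing suffix to the name, and the
        // znode vanishes if this session dies.
        String me = zk.create("/truncation/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The lowest sequence number wins the election.
        List<String> children = zk.getChildren("/truncation", false);
        Collections.sort(children);
        if (me.endsWith(children.get(0))) {
            System.out.println("elected to coordinate log truncation");
        } else {
            System.out.println("standing by");
        }
    }
}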