The command log in VoltDB consists of stored procedure invocations and their parameters. The log is written by the execution sites at each node. Because execution sites contain partition replicas, you end up with a replicated command log that you can dedupe at replay time. Transactions are logged concurrently with the global ordering and safe replication process in order to minimize the added latency of logging synchronously. With a battery-backed cache there is zero added latency. I still need to experiment with a plain disk to find out how often you can get fsyncs with a preallocated file.
The command log is truncated by taking a non-blocking ACID snapshot of the database and then discarding the log entries that precede the transaction id of the snapshot. Replay is done by loading the snapshot and then replaying all the log entries with a transaction id after the snapshot.
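The truncation/replay rule above boils down to a simple predicate on transaction ids. Here is a minimal sketch (not VoltDB's actual code; `LogEntry`, `entriesToReplay`, and the field names are hypothetical) of selecting which entries survive a snapshot:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical log entry: a stored procedure invocation tagged with a txn id.
record LogEntry(long txnId, String procName) {}

class ReplayFilter {
    // Replay loads the snapshot first, then applies only the entries whose
    // transaction id is greater than the snapshot's; everything at or before
    // the snapshot txn id is already reflected in the snapshot and can be
    // truncated from the log.
    static List<LogEntry> entriesToReplay(List<LogEntry> log, long snapshotTxnId) {
        return log.stream()
                  .filter(e -> e.txnId() > snapshotTxnId)
                  .collect(Collectors.toList());
    }
}
```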
Some interesting findings WRT disk-based logging in general.
You want to preallocate log files. Extending a file synchronously requires two disk seeks, because the change in file length has to be logged in addition to the actual data. (If you know you will have a battery-backed cache then this isn't necessary.) Another advantage is that you can bound the size of the log and stop taking write requests when the log is full, which in turn bounds the amount of time it takes to replay the log. If your log is growing that big, there is a problem with the truncation process, and you want to find out before recovery time becomes prohibitive.
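Preallocation can be sketched as follows, assuming a fixed log size chosen up front. Writing actual zeros (rather than just calling `setLength`, which can leave a sparse file) forces the blocks to be allocated, so later synchronous appends never change the file length. The class and method names here are hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

class LogPreallocator {
    // Preallocate a fixed-size log file by writing zeros up front. After this,
    // synchronous appends only overwrite existing blocks, avoiding the extra
    // metadata seek for a length change, and space usage is bounded in advance.
    static void preallocate(Path path, long bytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw")) {
            byte[] zeros = new byte[64 * 1024];
            long remaining = bytes;
            while (remaining > 0) {
                int chunk = (int) Math.min(zeros.length, remaining);
                raf.write(zeros, 0, chunk);
                remaining -= chunk;
            }
            raf.getFD().sync(); // pay for the data + length update once, up front
        }
    }
}
```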
I am really into the preallocation thing. I like having predictable space usage that can't be interfered with by a runaway external process. I would also like the snapshot process used for log truncation to write to preallocated files so that its space is guaranteed as well, but that isn't likely to happen soon. Some other, more questionable, advantages are no fragmentation and giving the filesystem an opportunity to lay things out sequentially.
If you want 1-millisecond fsyncs you need to use a different thread for each fsync. If you use a single thread for all the logging then you will get no respect from the scheduler. Generally speaking, you won't get timely behavior from Thread.sleep() or Object.wait() in threads that have clocked a lot of execution time. You may also see threads that have clocked a lot of execution time disappear for a few milliseconds every once in a while, even when they are ready to run. Initially I just kicked off a new thread before starting each fsync, but that made Eclipse angry when attaching a debugger, so I switched to a circular thread pool. A dedicated thread for log IO means that adding an entry to the log only involves interacting with a blocking queue, and that minimizes contention for the log.
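The producer-side story (append is just a queue interaction) can be sketched like this. This is a minimal illustration, not VoltDB's code: the in-memory `written` list stands in for the on-disk file, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producers only touch a blocking queue; a single dedicated thread drains it
// and would perform the write + fsync, so producers never contend on the log.
class LogWriter implements AutoCloseable {
    private static final byte[] POISON = new byte[0]; // shutdown sentinel
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
    private final List<byte[]> written = new ArrayList<>(); // stand-in for the log file
    private final Thread writer = new Thread(this::drain, "command-log-writer");

    LogWriter() { writer.start(); }

    // The only producer-side interaction with the log is this queue put.
    void append(byte[] entry) throws InterruptedException {
        queue.put(entry);
    }

    private void drain() {
        try {
            for (byte[] e = queue.take(); e != POISON; e = queue.take()) {
                written.add(e); // real code: write to the preallocated file, fsync
            }
        } catch (InterruptedException ignored) { }
    }

    int entriesWritten() { return written.size(); } // safe to read after close()

    @Override public void close() throws InterruptedException {
        queue.put(POISON);
        writer.join(); // join gives a happens-before edge for reading `written`
    }
}
```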
Because the log is replicated, you will end up with several logs to choose from at replay time. If all nodes terminated at the same time then you can flip a coin and pick any log, as one is as good as the next. But if some nodes witnessed failures and survived past them, their logs will contain transactions that were acked to the client but are missing from the others. You could scan every log and find out which ones are the longest, but if they are close enough you might not actually be able to tell down to the very last transaction, and might miss one that was acked to the client.
Survivors definitely have to log any failures they witness so that the failed node's log can be discarded. Failures are infrequent, so there is no reason to put them in the sequential command log; keeping them separate saves you the trouble of a preliminary scan up front to decide which logs to use during replay.
I should mention that when a node becomes unresponsive, the cluster stops making progress until it can agree on the set of failed nodes. The agreement is always over the entire set, not just what is newly failed. Failures logged to disk also include the entire set, and failures are always logged to disk synchronously.
So what happens if node B fails, and A and C witness the failure? Then node A fails and C is the only survivor. Then B rejoins the cluster. On disk, A will have B logged as failed, and B will have A logged as failed. If the whole thing comes crashing down and you bring A, B, and C up for replay, you will have two nodes each claiming the other's log is invalid, when in reality B's log is valid. In this k=2 case it isn't an issue because you only need C to recover, but it is possible to construct scenarios where discarding B's log would result in an unrecoverable cluster. Now you need a way to determine which set of failures is authoritative.
Clock skew makes timestamps unattractive, so we will go with sequence numbers. Now we have yet another piece of metadata to propagate when a node rejoins the cluster and when a cluster is restarted. You definitely don't want to reset to 0 in case an old node with stale fault logs is brought back.
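Resolving the authoritative failure set then becomes a comparison of sequence numbers across the surviving fault logs. A minimal sketch, assuming each logged failure record carries the full failed-node set plus a cluster-wide sequence number (the record and class names are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical on-disk failure record: the entire failed-node set at the time
// of agreement, tagged with a monotonically increasing sequence number
// (timestamps are unattractive because of clock skew).
record FailureRecord(long seq, Set<Integer> failedNodes) {}

class FailureResolver {
    // The authoritative failed set is simply the record with the highest
    // sequence number found across all surviving logs.
    static FailureRecord authoritative(List<FailureRecord> records) {
        return records.stream()
                      .max(Comparator.comparingLong(FailureRecord::seq))
                      .orElseThrow();
    }
}
```

In the A/B/C example above, A's record marking B failed carries a lower sequence number than the later agreement, so it loses to the newer record and B's log is not wrongly discarded.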
So how does the log behave when you rejoin a node? For now, a rejoining node always starts empty, including discarding its log data. One of the first steps in the rejoin process is having the rejoining node log the current failed-node set (discovered from the cluster) along with itself. The log on the rejoining node isn't viable for replay, so it must mark itself as failed until it has logged enough data to be replayable. The rest of the nodes in the cluster then log the rejoining node as not failed once it has finished marking itself as failed, so that there is no window in which the rejoining node's log is mistakenly considered viable.
There is some dark magic involved in replay. I am not going to go into it in detail (can't kill all the mystery), but we intend to support topology changes where the cluster grows and some where it shrinks (presumably a failed node can't be brought back). Not only are the logs replicated, but each log file contains streams of transactions from several execution sites, and the log is not sorted. For replay purposes the transactions have to be emitted in order, so it is necessary to sort the entries.
Turns out that it can be done in memory with bounded space and a single sequential IO pass.
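One way to sketch the core idea, under the assumption that each execution site's entries appear in txn-id order within the file and that the per-site streams have been demultiplexed: emitting a global order is then a k-way merge that only ever buffers one pending entry per site, which is the bounded space. (This is an illustration of the technique, not the actual implementation; the names are hypothetical.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// k-way merge of per-site entry streams, each already in txn-id order.
// Memory is bounded: the heap holds at most one pending entry per site.
class ReplayMerger {
    record Entry(long txnId, int siteId) {}

    static List<Entry> merge(List<List<Entry>> perSiteStreams) {
        record Head(Entry entry, Iterator<Entry> rest) {}
        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparingLong((Head h) -> h.entry().txnId()));
        for (List<Entry> stream : perSiteStreams) {
            Iterator<Entry> it = stream.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
        List<Entry> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            ordered.add(h.entry()); // globally smallest remaining txn id
            if (h.rest().hasNext()) heap.add(new Head(h.rest().next(), h.rest()));
        }
        return ordered;
    }
}
```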