It's more likely than you think. Nothing about the architecture of VoltDB prevents the use of write-ahead logging (WAL). A global order of txns exists, and if the txns are persisted to disk before the response is relayed to the client then they can be replayed after a failure. Truncating the log is as easy as identifying the txn id where a snapshot begins and then truncating everything before it once the snapshot successfully ends. The easiest way to make the log truncatable would be to create a new log file every time a snapshot is initiated. Hat tip to Cassandra for that idea.
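To make the truncation idea concrete, here is a minimal in-memory sketch of a log that rolls to a new segment whenever a snapshot is initiated. The class and method names (`SegmentedCommitLog`, `begin_snapshot`, `snapshot_succeeded`) are made up for illustration and are not VoltDB APIs, and a real log would use files rather than Python lists.

```python
class SegmentedCommitLog:
    """Toy commit log: one segment per snapshot, kept in memory for illustration."""

    def __init__(self):
        self._next_id = 0
        self.segments = {}  # segment id -> list of (txn_id, payload)
        self._roll()

    def _roll(self):
        # Open a fresh segment and make it the append target.
        self.current = self._next_id
        self.segments[self.current] = []
        self._next_id += 1
        return self.current

    def append(self, txn_id, payload):
        self.segments[self.current].append((txn_id, payload))

    def begin_snapshot(self):
        # Txns logged from here on are NOT covered by the snapshot,
        # so they go into a new segment. The returned id is the first
        # segment that must survive truncation.
        return self._roll()

    def snapshot_succeeded(self, first_live_segment):
        # Everything written before the snapshot began is now redundant.
        for seg_id in [s for s in self.segments if s < first_live_segment]:
            del self.segments[seg_id]

    def replay(self):
        # Yield surviving txns in the order they were logged.
        for seg_id in sorted(self.segments):
            yield from self.segments[seg_id]


log = SegmentedCommitLog()
log.append(1, "a")
log.append(2, "b")
marker = log.begin_snapshot()
log.append(3, "c")
log.snapshot_succeeded(marker)
print(list(log.replay()))  # [(3, 'c')]
```

If the snapshot fails, `snapshot_succeeded` is never called and nothing is truncated, so replay still starts from the previous snapshot.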
A typical VoltDB deployment with WAL would require a commit log disk and a snapshot disk. A battery-backed cache for the commit log disk is a must if you want synchronous commits. RAID 1 isn't necessary because the commit log and snapshots are also being generated at other replicas. The snapshot disk doesn't need to be high end, although doing snapshots more frequently will speed recovery by requiring less of the log to be played back. You certainly don't want the snapshot device competing for resources with the commit log drive and the battery-backed cache. All the disk IO is sequential and there are no application reads, so how much IO you need is going to depend on the vertical size of your nodes and the nature of your workload. Many small procedure invocations won't require much IO, but larger invocations arriving over 10GbE could easily exceed the capacity of cheaper disks and become a bottleneck.
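A quick back-of-the-envelope calculation shows how the workload shape drives the log bandwidth. The invocation rates and sizes below are hypothetical numbers picked for illustration, not measurements.

```python
def log_bandwidth_mb_per_s(invocations_per_sec, avg_invocation_bytes):
    """Sequential write bandwidth needed to log every invocation, in MB/s."""
    return invocations_per_sec * avg_invocation_bytes / 1e6


# Hypothetical workloads (assumed numbers, not benchmarks).
small = log_bandwidth_mb_per_s(100_000, 200)      # many small invocations
large = log_bandwidth_mb_per_s(10_000, 100_000)   # fewer, much larger invocations

# Raw line rate of 10 gigabit Ethernet, ignoring protocol overhead.
TEN_GBE_MB_PER_S = 10e9 / 8 / 1e6

print(small, large, TEN_GBE_MB_PER_S)  # 20.0 1000.0 1250.0
```

Under these assumptions the small-invocation workload needs only a modest sequential write rate, while the large-invocation workload approaches what the network can deliver.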
The next question is where to maintain the WAL.
Option A: Every initiator in the system could persist invocations and global ordering messages to disk before forwarding txns to the involved partitions that will execute them. A somewhat hidden caveat is that this introduces a SPOF because initiators aren't replicated. This means at least RAID 1, and access to ALL failed nodes (or at least their commit logs) in order to recover.
Option B: The involved partitions could persist the txns to disk once the global ordering is complete. That might simplify the replay process although I haven't decided how true that is. The disadvantage is that the txn is not being persisted to disk concurrently with the global ordering process and this adds latency that is mitigated to some degree by a battery backed cache.
Option C: Another option is for every involved partition to persist txns to disk in the order they are received and then rebuild the global order on playback. The disadvantage is that heartbeat and txn ordering messages would also have to be persisted so that the global ordering can be rebuilt at the involved partitions on playback. Code-complexity-wise this may or may not be simpler, because it follows the regular operational code path of building the global order. The overhead of rebuilding the global order is computationally negligible and the IO shouldn't be too bad for average-size clusters. The advantage of this approach is that the involved partitions can begin persisting txns and global ordering metadata as soon as they arrive from the network without waiting for the txn to be safe to execute. However, the global ordering metadata that makes a txn safe to execute must still be persisted before the txn is executed for it to be replayable.
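Here is a rough sketch of what playback under Option C might look like. It assumes a simplified safety rule: a txn is safe to execute once the partition has seen a txn or heartbeat with an equal or greater txn id from every initiator. The record layout and the function name `replay_in_global_order` are hypothetical, and the real ordering protocol has more to it than this.

```python
import heapq


def replay_in_global_order(records, initiators):
    """Rebuild the global txn order from records logged in arrival order.

    Each record is (kind, initiator, txn_id, payload), where kind is
    "txn" or "heartbeat". Assumes txn ids are globally unique.
    """
    last_seen = {i: -1 for i in initiators}  # newest txn id heard per initiator
    pending = []                             # min-heap of (txn_id, payload)
    ordered = []
    for kind, initiator, txn_id, payload in records:
        last_seen[initiator] = max(last_seen[initiator], txn_id)
        if kind == "txn":
            heapq.heappush(pending, (txn_id, payload))
        # Nothing with an id at or below this can still arrive.
        safe_up_to = min(last_seen.values())
        while pending and pending[0][0] <= safe_up_to:
            ordered.append(heapq.heappop(pending))
    # Txns still pending never became safe, so they were never executed
    # before the failure and are not replayed.
    return ordered


arrival_log = [
    ("txn", "A", 10, "x"),
    ("txn", "B", 7, "y"),
    ("heartbeat", "A", 12, None),
    ("txn", "B", 11, "z"),
    ("heartbeat", "B", 15, None),
]
print(replay_in_global_order(arrival_log, ["A", "B"]))
# [(7, 'y'), (10, 'x'), (11, 'z')]
```

Note that the heartbeats have to be in the log for this to work: without the records from initiator B, txn 10 would never be released during playback.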
Option B vs. C becomes a question of how long it takes to persist the metadata vs. how long it takes to persist the invocations themselves. At some point in invocation size there is a tipping point where it is advantageous to persist the invocations as early as possible. This tipping point may be strongly correlated with the size of your battery backed cache and overall IO throughput. My intuition is that the invocations would have to be quite large.
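One way to reason about the tipping point is a toy latency model. It assumes that in Option B the invocation is persisted only after ordering completes, while in Option C the invocation is persisted concurrently with ordering and only the ordering metadata remains to be persisted afterwards. The function names and the millisecond figures are illustrative assumptions, not measurements.

```python
def latency_option_b(t_order, t_persist_invocation):
    """Option B: order first, then persist the invocation."""
    return t_order + t_persist_invocation


def latency_option_c(t_order, t_persist_invocation, t_persist_metadata):
    """Option C: persist the invocation while ordering runs, then the metadata."""
    return max(t_order, t_persist_invocation) + t_persist_metadata


def c_beats_b(t_order, t_persist_invocation, t_persist_metadata):
    return (latency_option_c(t_order, t_persist_invocation, t_persist_metadata)
            < latency_option_b(t_order, t_persist_invocation))


# Hypothetical timings in milliseconds.
print(c_beats_b(t_order=5.0, t_persist_invocation=0.1, t_persist_metadata=0.2))  # False
print(c_beats_b(t_order=5.0, t_persist_invocation=3.0, t_persist_metadata=0.2))  # True
```

In this model C wins exactly when the smaller of the ordering time and the invocation persist time exceeds the metadata persist time, which fits the intuition that invocations have to be fairly large before C pays off.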
Of the three options I think that C is the most attractive. It is most likely to reuse the existing hammers for introducing and executing txns without introducing new variations on existing code. The processing of logging messages as they come in isn't particularly interesting or unique to VoltDB, so I won't go into it any further until I have finished an implementation.
So what kind of penalty will VoltDB pay for WAL? It won't get slower as long as you don't run out of IO since the disk writes are sequential. Log maintenance will be pretty simple since everything is already serialized and ready to persist. If logging is synchronous there will be a latency penalty. Currently latency is in the single digit millis. This is a result of the global ordering process that trades some latency for scalability and throughput. I won't guess how much latency synchronous WAL might introduce without an implementation.
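For reference, the synchronous part boils down to forcing each record to stable storage before acknowledging the client. The sketch below uses a hypothetical length-prefixed record format and is not how VoltDB serializes anything. The `os.fsync` call is where the latency penalty would show up.

```python
import os
import struct
import tempfile


def append_sync(log_file, payload: bytes):
    """Append one length-prefixed record and force it to disk before returning."""
    log_file.write(struct.pack(">I", len(payload)))
    log_file.write(payload)
    log_file.flush()              # push Python's buffer to the OS
    os.fsync(log_file.fileno())   # ask the OS to push it to the device


def read_all(path):
    """Read back every complete record, ignoring a torn write at the tail."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            body = f.read(length)
            if len(body) < length:
                break
            records.append(body)
    return records


path = os.path.join(tempfile.mkdtemp(), "commit.log")
with open(path, "ab") as f:
    append_sync(f, b"txn-1")  # only after this returns is it safe to respond
    append_sync(f, b"txn-2")
print(read_all(path))  # [b'txn-1', b'txn-2']
```

An asynchronous variant would skip the `fsync` on the response path and sync in the background, trading durability of the most recent txns for lower latency.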
Don't expect to see WAL in a production VoltDB release anytime soon. It isn't on the schedule right now. This is just a weekend(s) project that may or may not turn into a production-ready feature.
Statements and opinions presented here are not those of VoltDB Inc. unless specifically presented as such.