What is export?
VoltDB was never intended to be a jack of all trades master of none database solution. VoltDB doesn't set out to solve the issues presented by append only mostly immutable datasets with distinct hot/cold data nor does it attempt to address long running analytic queries. What VoltDB does try to do is integrate with those systems so you can get your data where it needs to be when it needs to be there so you can solve your big picture problem.
To facilitate that integration VoltDB allows you to declare export only tables in your DDL and guarantees at least once delivery of every row inserted into those tables to an export connector. If you insert rows into an export table and the transaction rolls back the exported rows are also rolled back. The database hangs on to data provided to the connector until the connector acks it so that the connector can wait for the data to commit at the companion system.
In 2.x VoltDB the export connector made the data available over a socket so it could be sucked down by an external client process. We never supplied a scalable or highly available external client solution which was something of an impedance mismatch when you consider the reasons people ended coming to VoltDB in the first place.
Integration options consisted of external clients that convert the data to rolling CSV files, export to Hadoop via Sqoop, or bulk load into Postgres via JDBC thanks to a community contributed driver from chungy.
Another useful feature of export is that it bypasses VoltDB's in memory limitations by allowing export data to overflow to disk as sequential IO when the companion system is slow or unavailable. It also makes it possible to shutdown VoltDB while the companion system is unavailable without losing any export data.
3.0, scalable, highly available, fast
The bandwidth utilization of an external client was really unattractive, and it's pretty clear that we have a glut of CPU capacity for most transaction sizes so for 3.0 we made it possible to host the export client inside the VoltDB server process.
Data no longer has to make it over the wire before it can be exported directly to an external system. 2.x export functionality and clients are still supported, and the path to converting existing clients is fairly straightforward. We are definitely moving away from external export clients, but there is no rush because the development overhead of supporting both is not bad.
Moving export processing into the server gave us scalability and high availability with very little effort. The server is already sharded down to the core level, and we already handle replication and fail-over on a per partition basis. Internally export was already sharded into a data source per partition and was even presented to client applications as a stream per partition which makes updating clients for 3.0 painless.
The threading for export was also changed from a single global thread to a thread per export data source (thread per table per partition) which provides parallelism and prevents blocking disk IO from preventing export from making progress.
The threading for export was also changed from a single global thread to a thread per export data source (thread per table per partition) which provides parallelism and prevents blocking disk IO from preventing export from making progress.
The cluster attempts to load balance partition mastership across nodes so we tied the responsibility for exporting data for a given partition to the mastership of that partition. When a row is inserted into an export table it is stored at every replica of the partition, but only the master will create an export decoder to process the data. Once the export decoder reports the row as committed the master forwards the commit point to all the other replicas so the exported data can be discarded.
On failure a replacement export decoder is created at all the promoted replicas and this is where you can see a row delivered twice because an ack from the failed master might have been dropped.
More integration options via JDBC
The enterprise edition of VoltDB comes with a JDBC export client that has been tested with Netezza, MySQL, Postgres, Oracle, and SQLServer. I am hoping to validate with Redshift as well, but I am waiting to hear from Amazon to see if I made it into the beta.
The new driver attempts to use JDBC's bulk loading API if it is supported (it is supported in the tested drivers) and connects to the companion database with a connection per partition so the companion database will be able to use parallelism if available.
We didn't get quite as many integration options in as I wanted, but that leaves something to look forward to in post 3.0 releases. I think HDFS, HBase, and Cassandra drivers are a must. There is even a community user interested in using export to implement MySQL binary replication.
A very brief performance number
On my desktop with a 7.2k 500 gigabyte desktop drive using VoltKV with 1k payloads and a 100% write workload I can export 100 megabytes/sec or so. The export to file client is hex encoding the binary payloads as it exports them so there is a blow up there from the ingest rate which is actually 50k/sec or so.
My intuition is that the limit on export throughput is going to be the ingest rate because the process of exporting a row is very efficient. As soon as you insert the row it is serialized into a buffer, and once the buffer is full (or ages out) it is handed off to the export client.
This all assumes you are exporting to something that can keep up. I know sequential disk IO scales to 200 megabytes/sec easy, but I am not so sure about JDBC targets.
If the Redshift thing comes through I would like to put together a more interesting benchmark using TPC-C to generate data by exporting ORDERS, ORDER_LINE, and HISTORY.
If the Redshift thing comes through I would like to put together a more interesting benchmark using TPC-C to generate data by exporting ORDERS, ORDER_LINE, and HISTORY.