Thursday, December 6, 2012

VoltDB 3.0 Export enhancements

A new feature of VoltDB 3.0 that I haven't talked about much is the next iteration of export that Stefano Santoro and I have been working on.

What is export?

VoltDB was never intended to be a jack-of-all-trades, master-of-none database solution. VoltDB doesn't set out to solve the issues presented by append-only, mostly immutable datasets with distinct hot/cold data, nor does it attempt to address long-running analytic queries. What VoltDB does try to do is integrate with those systems so you can get your data where it needs to be, when it needs to be there, so you can solve your big picture problem.

To facilitate that integration VoltDB allows you to declare export only tables in your DDL and guarantees at least once delivery of every row inserted into those tables to an export connector. If you insert rows into an export table and the transaction rolls back the exported rows are also rolled back. The database hangs on to data provided to the connector until the connector acks it so that the connector can wait for the data to commit at the companion system.

In 2.x VoltDB the export connector made the data available over a socket so it could be sucked down by an external client process. We never supplied a scalable or highly available external client solution, which was something of an impedance mismatch when you consider the reasons people ended up coming to VoltDB in the first place.

Integration options consisted of external clients that convert the data to rolling CSV files, export to Hadoop via Sqoop, or bulk load into Postgres via JDBC thanks to a community contributed driver from chungy.

Another useful feature of export is that it bypasses VoltDB's in-memory limitations by allowing export data to overflow to disk as sequential IO when the companion system is slow or unavailable. It also makes it possible to shut down VoltDB while the companion system is unavailable without losing any export data.

3.0, scalable, highly available, fast

The bandwidth utilization of an external client was really unattractive, and it's pretty clear that we have a glut of CPU capacity for most transaction sizes so for 3.0 we made it possible to host the export client inside the VoltDB server process.

Data no longer has to make it over the wire before it can be exported directly to an external system. 2.x export functionality and clients are still supported, and the path to converting existing clients is fairly straightforward. We are definitely moving away from external export clients, but there is no rush because the development overhead of supporting both is not bad.

Moving export processing into the server gave us scalability and high availability with very little effort. The server is already sharded down to the core level, and we already handle replication and fail-over on a per partition basis. Internally export was already sharded into a data source per partition and was even presented to client applications as a stream per partition which makes updating clients for 3.0 painless.

The threading for export was also changed from a single global thread to a thread per export data source (thread per table per partition) which provides parallelism and prevents blocking disk IO from preventing export from making progress.

The cluster attempts to load balance partition mastership across nodes so we tied the responsibility for exporting data for a given partition to the mastership of that partition. When a row is inserted into an export table it is stored at every replica of the partition, but only the master will create an export decoder to process the data. Once the export decoder reports the row as committed the master forwards the commit point to all the other replicas so the exported data can be discarded.

On failure a replacement export decoder is created at all the promoted replicas and this is where you can see a row delivered twice because an ack from the failed master might have been dropped.
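To make the duplicate delivery concrete, here is a toy model of the ack flow described above. The names (`Replica`, `export_from`) are made up for illustration and this is not VoltDB code; it only assumes that every replica buffers every row and that a promoted replica resumes from the last commit point it heard about.

```python
class Replica:
    """Holds buffered export rows and the last commit point it has heard about."""

    def __init__(self, rows):
        self.rows = list(rows)  # every replica stores every exported row
        self.acked = 0          # rows before this index may be discarded


def export_from(replica, delivered, limit=None):
    """Deliver unacked rows from `replica` to the companion system (`delivered`)."""
    end = len(replica.rows) if limit is None else limit
    for row in replica.rows[replica.acked:end]:
        delivered.append(row)
    replica.acked = end


rows = ["r0", "r1", "r2", "r3"]
master, backup = Replica(rows), Replica(rows)
delivered = []

# The master exports r0..r2, but only the commit point covering r0 and r1
# reaches the other replica before the master fails.
export_from(master, delivered, limit=3)
backup.acked = 2

# The promoted replica resumes from the last commit point it saw.
export_from(backup, delivered)
print(delivered)
```

In this run `r2` reaches the companion system twice and nothing is lost, which is the at-least-once behavior the connector has to tolerate.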

More integration options via JDBC

The enterprise edition of VoltDB comes with a JDBC export client that has been tested with Netezza, MySQL, Postgres, Oracle, and SQLServer. I am hoping to validate with Redshift as well, but I am waiting to hear from Amazon to see if I made it into the beta.

The new driver attempts to use JDBC's bulk loading API if it is supported (it is supported in the tested drivers) and connects to the companion database with a connection per partition so the companion database will be able to use parallelism if available.

We didn't get quite as many integration options in as I wanted, but that leaves something to look forward to in post 3.0 releases. I think HDFS, HBase, and Cassandra drivers are a must. There is even a community user interested in using export to implement MySQL binary replication.


A very brief performance number

On my desktop with a 7.2k 500 gigabyte desktop drive using VoltKV with 1k payloads and a 100% write workload I can export 100 megabytes/sec or so. The export to file client is hex encoding the binary payloads as it exports them so there is a blow up there from the ingest rate which is actually 50k/sec or so.
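The arithmetic behind those two figures, as a rough sketch. It assumes "1k" means 1000 bytes and ignores keys and per-row overhead; the only hard fact used is that hex encoding turns each byte into two characters.

```python
rows_per_sec = 50_000   # ingest rate quoted above
payload_bytes = 1_000   # "1k payloads", taken here as 1000 bytes (assumption)
hex_expansion = 2       # each byte becomes two hex characters

ingest_mb_per_sec = rows_per_sec * payload_bytes / 1e6
export_mb_per_sec = ingest_mb_per_sec * hex_expansion
print(ingest_mb_per_sec, export_mb_per_sec)
```

So roughly 50 megabytes/sec of payload going in becomes roughly 100 megabytes/sec on the way out to the file.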

My intuition is that the limit on export throughput is going to be the ingest rate because the process of exporting a row is very efficient. As soon as you insert the row it is serialized into a buffer, and once the buffer is full (or ages out) it is handed off to the export client.
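A minimal sketch of the "full or aged out" hand-off, to show why the per-row cost is so low. `ExportBuffer`, its parameters, and the periodic `tick` call are hypothetical names for illustration, not the server's actual classes.

```python
import time


class ExportBuffer:
    """Collects serialized rows and hands them off when full or too old."""

    def __init__(self, capacity_bytes, max_age_sec, hand_off, clock=time.monotonic):
        self.capacity_bytes = capacity_bytes
        self.max_age_sec = max_age_sec
        self.hand_off = hand_off  # callable that receives one complete buffer
        self.clock = clock
        self.rows = []
        self.size = 0
        self.oldest = None        # time the first buffered row arrived

    def append(self, row: bytes):
        if not self.rows:
            self.oldest = self.clock()
        self.rows.append(row)
        self.size += len(row)
        if self.size >= self.capacity_bytes:
            self.flush()

    def tick(self):
        """Call periodically so a partly filled buffer still ages out."""
        if self.rows and self.clock() - self.oldest >= self.max_age_sec:
            self.flush()

    def flush(self):
        if self.rows:
            self.hand_off(b"".join(self.rows))
            self.rows, self.size = [], 0


# Drive it with a fake clock so the example is deterministic.
now = [0.0]
handed_off = []
buf = ExportBuffer(8, 1.0, handed_off.append, clock=lambda: now[0])
buf.append(b"aaaa")
buf.append(b"bbbb")  # buffer is full, so it is handed off
buf.append(b"cc")
now[0] = 2.0
buf.tick()           # buffer has aged out, so it is handed off
print(handed_off)
```

The insert path only appends to a list and bumps a counter, so the work per row stays small next to the transaction that produced it.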

This all assumes you are exporting to something that can keep up. I know sequential disk IO scales to 200 megabytes/sec easy, but I am not so sure about JDBC targets.

If the Redshift thing comes through I would like to put together a more interesting benchmark using TPC-C to generate data by exporting ORDERS, ORDER_LINE, and HISTORY.




Wednesday, December 5, 2012

Mischievous multimaps

I noticed a pitfall in multimap (non-unique) indexes that has bitten several users and it doesn't seem to be widely known.


The problem

A multimap index has a linked list for each key. This list can be long depending on the data model. This can make deleting all or some of the values for a key very slow in several cases.

It is possible to perform deletion in linear time if you delete all rows for a given key because the database can walk the list once and free each node. However Volt doesn't implement deletes this way.


A delete statement starts by materializing the list of rows to delete, and then looks up and deletes the row in each index one by one. This results in O(n^2) time complexity for the delete because the entire linked list of rows for the key may have to be walked to find each row.


If you are deleting the rows based off of some other key then there is no fast path for the database to remove all the entries from the index. It is going to have to do a lookup for each row and the time complexity is going to be linear instead of logarithmic. It's also going to be several cache misses each time so the constants are bad as well.


You can get deceptively good performance from this kind of index because range scans and inserts will be logarithmic, and in some cases deletes may be logarithmic if you happen to delete the rows in the right order, but worst case delete performance is still O(n^2).
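The effect of delete order is easy to see by counting how many chain nodes get visited. This is only a model: a Python list stands in for the per-key linked list, and `delete_cost` is a hypothetical helper, not engine code.

```python
def delete_cost(chain, delete_order):
    """Count chain nodes visited when rows are removed one at a time."""
    chain = list(chain)
    visited = 0
    for row in delete_order:
        position = chain.index(row)  # linear walk from the head of the chain
        visited += position + 1
        del chain[position]
    return visited


n = 1000
chain = list(range(n))
best = delete_cost(chain, chain)             # the row is always at the head
worst = delete_cost(chain, reversed(chain))  # the row is always at the tail
print(best, worst)
```

Deleting from the head visits n nodes in total, while deleting from the tail visits n(n+1)/2, which is where the quadratic behavior comes from.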


Lookups will also be deceptively fast (except for exceptionally high cardinalities) because an execution engine can churn through a large list in a handful of milliseconds and the database parallelizes separate transactions. Unless your transaction rates are high enough you won't notice the slowdown until you get to the delete stage where the execution time goes from linear to O(n^2).

The workaround

If you have a problem multimap index that has a high cardinality for some keys you can work around the long chain issue by adding columns to the end of the key to make it unique. That will bring back logarithmic performance because the execution engine will now be doing point lookups.

This won't affect using the index as a multimap because you can still do range scans on the higher order keys. I am assuming a tree index because you should almost always opt for Volt's tree indexes.
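Here is a small sketch of why the wider key helps. A sorted list of `(key, row_id)` tuples stands in for a tree index, integer keys are assumed, and the function names are made up for the example.

```python
import bisect

index = []  # sorted list of (key, row_id) tuples stands in for a tree index


def insert(key, row_id):
    bisect.insort(index, (key, row_id))


def delete(key, row_id):
    """Point lookup: binary search straight to the single matching entry."""
    i = bisect.bisect_left(index, (key, row_id))
    if i < len(index) and index[i] == (key, row_id):
        del index[i]
        return True
    return False


def rows_for(key):
    """Range scan on the leading column, so the index still works as a multimap."""
    lo = bisect.bisect_left(index, (key,))
    hi = bisect.bisect_left(index, (key + 1,))
    return [row_id for _, row_id in index[lo:hi]]


for key, row_id in [(7, 1), (7, 2), (7, 3), (8, 1)]:
    insert(key, row_id)
print(rows_for(7))
print(delete(7, 2), rows_for(7))
```

Every entry is now unique, so a delete is a search for one exact entry rather than a walk down a chain, and a scan on the leading column still returns all rows for a key.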

3.0 supports add/drop index without reloading the data set so you can replace indexes as necessary if you run into this.

Sunday, September 30, 2012

VoltDB 3.x features and performance

3.x is far enough along now that I can start taking some performance numbers and know that this is what you can expect from the finished product. This post only covers features related to transaction initiation improvements.

3.x introduces a number of performance optimizations that help smooth out the performance curve in small clusters and applications where developers choose to use the database synchronously.


All the performance measurements taken for this post were taken using a single measurement per configuration. I had to run a large matrix and I didn't have time to automate or run the tests multiple times to get an average. The 2.x cluster was intentionally set up with NTP misconfigured to demonstrate typical user experience, and it was set up and torn down several times during the testing process so the clock skew varied. The synchronous tests aren't focused on extracting maximum throughput so concurrency was bounded to 25 client threads per server node.


You can click on the graph images to go to the spreadsheet and interactive version of the graph.

Packet pushing limits

The work I have done on 3.x has made me realize that the primary bottleneck for VoltDB is getting packets in and out of each node. You get a fixed number of packets and that number decreases slightly as packet size goes up, but however you divvy up the messaging workload that is the upper bound on how many discrete transactions you can do.

The limit on packets also happens to be disturbingly low, especially in EC2 where getting more than 100-200k packets through a node is optimistic. I hear tell that VMWare is better at this, but that is something I need to test myself.


How you measure is significant, because micro-benchmarks will lie and tell you that you should be able to do much more. Once you put a full application on top and start running traffic from multiple places you will see a very different number. It's tempting to think your app is just not shared nothing enough, but you won't see the CPU utilization or contention metrics to support that.


TCP no delay also makes a huge difference. Leaving it off lets the kernel coalesce small messages (Nagle's algorithm), but that makes it tricky to adequately support latency sensitive synchronous clients. Not using TCP no delay will increase the number of messages you can send over a socket by 2-9x. The win is much bigger proportionally when you are using Xen on EC2.


The most interesting finding I made is that no matter how much work I did to make VoltDB more shared nothing it all seemed to have no effect. Most of those things were slower in single threaded terms (but more scalable and non-blocking) and resulted in higher CPU utilization but no change in actual throughput.

Client affinity

The first feature I want to introduce is client affinity. Client affinity is the ability of a client library to route a transaction to the node containing the master of a partition when the transaction is a write, and to load balance across the master and replicas if the transaction is read. Client affinity can greatly reduce the number of packets the cluster has to move to service each transaction.

Client affinity requires the client library to poll the database for the stored procedure schema and topology of partitions across the cluster so that it knows how to hash the parameters to stored procedure invocations and once hashed where to route the transaction so that it arrives at the correct partition.
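A rough sketch of the routing decision, under stated assumptions: the topology table, node names, and the use of `crc32` as the hash are all placeholders, since the real client has to use the same hash function and topology that the cluster reports.

```python
import itertools
import zlib

# Hypothetical topology a client might learn by polling the cluster.
topology = {
    0: {"master": "node-a", "replicas": ["node-b"]},
    1: {"master": "node-b", "replicas": ["node-c"]},
    2: {"master": "node-c", "replicas": ["node-a"]},
}

# One round-robin counter per partition for balancing reads.
_read_turn = {pid: itertools.count() for pid in topology}


def partition_for(key: bytes) -> int:
    # crc32 is only a stand-in for the database's real hash function
    return zlib.crc32(key) % len(topology)


def route(key: bytes, read_only: bool) -> str:
    """Writes go to the partition master; reads rotate over master and replicas."""
    pid = partition_for(key)
    part = topology[pid]
    if not read_only:
        return part["master"]
    hosts = [part["master"], *part["replicas"]]
    return hosts[next(_read_turn[pid]) % len(hosts)]


print(route(b"user-42", read_only=False))
print([route(b"user-42", read_only=True) for _ in range(4)])
```

The point is that the client does the hashing itself, so the request lands on a node that can execute it without another hop inside the cluster.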

Currently the only client library to implement client affinity is the Java client library. It's a small amount of work to do it in the C++ client library. Idiomatic client libraries like PHP may be a bit more challenging because the API doesn't map well to a clustered database, but it is doable.

The following chart compares the performance of a 1-5 node cc1.4xlarge cluster running a 50/50 read write key value workload with a 30 byte key and 1000 byte payload. Every time a server node is added another client node is added. Synchronous clients are set to have 25 threads and there is a synchronous client per client node.


You can see that routing requests to the right place gives you that nice straight red line because there is no special case at 1 and 2 nodes where most requests are being routed to the correct node making the base case faster.


Client affinity + replication

With replication enabled it is possible to see a gain in throughput for reads because reads for a given key are now being served by two nodes. This required changes on the server to allow requests for local partitions to short circuit the master for the partition and be served directly by the replica. The following chart demonstrates the effect of load balancing reads using client affinity over different mixes of read/write load.



In the read-only workload you can see that client affinity yields a more than 2x improvement. Part of this comes from always routing requests to the correct place, and the other part comes from doubling the number of hosts serving reads via short circuit reads. The improvement continues to be beneficial for the 95/5 workload.

The next chart shows the performance with and without client affinity for a 50/50 read/write workload. The short circuit read optimization doesn't yield as big a benefit here, but it is still beneficial and routing writes directly to the master is still saving performance for 3.x. 2.x (not shown in this graph) on the other hand is peer to peer multi-master and it isn't actually possible to route requests to the right place, so 2.x tends to pay a request routing penalty all the time.


 It's a small lie to say that you can't route to the right place in 2.x. You can route requests to the right place, but that results in worse latency due to the global ordering mechanism relying on constant traffic to propagate ordering information.



Synchronous clients and latency

Another focus for 3.x was making it possible for users to write applications that make synchronous calls to the database without totally destroying performance.

The following chart shows the performance of synchronous clients on the same 1-5 node k=0 cluster running the same key value workload.


2.x shows no performance increases because the clock skew is high and there is very little concurrency presented to the database. 3.x continues to add throughput at a consistent rate, but per node performance is much better with client affinity. The additional latency of misrouting requests is too much at the low level of concurrency being presented to the database.

Command logging

It's not demonstrated here because I didn't set up command logging. Setting up command logging on EC2 is labor intensive and time consuming because you have to prep the ephemeral disks and that can take an hour each time I bring up a cluster.

A key improvement for synchronous command logging is that writes that are blocked waiting for the command log to sync to disk will no longer block reads. This means lower latency for reads which is critical for synchronous clients.

Compared to 2.x

The following chart shows the performance of asynchronous clients for the same 1-5 node k=0 cluster running the 50/50 read/write workload.


3.x is able to perform much better without replication because transactions are routed directly to the node that is going to execute them, resulting in no forwarding. 2.x has a 4 in 5 chance that it is going to have to forward each transaction at 5 nodes.
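The 4 in 5 figure generalizes to (n - 1) / n if you assume requests arrive at a uniformly random node and only one node owns the target partition. A tiny sketch under that assumption:

```python
from fractions import Fraction


def forward_probability(nodes: int) -> Fraction:
    """Chance a request lands on a node that does not own its partition."""
    return Fraction(nodes - 1, nodes)


for n in range(1, 6):
    print(n, forward_probability(n))
```

At one node nothing is forwarded and at two nodes only half the requests are, which is why unrouted traffic looks better in very small clusters than it does as nodes are added.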

The following chart shows performance of asynchronous clients for the same 1-5 node k=1 cluster running the 50/50 read/write workload.


3.x benefits from extra capacity by doing reads at replicas and not having to forward write transactions.

Conclusion

For read-only or 95/5 workloads the performance difference between 2.x and 3.x is going to be dramatic. For the 50/50 workloads it isn't going to be quite as striking, but it is still nothing to sneeze at.

If you are using the database with limited concurrency or are failing to configure NTP correctly you should see a huge improvement in throughput and latency.

All these numbers are on EC2. I imagine the bare metal numbers are going to be different, perhaps strikingly so, but these are more reproducible.

You can see the full set of numbers and graphs in the spreadsheets I created. There is one for the 50/50 RW workload w/o replication, the 50/50 RW workload w/replication, and the write only workload w/replication. The write only sheet also contains the 5 node read-only workload comparison.

Tuesday, September 11, 2012

A fun Valgrind error


==10571== More than 10000000 total errors detected.  I'm not reporting any more.
==10571== Final error counts will be inaccurate.  Go fix your program!
==10571== Rerun with --error-limit=no to disable this cutoff.  Note
==10571== that errors may occur in your program without prior warning from
==10571== Valgrind, because errors are no longer being displayed.

Monday, September 10, 2012

Avoiding RMW with Linux buffered IO

Read-Modify-Write

One of the problems with buffered IO and pre-allocated files is that partial page writes cause the kernel to read the partial page in from disk if it is not already in the page cache and then apply your write to the page. I thought read ahead at the kernel and disk level would mask this to a degree, but I turned out to be wrong.

Some light kernel reading

The question I had was whether doing page aligned (4k) writes with buffered IO would avoid the read-modify-write cycle.

It turns out there are an impressive number of ways to enter into the filesystem layer to write to a file. You have page faults for memory mapped files, asynchronous draining of dirty pages, asynchronous IO, regular buffered IO, and probably a few more variations involving flags you can apply to buffered IO.

In ext4 the function table that provides implementations for the various methods required by the VFS layer ends up with a different set of functions depending on whether you have mounted the filesystem with data=[writeback | journal | ordered]. I don't write much C so it is fun to observe how the other half lives.

There are a lot of calls into the filesystem and back out to the VFS layer where the VFS layer manages incoming system calls and then manages the page cache before/after a write occurs.

O_DIRECT tangent

A lot of databases open files with O_DIRECT which explicitly bypasses the page cache and doesn't return until the write to disk completes.

According to the man page you may have to supply O_DIRECT with aligned writes, but I actually saw some ext4 code for handling unaligned writes for you. The code did print a log message once a day that performance was going to be poor and a comment mentioned that something would be serialized. It is holding a lock so I assume it is referring to serializing access to the file.

I think that it would actually be fine to provide unaligned writes to O_DIRECT for something like a write ahead log where only one thread is going to be using the file and you aren't going to mix O_DIRECT and buffered IO. However that assumes it isn't also going to block kernel threads that need to access the inode and that it is all turned into one disk transaction.

I also saw a comment that O_DIRECT doesn't work with journalling? It might have meant that it doesn't work with data=journal but that data=ordered would still work. I wonder what that means for the durability of a file opened with O_DIRECT when its length is extended.

Back to buffered IO

It looks like the page cache does check each page to see if the page is completely overwritten and avoids issuing reads. Assuming that there isn't some other boneheaded code issuing reads it should work.
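As a way to reason about which writes are at risk, this sketch lists the pages a write only partly covers. It assumes 4k pages and only looks at alignment; whether the kernel really issues a read also depends on whether the page is already cached, which the function does not model. `pages_needing_read` is a made-up helper name.

```python
PAGE = 4096


def pages_needing_read(offset, length, page=PAGE):
    """Return the page indexes that a write only partly covers."""
    if length <= 0:
        return []
    end = offset + length
    partial = set()
    if offset % page:           # write starts in the middle of a page
        partial.add(offset // page)
    if end % page:              # write stops in the middle of a page
        partial.add((end - 1) // page)
    return sorted(partial)


print(pages_needing_read(0, 4096))     # aligned single page
print(pages_needing_read(4096, 8192))  # aligned, two pages
print(pages_needing_read(4000, 200))   # straddles a page boundary
```

A writer that pads each write out to a multiple of the page size and starts on a page boundary always gets an empty list back, which is the behavior the page cache check rewards.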

Experimental validation

My desktop is still using ext3 so that is what the following tests used.

I created a quick test program to validate the desired behavior. If I run the program and let it fill the file and then drop the page cache it lowers write throughput from 100 megabytes/sec to potato as Linux's lovely buffered IO eats up every last free meg of RAM buffering pending writes. There is a small amount of read traffic which drops to zero when I run out of RAM. No crashes despite the fact that I don't run with swap.

If I change the write size to 4k (page aligned) write throughput only drops to 60 megabytes/sec and then picks up again once everything is back in the cache. Kind of a disappointing outcome, but I may have dropped other necessary data.

I switched to using a much larger 16 gigabyte file (12 gigs of RAM total) instead of dropping the page cache in order to avoid dropping the wrong data. That actually worked as expected and I saw 100 megabytes/sec of write throughput.

I went back to partial page writes with the 16 gig file to see if it performed better when I wasn't dropping the page cache. That ended up bizarrely ping ponging between read traffic and write traffic with very little throughput. The only plus was that my desktop didn't become unresponsive. The transaction rate under this load was very high for a spinning disk at a constant 4,500 TPS.

I echoed 1 to /proc/sys/vm/drop_caches so it isn't clear to me why that would have unexpected side effects. Other values like 2 or 3 drop additional filesystem metadata, but 1 should only drop the contents of files.

Changes to VoltDB

Based on this I really need to fix up command logging so it does page aligned writes. I had no idea that pre-allocating fell off a cliff like this if stuff isn't in the cache. I think it may explain some reports from the field.

That leaves three must have changes in the next release
  • Better replica placement (see the previous post) Done
  • Page aligned command log writes Done
  • Self tuning command log segment pool Done
Sadly I can only subcontract one of those.

Wednesday, September 5, 2012

10-gig E per socket limits and replication

Note that for this post socket always refers to TCP/UDP sockets and not a physical CPU socket.

Chasing 10-gig E with Cluster Compute instances

Most of my testing has been on EC2 cluster compute instances which I ragged on a bit in the past. After playing with them a bit more I am warming up to using them for performance work. I am not in love with the price for a full time cluster, but it is low enough that I can get access to 10-gig E clusters when I need them without having to have a bunch of highly contended hardware in the office.

They are definitely 50% slower, but they seem to bottleneck in the same places and once the bottleneck is removed they improve the same amount percentage wise as bare metal.

I have been sticking with the Amazon AMI based on the micro-benchmarks I did that showed it to be faster than an Ubuntu 12.04 AMI at sending small packets between socket pairs. I am using TCP no delay all the time now to get a consistent baseline in terms of settings. I have found that adding more sockets and threads doing network IO nets you decent scale up in terms of number of packets which is good because the single socket performance is terrible. 

When I micro-benchmarked UDP on bare metal I actually didn't get any scale out beyond two sockets, but that was doing 900k packets/sec. TCP without no delay still wins at 1.2 million messages/sec.

What does 10-gig E take?

After much experimentation I think I finally understand what it takes to get good utilization out of 10-gig E. My initial thinking with 10-gig E was that I would be able to saturate it using a single socket if only I could make network code efficient enough.

I have done micro-benchmarks using send/receive pairs that show that you can at least saturate one direction using a single thread so it seemed like it might be possible. Martin Thompson points out you can use Unsafe to do message processing that is 4x faster. If I could use ByteBuffers with bounds checking for debug builds and have that automatically processed into Unsafe calls for production I would consider it. ByteBuffer has identified way too many bugs for me to give up on it and we do make use of views.

I am sure I could trim some fat and defer more serialization out of the network thread, but scaling out seems more future proof.

Uni-directional BIO w/NIO socket pairs

This thinking ended with separate sockets for send/receive and a dedicated thread doing blocking IO on each socket. Java BIO doesn't let you block for send/receive in separate threads, and Java NIO with separate threads doing blocking IO for send/receive turned out to be 3x slower which is why it actually took separate sockets. By having separate threads I could put a non-blocking queue in front of the send socket and the receive thread could place all incoming messages in non-blocking queues. The use of non-blocking queues ensures that I can see as much throughput as possible out of the single receive thread.

Non-blocking initiators w/thread per initiator

This seemed to remove quite a bit of red (blocking) from the profile and reduced CPU utilization, but didn't result in increased performance. It did require me to rewrite the IV2 initiator code to have a dedicated thread per initiator (2 threads per execution site). That was the easiest way to present the receive thread with a non-blocking queue instead of a critical section surrounding the initiator.

I think it is actually possible to fold the initiator into the execution site thread and remove the extra thread, but I don't see that as necessary long term. If the initiator thread is paired with its execution site on the same core they can run concurrently via hyper-threading which could have interesting throughput/latency implications. The initiator thread could do the network IO using direct sockets to client/replicas while the execution site starts the next transaction.

If the two threads were scheduled together you would have parallel execution provided by the CPU so the latency impact could be bounded and they would share the same L1 cache. If you could do the socket IO for replication from the initiator thread you would be sending and receiving tasks from the correct cache set and the kernel might even be able to deliver incoming messages to the correct cache.

If only Java exposed a thread binding API. It's a strange thing not to have in what is one of the best languages and runtimes for concurrency. As far as I can tell it is not even on the table for Java 8.

NIO with bi-directional socket sets

The next experiment was creating two sockets between servers. For this experiment I switched back to bi-directional sockets using NIO with selectors and a dedicated thread + selector per socket. I honestly couldn't see a performance difference between this and a blocking thread implementation despite the fact that NIO + selectors requires medium sized critical sections spanning several monitors for sending messages. Those critical sections do show up on profiles but it is still hard to cost model.

I modulo hashed the destination site ID to pick which socket I was going to use when sending messages. This preserved the TCP ordering guarantee between any two sites. This turned out a 2x performance improvement for the K/V application.
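The socket selection itself is tiny. In this sketch the socket labels and site IDs are placeholders; the only property that matters is that a given destination site always maps to the same socket.

```python
def socket_for(dest_site_id, sockets):
    """Pick the socket for a destination site; same site, same socket, every time."""
    return sockets[dest_site_id % len(sockets)]


sockets = ["sock-0", "sock-1"]  # placeholder labels for two connections to one host
for site_id in [10, 11, 12, 13]:
    print(site_id, socket_for(site_id, sockets))
```

Because all messages for one site share one TCP stream they stay in order, while different sites on the same host can spread across both sockets.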

One socket not enough

At this point I think I have established that the socket used for replication is a bottleneck. These tests were all 2 node k=1 so it is a worst case scenario. It definitely looks like there will need to be special case code for very small clusters that use replication (ugh!). I doubt that stopping at two sockets is going to be sufficient, but I haven't taken it any further.

So scale out....!?

At this point you would think it would get better as you move up to larger cluster sizes. That would be true if the replica placement strategy didn't result in every other cluster size creating exact node replicas. The result is that throughput is all over the map in small clusters.

Since the replica partitions for node A are all at node B all replication traffic ends up going over the single socket to node B. Those clusters where nodes share replicas with two other nodes end up using two sockets and are 2x faster if workload is bound on replication messaging. The next size up might be the same or even slightly slower because replication ends up using a single socket.
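One way to see the difference between placements is to count how many distinct nodes each node replicates to. The two layouts below are invented for a 4 node, 8 partition, k=1 cluster and are not taken from the actual placement code; `replication_peers` is a hypothetical helper.

```python
def replication_peers(placement):
    """placement maps partition id -> [master node, replica node, ...]."""
    peers = {}
    for nodes in placement.values():
        master, replicas = nodes[0], nodes[1:]
        peers.setdefault(master, set()).update(replicas)
        for replica in replicas:
            peers.setdefault(replica, set())
    return {node: len(targets) for node, targets in sorted(peers.items())}


nodes = ["A", "B", "C", "D"]

# Exact node replicas: A and B mirror each other, and so do C and D.
paired = {p: [nodes[p // 2], nodes[(p // 2) ^ 1]] for p in range(8)}

# Each node alternates between its two ring neighbors for replicas.
spread = {
    p: [nodes[p // 2], nodes[(p // 2 + (1 if p % 2 == 0 else -1)) % 4]]
    for p in range(8)
}

print(replication_peers(paired))
print(replication_peers(spread))
```

In the paired layout every node pushes all of its replication traffic to a single peer, and so over a single socket, while the spread layout gives every node two peers with the same number of masters and replicas per node.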

This is less noticeable in the existing system because the high pole in the tent is elsewhere so you get a nice straight(er) line and most people have been testing with gig-E. Not a good thing, but it makes people happy to see a straight line.

This is not a factor in the existing system that produces a global order because replication is handled by initiators and there is always one initiator per node. The distribution of replication messaging overhead is guaranteed to be as even as the distribution of initiators which is perfect. There is no client affinity because that would cause the global ordering process to fall back to heartbeats and latency.

The upside is that it is maybe 20 lines of code to fix and it is low risk.

Feature bounty!

Replica placement has been fixed so the bounty is no longer available

I am offering a $30 gift certificate to Outback (or the restaurant of your choice) to the person who fixes the replica placement algorithm. You have to come up with an optimal implementation that will cause nodes to replicate to the largest possible number of other nodes. The distribution has to be even because the cluster will only operate at the speed of the slowest node so it won't help to have some nodes replicating to multiple nodes while one node replicates to only one other node. You also can't break the even distribution of partition mastership at startup.

As a stretch goal you can make it possible to select a target number of nodes to replicate with. It would be interesting to see if there is a performance impact if you involve too many nodes with replication. Completing the stretch goal will tack an extra $10 on to your gift certificate.

You will also need to sign the contributor license agreement for us to accept the code. I will update this entry if someone wins the bounty so you won't waste your time.

Tuesday, August 14, 2012

Great LKML thread from 2k3


I was looking into how to avoid partial page writes with Linux's buffered IO for VoltDB command logging. I noticed that there is still some read IO which I suspect is read ahead for partial page writes. I'll wager the penalty is small because both Linux and the disk drive will do read ahead combined with writes, but it is added noise.

Along the way I found some historical gems.

http://yarchive.net/comp/linux/o_direct.html

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: O_DIRECT performance impact on 2.4.18 (was: Re: [PATCH] 2.5.14
Original-Message-ID: <Pine.LNX.4.44.0205111047280.2355-100000@home.transmeta.com>
Date: Sat, 11 May 2002 18:06:36 GMT
Message-ID: <fa.m6uaeiv.150c98r@ifi.uio.no>

On Fri, 10 May 2002, Gerrit Huizenga wrote:
> In message <Pine.LNX.4.44.0205100854370.2230-100000@home.transmeta.com>, Linus Torvalds writes:
> >
> > For O_DIRECT to be a win, you need to make it asynchronous.
>
> O_DIRECT is especially useful for applications which maintain their
> own cache, e.g. a database. And adding Async to it is an even bigger
> bonus (another Oracleism we did in PTX).

The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances [*].

It's simply not very pretty, and it doesn't perform very well either
because of the bad interfaces (where synchronicity of read/write is part
of it, but the inherent page-table-walking is another issue).

I bet you could get _better_ performance more cleanly by splitting up the
actual IO generation and the "user-space mapping" thing sanely. For
example, if you want to do an O_DIRECT read into a buffer, there is no
reason why it shouldn't be done in two phases:

 (1) readahead: allocate pages, and start the IO asynchronously
 (2) mmap the file with a MAP_UNCACHED flag, which causes read-faults to
     "steal" the page from the page cache and make it private to the
     mapping on page faults.

If you split it up like that, you can do much more interesting things than
O_DIRECT can do (ie the above is inherently asynchronous - we'll wait only
for IO to complete when the page is actually faulted in).

For O_DIRECT writes, you split it the other way around:

 (1) mwrite() takes the pages in the memory area, and moves them into the
     page cache, removing the page from the page table (and only copies
     if existing pages already exist)
 (2) fdatasync_area(fd, offset, len)

Again, the above is likely to be a lot more efficient _and_ can do things
that O_DIRECT only dreams on.

With my suggested _sane_ interface, I can do a noncached file copy that
should be "perfect" even in the face of memory pressure by simply doing

    addr = mmap( .. MAP_UNCACHED .. src .. )
    mwrite(dst, addr, len);

which does true zero-copy (and, since mwrite removes it from the page
table anyway, you can actually avoid even the TLB overhead trivially: if
mwrite notices that the page isn't mapped, it will just take it directly
from the page cache).

Sadly, database people don't seem to have any understanding of good taste,
and various OS people end up usually just saying "Yes, Mr Oracle, I'll
open up any orifice I have for your pleasure".

Linus

[*] In other words, it's an Oracleism.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Original-Message-ID: <Pine.LNX.4.44.0310100958030.20420-100000@home.osdl.org>
Date: Fri, 10 Oct 2003 17:06:46 GMT
Message-ID: <fa.kh23j86.1ajelgg@ifi.uio.no>


On Fri, 10 Oct 2003, Chris Friesen wrote:
>
> How does this play with massive (ie hundreds or thousands of gigabytes)
> databases? Surely you can't expect to put it all in memory?
Hey, I'm a big believer in mass market.
Which means that I think odd-ball users will have to use odd-ball
databases, and pay through the nose for them. That's fine. But those db's
are doing to be very rare.
Your arguments are all the same stuff that made PC's "irrelevant" 15 years
ago.
I'm not saying in-memory is here tomorrow. I'm just saying that anybody
who isn't looking at it for the mass market _will_ be steamrolled over
when they arrive.
If you were a company, which market would you prefer: the high-end 0.1% or
the rest? Yes, you can charge a _lot_ more for the high-end side, but you
will eternally live in the knowledge that your customers are slowly moving
to the "low end" - simply because it gets more capable.
And the thing is, the economics of the 99% means that that is the one that
sees all the real improvements. That's the one that will have the nice
admin tools, and the cottage industry that builds up around it.
Linus

This got me fist pumping:

On Fri, 10 Oct 2003, Joel Becker wrote:
> > I hope disk-based databases die off quickly.
>
> As opposed to what? Not a challenge, just interested in what
> you think they should be.
I'm hoping in-memory databases will just kill off the current crop
totally. That solves all the IO problems - the only thing that goes to
disk is the log and the backups, and both go there totally linearly unless
the designer was crazy.
Yeah, I don't follow the db market, but it's just insane to try to keep
the on-disk data in any other format if you've got enough memory.
Recovery may take a long time (reading that whole backup into memory and
redoing the log will be pretty expensive), but replication should handle
that trivially.
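The recovery path described in that last paragraph (load the backup, then redo the log) can be sketched in a few lines of Python. The record format, a tuple of operation, key and value, is invented for this example and is not any real database's log format.

```python
def recover(snapshot, log):
    """Rebuild in-memory state from a backup plus the log written after it.

    snapshot is a dict holding the state at backup time. log is an ordered
    list of hypothetical (op, key, value) records, replayed oldest first.
    """
    state = dict(snapshot)  # copy so the backup itself is left untouched
    for op, key, value in log:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError("unknown log operation: %r" % (op,))
    return state


backup = {"a": 1, "b": 2}
log = [("put", "c", 3), ("delete", "a", None), ("put", "b", 20)]
print(recover(backup, log))
```

Both inputs are read strictly in order, which is the linear IO pattern the email is talking about. The cost is that recovery time grows with the size of the backup and the length of the log.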