I was looking into how to avoid partial page writes with Linux's buffered IO for VoltDB command logging. I noticed there is still some read IO, which I suspect is readahead triggered by partial page writes. I'll wager the penalty is small, because both Linux and the disk drive will combine readahead with writes, but it is added noise.
Along the way I found some historical gems.
From: Linus Torvalds <firstname.lastname@example.org>
Subject: Re: O_DIRECT performance impact on 2.4.18 (was: Re: [PATCH] 2.5.14
Date: Sat, 11 May 2002 18:06:36 GMT
On Fri, 10 May 2002, Gerrit Huizenga wrote:
> In message <Pine.LNX.email@example.com>, Linus Torvalds writes:
> > For O_DIRECT to be a win, you need to make it asynchronous.
> O_DIRECT is especially useful for applications which maintain their
> own cache, e.g. a database. And adding Async to it is an even bigger
> bonus (another Oracleism we did in PTX).
The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances [*].
It's simply not very pretty, and it doesn't perform very well either
because of the bad interfaces (where synchronicity of read/write is part
of it, but the inherent page-table-walking is another issue).
I bet you could get _better_ performance more cleanly by splitting up the
actual IO generation and the "user-space mapping" thing sanely. For
example, if you want to do an O_DIRECT read into a buffer, there is no
reason why it shouldn't be done in two phases:
(1) readahead: allocate pages, and start the IO asynchronously
(2) mmap the file with a MAP_UNCACHED flag, which causes read-faults to
"steal" the page from the page cache and make it private to the
mapping on page faults.
If you split it up like that, you can do much more interesting things than
O_DIRECT can do (ie the above is inherently asynchronous - we'll wait only
for IO to complete when the page is actually faulted in).
For O_DIRECT writes, you split it the other way around:
(1) mwrite() takes the pages in the memory area, and moves them into the
page cache, removing the page from the page table (and only copies
if existing pages already exist)
(2) fdatasync_area(fd, offset, len)
Again, the above is likely to be a lot more efficient _and_ can do things
that O_DIRECT only dreams of.
With my suggested _sane_ interface, I can do a noncached file copy that
should be "perfect" even in the face of memory pressure by simply doing
addr = mmap( .. MAP_UNCACHED .. src .. )
mwrite(dst, addr, len);
which does true zero-copy (and, since mwrite removes it from the page
table anyway, you can actually avoid even the TLB overhead trivially: if
mwrite notices that the page isn't mapped, it will just take it directly
from the page cache).
Sadly, database people don't seem to have any understanding of good taste,
and various OS people end up usually just saying "Yes, Mr Oracle, I'll
open up any orifice I have for your pleasure".
[*] In other words, it's an Oracleism.
From: Linus Torvalds <firstname.lastname@example.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: Fri, 10 Oct 2003 17:06:46 GMT
This got me fist pumping.

On Fri, 10 Oct 2003, Chris Friesen wrote:
> How does this play with massive (ie hundreds or thousands of gigabytes)
> databases? Surely you can't expect to put it all in memory?
Hey, I'm a big believer in mass market.
Which means that I think odd-ball users will have to use odd-ball
databases, and pay through the nose for them. That's fine. But those db's
are going to be very rare.
Your arguments are all the same stuff that made PC's "irrelevant" 15 years ago.
I'm not saying in-memory is here tomorrow. I'm just saying that anybody
who isn't looking at it for the mass market _will_ be steamrolled over
when they arrive.
If you were a company, which market would you prefer: the high-end 0.1% or
the rest? Yes, you can charge a _lot_ more for the high-end side, but you
will eternally live in the knowledge that your customers are slowly moving
to the "low end" - simply because it gets more capable.
And the thing is, the economics of the 99% means that that is the one that
sees all the real improvements. That's the one that will have the nice
admin tools, and the cottage industry that builds up around it.
On Fri, 10 Oct 2003, Joel Becker wrote:
> > I hope disk-based databases die off quickly.
> As opposed to what? Not a challenge, just interested in what
> you think they should be.
I'm hoping in-memory databases will just kill off the current crop totally.
That solves all the IO problems - the only thing that goes to disk is the
log and the backups, and both go there totally linearly unless the designer
was crazy.
Yeah, I don't follow the db market, but it's just insane to try to keep the
on-disk data in any other format if you've got enough memory. Recovery may
take a long time (reading that whole backup into memory and redoing the log
will be pretty expensive), but replication should handle that trivially.