Saturday, September 18, 2010

Linux file IO defaults

At one point this week our FE Tim Callaghan ran a snapshot on a k=1 cluster with two nodes that were 70% full. He found that the cluster became unresponsive during the snapshot. Reproducing the problem with top running showed that the nodes were running out of free memory even though the amount of memory allocated to the VoltDB process was not changing. It turned out that the kernel was caching all the writes going to the snapshot files instead of flushing them to disk. The problem is that once memory has been dedicated to caching disk writes it can no longer be used to satisfy regular memory requests until the data is flushed to disk. If the write cache represents all available memory then things start to get sad because there is no way to free memory up, especially when VoltDB continues to write snapshot data as fast as possible.


I wrote the snapshot code as if I were writing to a TCP socket that would exert back-pressure. I assumed that the write calls would block or return 0 rather than exhaust all available resources. That was a mistake.

A little digging revealed that the CentOS defaults allowed the kernel to dedicate up to 40% of the system's total memory to pending writes before writes return 0 or block. Even more bizarre is that the kernel would not begin a background flush of data to disk for 30 seconds or until 20% (can't remember if that percentage is correct, but it is close) of system memory was caching pending writes. On a node with 48 gigabytes of RAM this is insane. That these values are a percentage of total system memory rather than total available memory is also strange.
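
To put numbers on those percentages, here is a small back-of-the-envelope sketch. It takes the description above at face value (thresholds as a percentage of total RAM), and the class and method names are made up for illustration.

```java
final class DirtyMemoryEstimate {
    /** Gigabytes of unflushed writes allowed at the given percentage of total RAM. */
    static double gigabytesAt(double totalGb, int percent) {
        return totalGb * percent / 100.0;
    }

    public static void main(String[] args) {
        double totalGb = 48.0; // the node size mentioned in the post
        // 40% is the blocking threshold described above; 20% is the
        // approximate background-flush threshold the post recalls.
        System.out.printf("blocking threshold:   %.1f GB%n", gigabytesAt(totalGb, 40));
        System.out.printf("background threshold: %.1f GB%n", gigabytesAt(totalGb, 20));
    }
}
```

On the 48 gigabyte node that works out to roughly 19.2 gigabytes of unflushed writes before a writer blocks, and roughly 9.6 gigabytes before background flushing even starts.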


Leaving the disk idle in the absence of IO contention of any kind is an odd choice. Just because my application doesn't call fsync doesn't mean I want the device to twiddle its thumbs and waste precious bandwidth, increasing the time to complete any eventual fsyncs.

The kernel used by CentOS doesn't support specifying the dirty_ratio (percentage of memory cached before write returns 0 or blocks) in bytes or setting it to less than 5% (still 2.4 gigabytes on a 48 gigabyte node). The dirty_background_ratio (percentage of memory cached before the kernel will flush data to disk in the background) can be set to 1, but it can't be specified in bytes. Newer kernels (Ubuntu 10.04) support specifying these values in bytes and allow for pretty much any sane quantity.
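
For anyone who wants to see what a given box is configured with, these settings are exposed as plain text files under /proc/sys/vm on Linux. The sketch below reads them from Java; the class and method names are hypothetical, and the byte-based files (dirty_bytes, dirty_background_bytes) are simply reported as missing on kernels that predate them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class VmDirtySettings {
    /** Reads one numeric tunable (for example "dirty_ratio") from the given directory. */
    static long readTunable(Path dir, String name) throws IOException {
        byte[] raw = Files.readAllBytes(dir.resolve(name));
        return Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
    }

    public static void main(String[] args) throws IOException {
        Path vm = Paths.get("/proc/sys/vm");
        String[] names = {
            "dirty_ratio", "dirty_background_ratio",
            "dirty_bytes", "dirty_background_bytes"
        };
        for (String name : names) {
            if (Files.exists(vm.resolve(name))) {
                System.out.println(name + " = " + readTunable(vm, name));
            } else {
                System.out.println(name + " is not available on this kernel");
            }
        }
    }
}
```

This only inspects the values. Changing them requires root and affects every process on the machine.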

It is bizarre to me that a distro billed as a server distro ships with what seem like desktop defaults.

I didn't spend any time tuning these values to see if they could fix the issue. Asking users to tune the kernel on a system that may have other tasks running is not an option. They also won't do it reliably and having anything break or not perform is not acceptable.

[tangent] I have a similar attitude towards Java GC tuning. Other than specifying -Xmx and -Xms it should just work. Luckily VoltDB stores all mutable persistent data on the native heap so a large Java heap is not required and virtually everything dies in eden. I think we dodged a bullet there.[/tangent]

For now the solution is to call fsync every 10 seconds or every X megabytes written, whichever happens first. My concern is that this may leave the disk idle since the kernel is apparently not going to flush writes in the background. What I want is a sync call that triggers a background flush and doesn't prevent me from continuing to write data. Another option would be tunables for a specific file descriptor. This is all taking place in Java so my options for using native APIs are limited, although there are ways of finding the FD and doing something crazy if the win is compelling enough.
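
A minimal sketch of that stopgap, assuming the snapshot is written through a java.nio FileChannel. This is not the actual VoltDB code: PeriodicSyncWriter and its parameters are hypothetical names, and the byte threshold is left as a constructor argument because the post does not say what X is.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.TimeUnit;

final class PeriodicSyncWriter {
    private final FileChannel channel;
    private final long maxUnsyncedBytes;
    private final long maxUnsyncedNanos;
    private long unsyncedBytes = 0;
    private long lastSyncNanos = System.nanoTime();
    private int syncCount = 0;

    PeriodicSyncWriter(FileChannel channel, long maxUnsyncedBytes, long maxUnsyncedMillis) {
        this.channel = channel;
        this.maxUnsyncedBytes = maxUnsyncedBytes;
        this.maxUnsyncedNanos = TimeUnit.MILLISECONDS.toNanos(maxUnsyncedMillis);
    }

    /** Writes the whole buffer, then syncs if either threshold has been crossed. */
    void write(ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            unsyncedBytes += channel.write(buf);
        }
        long elapsedNanos = System.nanoTime() - lastSyncNanos;
        if (unsyncedBytes >= maxUnsyncedBytes || elapsedNanos >= maxUnsyncedNanos) {
            sync();
        }
    }

    /** Forces file content to the device; force(false) does not require metadata updates to be written. */
    void sync() throws IOException {
        channel.force(false);
        unsyncedBytes = 0;
        lastSyncNanos = System.nanoTime();
        syncCount++;
    }

    int syncCount() {
        return syncCount;
    }
}
```

Something like `new PeriodicSyncWriter(channel, 64L * 1024 * 1024, 10_000)` would sync every 64 megabytes or 10 seconds, where the 64 is only an example value. Note that force blocks the calling thread until the flush finishes, and the time threshold is only checked when write is called.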

The eventual solution is probably going to be to drain half of all available serialized buffers and then sync. The drain process is all in memory so it should only leave the disk idle momentarily while the other half of the available buffers are written to the file. Serialization is so fast that I doubt that this process will starve for available buffers to drain.
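
One way to read that plan, sketched under assumptions: serialized buffers wait in a queue, the writer takes at most half of the pool per pass, writes them, and then syncs, so the other half remains available to the serializer in the meantime. HalfDrainSync, the queue, and the pool-size parameter are illustrative names rather than anything from the VoltDB source, and returning written buffers to the pool is left out.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Queue;

final class HalfDrainSync {
    /**
     * Writes up to half of the buffer pool from the pending queue, then syncs.
     * Returns the number of buffers written in this pass.
     */
    static int drainHalfAndSync(Queue<ByteBuffer> pending, int totalBuffers,
                                FileChannel channel) throws IOException {
        int limit = Math.max(1, totalBuffers / 2);
        int written = 0;
        ByteBuffer buf;
        while (written < limit && (buf = pending.poll()) != null) {
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            written++;
        }
        if (written > 0) {
            channel.force(false); // flush this batch before draining the next one
        }
        return written;
    }
}
```

Capping each pass at half the pool bounds how much dirty data can pile up between syncs to half of the buffer memory, regardless of how the kernel is tuned.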

Statements and opinions presented here are not those of VoltDB Inc. unless specifically presented as such.

3 comments:

  1. We have fun like this with MySQL too -- http://dom.as/2010/11/18/logs-memory-pressure/

  2. sounds like you want O_DIRECT..... no idea how to do that from Java of course :)

  3. As Stewart suggests above, you could give O_DIRECT a try. There's a ticket on the Cassandra JIRA which has a discussion related to this issue (they used JNA to make native calls).

    https://issues.apache.org/jira/browse/CASSANDRA-1470
