Sunday, April 15, 2012

JitCask

I started working on JitCask, a pure Java clone of Basho's BitCask, this weekend. One of the main strengths of BitCask's design is its simplicity. I was able to put together an implementation in < 8 hours that supports Put/Get/Delete and crash safe re-opening. My primary motivation was to find out what kind of performance you can get from Linux and SSDs when you aren't held back by several hundred thousand lines of opaque software.

You can read about BitCask's design in this paper written by Justin Sheehy and David Smith.

Sketchy numbers

I haven't done a mixed read/write test with 99 and 99.9 percentile latency recorded which is where a disk based system really has to prove its worth due to locking and contention for IO. In BitCask's case that is locking on the in memory KeyDir structure. More on the KeyDir after these sketchy numbers.

I also can't get any meaningful numbers until I implement compaction which will suck up IO in the background.

My i5 desktop with an SSD (Crucial M4 connected via SATA II) was doing 85k inserts a second with a 64 byte key and a 1.5k value (compressed down from 2k). Due to buffered IO the disk wasn't occupied the entire time and I am not sure if the write process was being blocked due to all the pages it dirtied. I do suspect that it was bottlenecked on the software side, but with that kind of performance it is not worth optimizing since compaction will need capacity as well.

For reads I got several hundred thousand reads per second if the working set fit in memory, but when the working set is 4x RAM it falls down to the speed of the SSD which was initially 2200 reads. I am picking keys at random so there is no locality and page caching is performing at its worst.

iostat showed that at 2200 reads the disk was reading 180 megabytes/sec. Disabling read ahead using "sudo blockdev --setra 0 /dev/sda" boosted the number of reads to 5500 per second, but that is still far less than expected. iostat shows only 23 megabytes/sec coming off of the SSD so I am skeptical that everything is working as it should. I also changed the scheduler from CFQ to NOOP and that didn't seem to have any effect. I also used "hdparm -A0 /dev/sda" to disable look-ahead (hardware based read-ahead), but that didn't seem to have an effect.

Evan Jones pointed out that those look like single threaded numbers and sure enough the load generating thread was waiting for the callback result. Once I fixed that I started getting 40k read/sec. CPU utilization is 160% and kswapd is using 13%. iostat reports that I am pulling in 180 megabytes/sec and doing 47k tps.

The final numbers are 40,846 reads/sec for the first million and 42,876 reads/sec for the second million after warmup. That is very close to the advertised 45,000 4 kilobyte random read capacity advertised. You need to take into account that the reads can cross page boundaries and that may have an effect.

The only other parameter worth mentioning is that I left 28 gigabytes of unpartitioned space on the SSD so it would always have free space to work with.

Zipfian and Scrambled Zipfian distribution results

I tried to test with a Zipfian distribution generated by org.apache.commons.math3.distribution.ZipfDistribution, but that turns out to be really slow as in 10s of seconds to generate a single sample when asking for a sample out of 20 million. Evan Jones bailed me out again by pointing me to a generator used by YCSB which generated 1 million integers in 465 milliseconds.

With a scrambled Zipfian distribution and the default Zipfian constant .99 it did the first million reads in 14.36 seconds at 69,637 read/sec and the next million in 12.36 seconds at 80,906  read/sec. The SSD was maxed out in terms of megabytes read and tps.

With a constant of .999 the first million reads took 13.12 seconds. The next million took 10.78, but after a few more million it seemed to start warming up and it got down to 9.5 seconds. I forgot that there might be a  cache warming behavior and haven't run this stuff long enough to let it show.

I switch from a scrambled distribution to a regular zipfian distribution. This simulates temporal locality where keys written or updated more recently are read more often than keys that were written a long time ago. This has much better caching behavior. The first million keys were read in 9.12 seconds, but after the cache warmed up it was reading a million keys in 5 seconds. Another interesting thing to note is that the SSD was not completely saturated. It was only doing 115 megabytes/sec and 30k tps. CPU utilization was only 330% so I am not sure what the bottleneck was.

I switched to JDK 7u3 and added -XX:ParGCCardsPerStrideChunk=4096 -XX:+UseCondCardMark as VM options. CPU utilization still tops out at 330%, but it started reading a million keys in 4.6 seconds. CPU utilization is very steady and GC doesn't seem to be an issue. The SSD still has spare capacity as well.

Next steps

KeyDir memory efficiency

See the BitCask paper top of page 3 for a description of KeyDir.

In BitCask every key is stored in memory along with a pointer to the value on disk. It relies on file system caching to speed up access and avoid performing IOs for every key lookup. This is really handy if you want to embed BitCask in another application that is going to need to allocate large quantities of memory since you don't need to figure out how to best divvy up memory for each component for each workload.

The number of keys that can be stored in memory is the upper limit on how much data you can fit on a node assuming you have a storage system like an SSD that can brute force 45k reads per second. If you have some locality in your data then you can potentially stretch those 45k reads farther if some memory is left for the page cache. Hopefully you will get a gradual performance degradation as you add keys and less memory is available for the page cache until it starts to drop off a cliff when the SSD can't provide the necessary IOPs.

I am using a java.util.TreeMap with byte array keys and values to store KeyDir entries. I opted for TreeMap over concurrent options like java.util.concurrent.ConcurrentHashMap and java.util.concurrent.ConcurrentSkipListMap due to space efficiency. I am pairing it with java.util.concurrent.locks.ReentrantReadWriteLock. For a disk based data structure I think I am getting sufficient throughput. It may be a bottleneck for workloads where everything is in the page cache, but I am optimizing for a workload where JitCask is a secondary store for some other in memory database.

If parallelism becomes an issue I can also partition the KeyDir to split the lock.

I thought about using Trove, but I found the documentation confusing. It doesn't have a byte array to byte array map type and I can't tell if there is an advantage to using one of the custom maps. It is also a hash map and not a tree map and I am not sure what the space efficiency implication is there because the implementation had some weird things like cells containing deleted entries. I know that some of the hash map implementations I have looked at are less space efficient then a tree map, but they are based on chained hashing and not open-address hashing like khash that is used in BitCask.

First some napkin math on how much memory it should take to store a key in an in memory red black tree.

4       -byte length prefix
64     -byte key
21     - byte key dir
25     -byte red black tree node, parent ptr, left child ptr, right child ptr, 1-byte color

So 114 bytes per key. On an 8 gigabyte heap I would expect to store 75 million keys. Let's assume that the allocator is doing power of two allocation so it is really 128 bytes per key and we actually expect to only store 67 million keys.

Here is what actually happened

Inserted 1 million 64 byte keys in 1 seconds
Inserted 2 million 64 byte keys in 3 seconds
Inserted 3 million 64 byte keys in 5 seconds
Inserted 4 million 64 byte keys in 7 seconds
Inserted 5 million 64 byte keys in 9 seconds
Inserted 6 million 64 byte keys in 11 seconds
Inserted 7 million 64 byte keys in 13 seconds
Inserted 8 million 64 byte keys in 15 seconds
Inserted 9 million 64 byte keys in 17 seconds
Inserted 10 million 64 byte keys in 19 seconds
Inserted 11 million 64 byte keys in 21 seconds
Inserted 12 million 64 byte keys in 23 seconds
Inserted 13 million 64 byte keys in 25 seconds
Inserted 14 million 64 byte keys in 27 seconds
Inserted 15 million 64 byte keys in 30 seconds
Inserted 16 million 64 byte keys in 32 seconds
Inserted 17 million 64 byte keys in 34 seconds
Inserted 18 million 64 byte keys in 36 seconds
Inserted 19 million 64 byte keys in 38 seconds
Inserted 20 million 64 byte keys in 40 seconds
Inserted 21 million 64 byte keys in 42 seconds
Inserted 22 million 64 byte keys in 44 seconds
Inserted 23 million 64 byte keys in 46 seconds
Inserted 24 million 64 byte keys in 48 seconds
Inserted 25 million 64 byte keys in 50 seconds
Inserted 26 million 64 byte keys in 52 seconds
Inserted 27 million 64 byte keys in 54 seconds
Inserted 28 million 64 byte keys in 56 seconds
Inserted 29 million 64 byte keys in 58 seconds
Inserted 30 million 64 byte keys in 60 seconds
Inserted 31 million 64 byte keys in 62 seconds
Inserted 32 million 64 byte keys in 64 seconds
Inserted 33 million 64 byte keys in 73 seconds
Inserted 34 million 64 byte keys in 74 seconds
Inserted 35 million 64 byte keys in 84 seconds
Inserted 36 million 64 byte keys in 86 seconds
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe7b7639948, pid=2637, tid=140633070393088
#
# JRE version: 6.0_23-b23
# Java VM: OpenJDK 64-Bit Server VM (20.0-b11 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea6 1.11pre
# Distribution: Ubuntu 11.10, package 6b23~pre11-0ubuntu1.11.10.2
# Problematic frame:
# V  [libjvm.so+0x481948]  instanceKlass::oop_follow_contents(oopDesc*)+0x1e8
#
# An error report file with more information is saved as:
# /home/aweisberg/git/JitCask/hs_err_pid2637.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-6/
#
Ignoring for a moment that it shit the bed halfway through, that is only 36 million keys. This was with -Xmx8g and -Xms8g. The crash happened during a Full GC and the last recorded GC was
2012-04-15T20:28:39.958-0400: 79.204: [Full GC [PSYoungGen: 1699008K->0K(2255104K)] [PSOldGen: 5417596K->5431256K(5592448K)] 7116604K->5431256K(7847552K) [PSPermGen: 7632K->7632K(21248K)], 7.6578630 secs] [Times: user=7.79 sys=0.00, real=7.65 secs]
A 2 gigabyte young generation and only a 5.4 gigabyte old generation.

But... the second time I ran I got past 36 million keys! Throughput dove off a cliff due to GC, but it went to 45 million keys in 168 seconds and after 46 million keys in 221 seconds it had essentially stopped. You can see the full output here. Every GC was a full GC and took 7+ seconds.

For kicks I set -XX:MaxNewSize=256m to try and ensure that the old gen got more space. That was a small improvement. It got to 50 million keys before falling off of a cliff. I also tried -XX:+UseCompressedOops to see what kind of mileage I could get out of that. You can do a similar trick on the native side by throwing away the non-information bearing two-bytes of each pointer. Alas this segfaulted twice in a row at the point it normally falls off of a cliff. The segfault is not in the same place every time. Compressed ordinary object pointers seems to have no noticable effect on space efficiency.

I expected on heap KeyDir storage to be an issue going in, but I was too lazy to write the thing in C++ and deal with all the third party dependencies it takes to make C++ usable. If the space efficiency wasn't enough of a reason not to store data on the Java heap then the pause times are what clinch it. I can't see this getting any better with the larger heaps you would want to use in the real world.

I don't know when/if I will get to solving this. It's depressing to think about writing a compacting allocator and red black tree using Java ByteBuffers. I wish Java and the JVM didn't treat native memory as a second class citizen. JNI is also depressing.

Evan Jones sent me a link to SILT: A Memory-Efficient, High-Performance Key-Value Store. I will give it a read this evening.

Compaction

This is pretty straightforward. Scan old files and remove everything that has been deleted and output the cleaned up version to a new file. Every time a key is moved to the new file update the KeyDir to point to the new location. Once an old file is no longer needed because the KeyDir no longer points to it you delete it.

The only tricky thing to get right is throttling and scheduling the compaction process so that it interleaves well with concurrent puts/gets/deletes. If you are targeting an SSD you can probably do this pretty well by taking advantage of the small seek time and excellent random IO capacity. There is also potential for a lot of fiddly config around the behavior of compaction such as blackout times and triggers for when compaction should begin and what should be compacted.

Durability

Right now when you do a put/remove you don't know when the put/remove has become durable. Ideally callbacks should only be invoked once the put/remove is durable and clients can choose to ignore the callback if they don't care when/if it becomes durable.

Like compaction this is not hard and it is something I have made work for command logging. I will be doing periodic group commit.

Key iteration

It would be nice to at least be able to enumerate the full set of keys. I am still pondering how to do this in a crash safe way that ensures that no key is seen twice while iterating. What happens if there is a concurrent get/put with key iteration?

It seems like I should be able to support range scans since I am storing the keys in a tree instead of a hash. I don't see using a hash as a huge win in memory.

Hint files

See the BitCask paper page 4 for a definition.

Restoring a large (30 gigabyte) key set is not speedy. After much experimentation the best number I got was 121 seconds.

Initially I saw 25 megabytes/sec with 12% CPU utilization. Re-enabling read ahead bumped that to 50 megabytes/sec and 30% CPU utilization. Clearly I needed to add a thread to invoke MappedByteBuffer.load(). I disabled read ahead and had a separate thread invoke MappedbyteByteBuffer.load() before processing each cask.

Oddly enough that didn't seem to help. iostat reported brief read spikes to 250 megabytes/sec, but then it went back down to 25 megabytes/sec. I have 5 gigs of page cache so it seems like loading the entire minicask should work. Reenabling read ahead brought me back to where I was with 30% CPU utilization and 50 megabytes/sec, but I couldn't seem to get any mileage out of preloading the data even though the IO appears to be happening.

Running iostat -m 60 to iron out the spikes shows an average read of 210 megabyte/sec without MappedbyteByteBuffer.load() and a read ahead of 512 blocks. Adding back the load made performance slightly worse with 180 megabytes/sec, or the same if you consider that the sample size is 1.

So to summarize, best performance is read ahead 512 blocks with no MBB.load(), with the MBB.load() performance is slightly worse, and with no read ahead and MBB.load() there is almost no improvement. I think that read ahead is a better choice in this instance, but that is very inconvenient because it requires root permission. In production I would enable read ahead on startup and then disable it once all the sequential reads are done.

When I go to do compaction I will have to see if issuing reads through read instead of mmap allows me to do read ahead from user space. The problem with mmap is that the thread touching each page can only fault 4k at a time if read ahead is disabled. If I can do read ahead from user space using read I can switch to that for loading the database at startup.

Based on what I am seeing loading the raw data I will probably need to parallelize the KeyDir loading if I am loading keys from a hint file.

9 comments:

  1. Interesting work.

    Maybe Alexey's patch for OpenJDK might help - http://blog.ragozin.info/2012/03/secret-hotspot-option-improving-gc.html

    ReplyDelete
  2. BTW didn't LevelDB or BDB work for you? (https://github.com/btu/voldemort/tree/master/contrib/leveldb)

    ReplyDelete
  3. Hi Ashwin,

    The benchmarks I have seen for LevelDB indicate that there is still some work to be done.

    http://www.acunu.com/blogs/andy-twigg/benchmarking-leveldb/
    https://github.com/m1ch1/mapkeeper/wiki/MySQL-vs.-LevelDB

    I know Basho has done some work to improve LevelDB's locking situation but I am not sure if there are benchmarks that deal with largish data sets that show the improvement. I know there are still no bloom filters.

    Consistent performance is far more important than peak throughput to me. Having reads degrade when compaction falls behind is not something I am comfortable with because that also means that compaction will have less IO available to catch up. In general that is something about LSM trees that concerns me.

    If you give up range scans (not completely, it is possible to enumerate keys in order) it is easy to outperform LevelDB, and doing my own implementation gives me the opportunity to do some MVCCish stuff which is important when dealing with data sets that are several times larger then available RAM. Point in time snapshots are something I would like to support.

    BDB Java does not have a usable license. Berkley DB C is not log structured.

    The GC pause times are not bad when the heap has enough free space. It is only when mostly full that they get bad because it is constantly doing Full GCs. The thing about red black trees is that you can have zero fragmentation and no pause times whatsoever and you get a significant space efficiency boost as well. Java was great for RAD and it isn't an order of magnitude worse in terms of space efficiency but a 40% loss in space efficiency is too much IMO.

    I saw that blog post go by a while ago. We don't run Volt with heaps beyond 1 or 2g so GC has never been an issue. Another option I need to try is -XX:+UseCondCardMark.
    https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card

    Ariel

    ReplyDelete
  4. "I know there are still no bloom filters". Was just committed 40 mins ago (from the time of my post).
    http://code.google.com/p/leveldb/source/detail?r=85584d497e7b354853b72f450683d59fcf6b9c5c

    ReplyDelete
    Replies
    1. Very cool. I see that there are a couple of new things since I last checked like write batches. Looks like a lot of work has taken place.

      There are several things that still concern me like how LevelDB handles keeping snapshots across restarts if the handle that is returned is an opaque object. Also write batches can be used to implement group commit, but they are not the same thing. They are close enough until you start aiming for sub-millisecond latency because you have a battery backed controller.

      I will give it another try.

      Delete
  5. So MappedByteBuffer.load() having no effect?

    Yeah about that http://cr.openjdk.java.net/~alanb/7168505/webrev/src/share/classes/java/nio/MappedByteBuffer.java.sdiff.html

    Apparently the compiler elides the loop because it is side effect free!

    ReplyDelete
  6. Tombstone entries are inevitable in any open addressing scheme - a moment's thought will reveal why (when does a reader know to stop probing a chain?).

    ReplyDelete
  7. You should really consider fastutil for a spaceefficient map implementation.

    http://fastutil.dsi.unimi.it/

    ReplyDelete
    Replies
    1. That is an interesting library, thanks for bringing it up. It still doesn't solve the issue of garbage collection. We have a zero fragmentation red black tree implementation in Volt that I could use that gives consistent performance and is only a tiny bit larger. I can probably adapt either to the zero fragmentation approach.

      Delete