The next few posts are going to be quick brain dumps. I don't seem to be spending the time I should writing my blog ideas down.
In 1.2 we allocated all memory for indexes and out-of-band strings on the libc heap. Tuples in VoltDB are fixed size. Variable length fields like strings are stored indirectly on the heap if they can be more than 63 bytes according to the schema. If the schema declares the string to be < 63 bytes in length then space is allocated for all 63 bytes in the tuple even if the actual string is smaller. Each table allocates 2 megabyte blocks and fills them with tuples. When a tuple is deleted there is a per table free list of pointers to free tuples.
There were two issues with this approach. The first was fragmentation which was horrendous on Linux and non-existent on OS X. The second was an unresponsive RSS. Deleting data from VoltDB did not return any memory to the operating system making it difficult for users to determine when they are running out of memory.
John Hugg found that the default malloc settings resulted in all memory being allocated by sbrk instead of mmap. We changed the settings so that large allocations would mmaped so that unmapping them would return the memory to the OS. The nice thing about have libc call mmap for you is that it will fall back to sbrk if you run out of mmaps.
John Hugg found that the default malloc settings resulted in all memory being allocated by sbrk instead of mmap. We changed the settings so that large allocations would mmaped so that unmapping them would return the memory to the OS. The nice thing about have libc call mmap for you is that it will fall back to sbrk if you run out of mmaps.
The 1.2.1 release was mostly about addressing memory fragmentation and reporting. The first approach we took was to compact the storage for the tuples themselves. I added some meta-data for each block of tuples and made the free-list per block. The meta-data maintained a load factor for the table and for each block and when the load factor on the table crossed a threshhold the system would stop and merge tuple blocks until the load factor was acceptable. The per block load factor helped with selecting the best candidate blocks for a merge. The advantage of this approach was that a bulk delete that deleted tuples in insertion order would be able to skip the merge process and free an entire block. Moving a tuple is not cheap because you have to update an arbitrary number of indexes every time you move a tuple.
That only put a small dent in RSS and did nothing for fragmentation. We did some memory profiling (alas, not smart enough to start with profiling) and found that the majority of the memory was going towards out of band strings and indexes. I added some memory pooling that eliminated the fragmentation, but did nothing to make the RSS shrink.
John Hugg advocated a different approach we call no-holes allocation that he ended up implementing for the new compacting hash and tree indexes. The no holes approach works for fixed size objects where any hole (created by a deletion) in the allocated memory space can be filled by moving an item from the end of the allocated space to fill the hole. For a tree this means that every time a node is deleted a node from the end of the allocated memory space is moved into its place and all the nodes pointing to the moved node have their pointers updated to point to the new location. Allocation and deletion have constant overhead in this case and blocks of memory are always completely compacted.
That only put a small dent in RSS and did nothing for fragmentation. We did some memory profiling (alas, not smart enough to start with profiling) and found that the majority of the memory was going towards out of band strings and indexes. I added some memory pooling that eliminated the fragmentation, but did nothing to make the RSS shrink.
John Hugg advocated a different approach we call no-holes allocation that he ended up implementing for the new compacting hash and tree indexes. The no holes approach works for fixed size objects where any hole (created by a deletion) in the allocated memory space can be filled by moving an item from the end of the allocated space to fill the hole. For a tree this means that every time a node is deleted a node from the end of the allocated memory space is moved into its place and all the nodes pointing to the moved node have their pointers updated to point to the new location. Allocation and deletion have constant overhead in this case and blocks of memory are always completely compacted.
Michael Ismert added compaction for string memory in 1.3. The approach we took there was to use tombstones that are never moved or compacted. The tuple points to a tombstone and the tombstone points to the variable length objects (VLO) that might be moved as part of a no holes allocation strategy. The VLO also points back to the tombstone.
Tombstones can't be compacted or returned to the OS, but they are usable as tombstones for any kind/size object for the life of the program. Every time a VLO is moved the backpointer to the tombstone is followed and the tombstone is updated. The VLO themselves are stored in a series of fixed size memory pools that use the no holes memory allocation strategy. Every time a VLO is deleted another is moved to take its place and the tombstone is updated.
Tombstones can't be compacted or returned to the OS, but they are usable as tombstones for any kind/size object for the life of the program. Every time a VLO is moved the backpointer to the tombstone is followed and the tombstone is updated. The VLO themselves are stored in a series of fixed size memory pools that use the no holes memory allocation strategy. Every time a VLO is deleted another is moved to take its place and the tombstone is updated.
I think the no-holes strategy is a good one when the number of updates per hole fill is constant. In the case of tuples where there is an opportunity for insertion order deletion and a penalty for updating every index then I think it makes more sense to be lazy. An advantage of being lazy is that you can do the compaction while the system is idle. Currently the system is %15 idle even when running TPC-C full bore and significantly more for a workload like voter. A no holes strategy impacts execution time if the system is idle, but not enough to effect end to end response time. Given that no one is complaining about speed it seems like the thing to optimize for is probably space efficiency and code complexity/reliability.
The no-holes strategy was significantly simpler to implement. Compaction ended being a lot of error prone code that interacted with undo logging and COW snapshots in weird ways. It was definitely hard to get right. It may be that the undo log and COW snapshots wouldn't have interacted conveniently with no-holes allocation either, but I haven't given it enough thought.
If I had to do it again I would start with no-holes allocation, benchmark, and go from there.
The no-holes strategy was significantly simpler to implement. Compaction ended being a lot of error prone code that interacted with undo logging and COW snapshots in weird ways. It was definitely hard to get right. It may be that the undo log and COW snapshots wouldn't have interacted conveniently with no-holes allocation either, but I haven't given it enough thought.
If I had to do it again I would start with no-holes allocation, benchmark, and go from there.
More on export overflow in another post.
No comments:
Post a Comment