mv tuning6
- merged to develop
- code complete February 11, 2014
- development started February 7, 2014
This page discusses a third set of tunings made to basho/leveldb as part of the final Riak 2.0 release preparation.
Leveldb memory allocations in Riak 1.x were static. The allocations included a guess for future allocations that might be necessary if Riak moved additional vnodes (databases) to the current server during runtime. Riak 2.0 includes leveldb's flexcache feature. Flexcache dynamically adjusts memory allocations between vnodes (databases) during execution. Yeah! This allows Riak 2.0 leveldb to use more server memory at any given time for its file cache and block cache, and to adjust to changes in vnode count, up or down, while continuing to maximize server memory use.
Riak 2.0 testing demonstrated that a very large block cache can negatively impact performance under certain loads and hard drive arrays. A very large block cache only became possible with flexcache. The root cause of the performance impact is that the block cache crowds out memory previously available to the operating system page cache.
This tuning branch puts limits on how large the leveldb block cache can grow. The limits are based upon estimating the optimal page cache allocation versus the block cache allocation. The leveldb file cache growth is not limited by the new logic, because a miss against the file cache is more costly than a miss against either the block cache or the page cache.
NOTE: The block cache has an underlying design deficiency. A "fix" would require too much code change and add too much risk at this stage of the Riak 2.0 release. The design deficiency is that the block cache never actively deletes stale blocks. leveldb only releases memory allocated to stale blocks once the entire block cache allocation is filled. The stale blocks can represent data blocks from files that have been erased from the file system and from leveldb's accounting. The stale blocks therefore represent dead memory that the operating system could use for more useful things, such as the page cache. This tuning branch is a hack to mitigate, not solve, the design deficiency.
The leveldb code changes fall into two components. The first component dynamically determines the block cache size. This is the core change. The second component is a series of small changes to track the disk size of all files in the file cache. The disk size of the files in the file cache becomes the estimate used for "optimal page cache allocation" in the first component's changes.
These two files add a new database/vnode tuning parameter: block_cache_threshold. The parameter sets the limit below which the page cache estimate can no longer rob memory from the block cache. The file cache can still rob memory below this limit, but the page cache cannot.
The default value is 16Mbytes. Further testing has suggested that 32Mbytes might be better. But the further testing has not covered as many environment conditions as 16Mbytes … so 16Mbytes wins as of this writing.
The DoubleCache class is the home to the new dynamic block cache size adjustment.
The cache2.h file contains the class declaration changes. The logic requires two numbers: the user-designated limit to page cache adjustment and the current disk size of all files in the file cache. The latter serves as the estimate of "optimal page cache memory". The user-designated limit is passed into DoubleCache as part of the Options structure used in the constructor. The file cache's disk size is maintained via atomic adds and subtracts to the volatile m_SizeCachedFiles member variable.
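The disk size counter can be sketched roughly as follows. This is an illustration only: it swaps in std::atomic where the branch uses atomic adds and subtracts on a volatile member, and the class and method names (CachedFileSizeTracker, AddFileSize, SubtractFileSize, GetSizeCachedFiles) are hypothetical, not the DoubleCache interface.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the disk size tracking described above.
// The real branch keeps a volatile m_SizeCachedFiles inside DoubleCache.
class CachedFileSizeTracker {
 public:
  // Called when a file enters the file cache.
  void AddFileSize(std::uint64_t bytes) {
    size_cached_files_.fetch_add(bytes, std::memory_order_relaxed);
  }

  // Called when a file leaves the file cache.
  void SubtractFileSize(std::uint64_t bytes) {
    std::uint64_t previous =
        size_cached_files_.fetch_sub(bytes, std::memory_order_relaxed);
    // Assumption of this sketch: callers never subtract more than was added.
    assert(previous >= bytes);
    (void)previous;  // silence unused-variable warnings when NDEBUG is set
  }

  // Current total disk size of all files in the file cache.
  std::uint64_t GetSizeCachedFiles() const {
    return size_cached_files_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> size_cached_files_{0};
};
```

The total only feeds an estimate, so this sketch uses relaxed ordering. Whether the real code needs stronger ordering is not something the text above settles.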
cache2.cc contains the DoubleCache::GetCapacity() function where the block cache size logic exists. The block cache size is now a cascade of decisions:
- what is the total memory allocated to this database (vnode)
- subtract what is currently used by the file cache (size of file cache objects in memory)
- if the result is greater than block_cache_threshold, subtract the disk size of all files in the file cache, but only from the portion greater than block_cache_threshold
- if the final number is less than 2Mbytes, use 2Mbytes instead
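The cascade above might look something like the sketch below. It is one reading of the four steps, not the actual DoubleCache::GetCapacity() code. The function name, parameter names, and the handling of a file cache that exceeds the total allocation are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// 2 Mbyte floor taken from the last step of the list above.
constexpr std::uint64_t kMinBlockCache = 2ULL * 1024 * 1024;

// Hypothetical free function; the real logic lives in DoubleCache::GetCapacity().
std::uint64_t SketchBlockCacheCapacity(std::uint64_t total_allocation,
                                       std::uint64_t file_cache_usage,
                                       std::uint64_t cached_files_disk_size,
                                       std::uint64_t block_cache_threshold) {
  // Steps 1 and 2: start from the vnode's total allocation and remove what
  // the file cache objects occupy in memory. Clamping at zero is an
  // assumption of this sketch, made to avoid unsigned underflow.
  std::uint64_t capacity = (total_allocation > file_cache_usage)
                               ? total_allocation - file_cache_usage
                               : 0;

  // Step 3: the page cache estimate (disk size of cached files) may only
  // take from the portion above block_cache_threshold.
  if (capacity > block_cache_threshold) {
    std::uint64_t above_threshold = capacity - block_cache_threshold;
    capacity -= (cached_files_disk_size < above_threshold)
                    ? cached_files_disk_size
                    : above_threshold;
  }

  // Step 4: never report less than the 2 Mbyte floor.
  if (capacity < kMinBlockCache) {
    capacity = kMinBlockCache;
  }

  assert(capacity >= kMinBlockCache);
  return capacity;
}
```

For example, with a 100 Mbyte allocation, 10 Mbytes of file cache objects, 30 Mbytes of cached files on disk, and the 16 Mbyte default threshold, this sketch yields a 60 Mbyte block cache. If the cached files total 500 Mbytes on disk instead, the block cache stops shrinking at the 16 Mbyte threshold.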
db/table_cache.h: class TableCache now carries a reference to the database (vnode) DoubleCache object. The DoubleCache object contains the disk size tracking interface. class TableAndFile is the primary object of the file cache. It now contains storage for tracking the DoubleCache object. This allows the file cache delete operations to update the disk size tracking.
db/table_cache.cc: The total disk size tracking code occurs in this file. TableCache::FindTable() increments the disk size tracking when creating a file cache object. DeleteEntry() decrements the disk size tracking during clean up of a deleted file cache object.
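The pairing of increment on creation and decrement on deletion could be sketched as below. All of the types and functions here (SketchDoubleCache, SketchTableAndFile, SketchFindTable, SketchDeleteEntry) are hypothetical stand-ins. They only illustrate why the file cache object keeps a pointer back to the DoubleCache object: the delete path has nothing else to go on.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the DoubleCache disk size tracking interface.
struct SketchDoubleCache {
  std::atomic<std::uint64_t> size_cached_files{0};
};

// Hypothetical stand-in for TableAndFile: it remembers both its own disk
// size and which cache object to update when it is deleted.
struct SketchTableAndFile {
  std::uint64_t file_size;
  SketchDoubleCache* double_cache;
};

// Mirrors the role of TableCache::FindTable(): creating a file cache
// object adds its disk size to the running total.
SketchTableAndFile* SketchFindTable(SketchDoubleCache* cache,
                                    std::uint64_t file_size) {
  assert(cache != nullptr);
  SketchTableAndFile* entry = new SketchTableAndFile{file_size, cache};
  cache->size_cached_files.fetch_add(file_size);
  return entry;
}

// Mirrors the role of DeleteEntry(): cleanup subtracts the same amount
// through the pointer stored in the entry, then frees the entry.
void SketchDeleteEntry(SketchTableAndFile* entry) {
  entry->double_cache->size_cached_files.fetch_sub(entry->file_size);
  delete entry;
}
```

Storing the size in the entry means the decrement always matches the earlier increment, even if the file on disk has since been erased.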
db/db_impl.cc, db/repair.cc, and tools/sst_scan.cc: changes to include DoubleCache object in the construction parameters of a TableCache.
struct EleveldbOptions' m_TotalMem member was an int. That is 32 bits, but the number it holds is a 64-bit memory size. Changed to size_t.
All other edits are normal stuff to add block_cache_threshold option parsing.
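A small check shows why a 32-bit int is too narrow for a total memory figure. The helper FitsInInt32 and the 8 Gbyte example value are made up for this illustration and are not part of eleveldb.

```cpp
#include <cstdint>
#include <limits>

// Example total memory figure: 8 Gbytes, chosen only for illustration.
constexpr std::uint64_t kEightGiB = 8ULL * 1024 * 1024 * 1024;

// Hypothetical helper: true if the value can be stored in a 32-bit signed int.
constexpr bool FitsInInt32(std::uint64_t value) {
  return value <=
         static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// 8 Gbytes exceeds the largest 32-bit signed value (2^31 - 1), so storing it
// in an int would lose information. The 16 Mbyte threshold default fits easily.
static_assert(!FitsInInt32(kEightGiB), "8 GiB does not fit in a 32-bit int");
static_assert(FitsInInt32(16ULL * 1024 * 1024), "16 MiB fits in a 32-bit int");
```

On a 64-bit build size_t is 64 bits wide, which is what makes it a suitable replacement there.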
Normal changes needed to add block_cache_threshold as an option.