-
Notifications
You must be signed in to change notification settings - Fork 178
mv async close
- merged to master October 17, 2013
- code complete October 14, 2013
- development started October 4, 2013
This branch contains the first of two fixes first released in the 1.4.2-turner branch.
PosixMmapFile, when used asynchronously by the recovery log, has a race condition related to close operations. This causes syslog messages relating to fadvise and fdatasync errors. It also implies the recovery log likely has minor corruption at the file end (which is only critical if the file is actually used in a database recovery).
(It has been suggested that the only corruption in the recovery log might be uninitialized data at the end of the file that the failed ftruncate() call would normally remove. If true, the recovery log processing would simply ignore this trailing content since would not have a valid CRC. The suggestion was based upon code elsewhere that closed a file handle after a file was mapped into memory, and the mapping continued to be available. No research has been made to prove / disprove this suggestion.)
This is the second problem in the PosixMmapFile async code. The first and more critical PosixMmapFile async problem was addressed in a previous patch. That patch corrected PosixMmapFile::UnmapCurrentRegion() to only use async unmapping if the file was a recovery log. The problem that remained (undetected) was that Env::Default()->Schedule() did not place sequential file operations on the same thread. It was therefore possible for an unmap with close operation to process prior to simple unmap operations that precede it.
The PosixMmapFile async code was originally written when Env::Default()->Schedule() had only one destination thread. Therefore the was no possibility of the final write/close operation preceding any intermediate write operations. Riak later created multiple background threads and broke the assumption.
This branch adds logic to maintain a reference count of active objects related to a file. The PosixMmapFile object is one type of "active object". A background BGCloseInfo object is the other. The reference count is malloc'd in memory and works similar to an std::shared_ptr. The difference is that std::shared_ptr is not guaranteed thread safe and this reference counter is. Whatever object decrements the reference counter to zero also closes the file. This guarantees that all pending writes have completed (in any order).
The malloc'd ref_count_ is actually a 2 element uint64_t array. The first element is the reference count. The count is incremented and decremented via atomic operations. The second element is the final file size. It is set upon call to PosixMmapFile::Close(). The file is extended in chunks during write operations to facilitate memory mapped operations. The object that closes the file will also adjust the file size is to match the actual count of bytes used.
Files edited for this fix:
-
leveldb/include/leveldb/atomics.h: this file originated in eleveldb for the 1.3 release. It is also part of other pending changes to the leveldb 2.0 release. It is an already reviewed, production file.
-
leveldb/util/env_posix.cc: logic changes were made to the PosixMmapFile class relating to the use of ref_count_. There was a previous mess of four special purpose background task functions: BGFileCloser(), BGFileCloser2(), BGFileUnmapper(), and BGFileUnmapper2. The mess is now reduced to one, general purpose BGFileUnmapper2 function. The asynchronous close is contained within a new function ReleaseRef() that manages the ref_count_ atomically and also performs the file resize / close when count is zero.