Skip to content

Commit

Permalink
Don't hold DB mutex for block cache entry stat scans (facebook#8538)
Browse files Browse the repository at this point in the history
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.

This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.

Pull Request resolved: facebook#8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.

Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
  • Loading branch information
pdillinger authored and facebook-github-bot committed Jul 16, 2021
1 parent 1e5b631 commit df5dc73
Show file tree
Hide file tree
Showing 10 changed files with 248 additions and 93 deletions.
35 changes: 25 additions & 10 deletions cache/cache_entry_stats.h
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,8 @@ namespace ROCKSDB_NAMESPACE {
template <class Stats>
class CacheEntryStatsCollector {
public:
// Gathers stats and saves results into `stats`
// Gather and save stats if saved stats are too old. (Use GetStats() to
// read saved stats.)
//
// Maximum allowed age for a "hit" on saved results is determined by the
// two interval parameters. Both set to 0 forces a re-scan. For example
Expand All @@ -61,10 +62,9 @@ class CacheEntryStatsCollector {
// Justification: scans can vary wildly in duration, e.g. from 0.02 sec
// to as much as 20 seconds, so we want to be able to cap the absolute
// and relative frequency of scans.
void GetStats(Stats *stats, int min_interval_seconds,
int min_interval_factor) {
void CollectStats(int min_interval_seconds, int min_interval_factor) {
// Waits for any pending reader or writer (collector)
std::lock_guard<std::mutex> lock(mutex_);
std::lock_guard<std::mutex> lock(working_mutex_);

uint64_t max_age_micros =
static_cast<uint64_t>(std::max(min_interval_seconds, 0)) * 1000000U;
Expand All @@ -79,19 +79,28 @@ class CacheEntryStatsCollector {
uint64_t start_time_micros = clock_->NowMicros();
if ((start_time_micros - last_end_time_micros_) > max_age_micros) {
last_start_time_micros_ = start_time_micros;
saved_stats_.BeginCollection(cache_, clock_, start_time_micros);
working_stats_.BeginCollection(cache_, clock_, start_time_micros);

cache_->ApplyToAllEntries(saved_stats_.GetEntryCallback(), {});
cache_->ApplyToAllEntries(working_stats_.GetEntryCallback(), {});
TEST_SYNC_POINT_CALLBACK(
"CacheEntryStatsCollector::GetStats:AfterApplyToAllEntries", nullptr);

uint64_t end_time_micros = clock_->NowMicros();
last_end_time_micros_ = end_time_micros;
saved_stats_.EndCollection(cache_, clock_, end_time_micros);
working_stats_.EndCollection(cache_, clock_, end_time_micros);
} else {
saved_stats_.SkippedCollection();
working_stats_.SkippedCollection();
}
// Copy to caller

// Save so that we don't need to wait for an outstanding collection in
// order to make of copy of the last saved stats
std::lock_guard<std::mutex> lock2(saved_mutex_);
saved_stats_ = working_stats_;
}

// Gets saved stats, regardless of age
void GetStats(Stats *stats) {
std::lock_guard<std::mutex> lock(saved_mutex_);
*stats = saved_stats_;
}

Expand Down Expand Up @@ -129,6 +138,7 @@ class CacheEntryStatsCollector {
Cache::Priority::HIGH);
if (!s.ok()) {
assert(h == nullptr);
delete new_ptr;
return s;
}
}
Expand All @@ -145,6 +155,7 @@ class CacheEntryStatsCollector {
private:
explicit CacheEntryStatsCollector(Cache *cache, SystemClock *clock)
: saved_stats_(),
working_stats_(),
last_start_time_micros_(0),
last_end_time_micros_(/*pessimistic*/ 10000000),
cache_(cache),
Expand All @@ -154,10 +165,14 @@ class CacheEntryStatsCollector {
delete static_cast<CacheEntryStatsCollector *>(value);
}

std::mutex mutex_;
std::mutex saved_mutex_;
Stats saved_stats_;

std::mutex working_mutex_;
Stats working_stats_;
uint64_t last_start_time_micros_;
uint64_t last_end_time_micros_;

Cache *const cache_;
SystemClock *const clock_;
};
Expand Down
1 change: 1 addition & 0 deletions cache/lru_cache.h
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
#include <string>

#include "cache/sharded_cache.h"
#include "port/lang.h"
#include "port/malloc.h"
#include "port/port.h"
#include "rocksdb/secondary_cache.h"
Expand Down
85 changes: 78 additions & 7 deletions db/db_block_cache_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@

#include "cache/cache_entry_roles.h"
#include "cache/lru_cache.h"
#include "db/column_family.h"
#include "db/db_test_util.h"
#include "port/stack_trace.h"
#include "rocksdb/table.h"
Expand Down Expand Up @@ -152,13 +153,15 @@ class DBBlockCacheTest : public DBTestBase {
}

#ifndef ROCKSDB_LITE
const std::array<size_t, kNumCacheEntryRoles>& GetCacheEntryRoleCountsBg() {
const std::array<size_t, kNumCacheEntryRoles> GetCacheEntryRoleCountsBg() {
// Verify in cache entry role stats
ColumnFamilyHandleImpl* cfh =
static_cast<ColumnFamilyHandleImpl*>(dbfull()->DefaultColumnFamily());
InternalStats* internal_stats_ptr = cfh->cfd()->internal_stats();
return internal_stats_ptr->TEST_GetCacheEntryRoleStats(/*foreground=*/false)
.entry_counts;
InternalStats::CacheEntryRoleStats stats;
internal_stats_ptr->TEST_GetCacheEntryRoleStats(&stats,
/*foreground=*/false);
return stats.entry_counts;
}
#endif // ROCKSDB_LITE
};
Expand All @@ -170,7 +173,13 @@ TEST_F(DBBlockCacheTest, IteratorBlockCacheUsage) {
auto options = GetOptions(table_options);
InitTable(options);

std::shared_ptr<Cache> cache = NewLRUCache(0, 0, false);
LRUCacheOptions co;
co.capacity = 0;
co.num_shard_bits = 0;
co.strict_capacity_limit = false;
// Needed not to count entry stats collector
co.metadata_charge_policy = kDontChargeCacheMetadata;
std::shared_ptr<Cache> cache = NewLRUCache(co);
table_options.block_cache = cache;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
Reopen(options);
Expand All @@ -194,7 +203,13 @@ TEST_F(DBBlockCacheTest, TestWithoutCompressedBlockCache) {
auto options = GetOptions(table_options);
InitTable(options);

std::shared_ptr<Cache> cache = NewLRUCache(0, 0, false);
LRUCacheOptions co;
co.capacity = 0;
co.num_shard_bits = 0;
co.strict_capacity_limit = false;
// Needed not to count entry stats collector
co.metadata_charge_policy = kDontChargeCacheMetadata;
std::shared_ptr<Cache> cache = NewLRUCache(co);
table_options.block_cache = cache;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
Reopen(options);
Expand Down Expand Up @@ -265,7 +280,13 @@ TEST_F(DBBlockCacheTest, TestWithCompressedBlockCache) {

ReadOptions read_options;
std::shared_ptr<Cache> compressed_cache = NewLRUCache(1 << 25, 0, false);
std::shared_ptr<Cache> cache = NewLRUCache(0, 0, false);
LRUCacheOptions co;
co.capacity = 0;
co.num_shard_bits = 0;
co.strict_capacity_limit = false;
// Needed not to count entry stats collector
co.metadata_charge_policy = kDontChargeCacheMetadata;
std::shared_ptr<Cache> cache = NewLRUCache(co);
table_options.block_cache = cache;
table_options.no_block_cache = false;
table_options.block_cache_compressed = compressed_cache;
Expand Down Expand Up @@ -944,10 +965,15 @@ TEST_F(DBBlockCacheTest, CacheCompressionDict) {
}

static void ClearCache(Cache* cache) {
auto roles = CopyCacheDeleterRoleMap();
std::deque<std::string> keys;
Cache::ApplyToAllEntriesOptions opts;
auto callback = [&](const Slice& key, void* /*value*/, size_t /*charge*/,
Cache::DeleterFn /*deleter*/) {
Cache::DeleterFn deleter) {
if (roles.find(deleter) == roles.end()) {
// Keep the stats collector
return;
}
keys.push_back(key.ToString());
};
cache->ApplyToAllEntries(callback, opts);
Expand Down Expand Up @@ -1126,6 +1152,9 @@ TEST_F(DBBlockCacheTest, CacheEntryRoleStats) {
&h, Cache::Priority::HIGH));
ASSERT_GT(cache->GetUsage(), cache->GetCapacity());
expected = {};
// For CacheEntryStatsCollector
expected[static_cast<size_t>(CacheEntryRole::kMisc)] = 1;
// For Fill-it-up
expected[static_cast<size_t>(CacheEntryRole::kMisc)]++;
// Still able to hit on saved stats
EXPECT_EQ(prev_expected, GetCacheEntryRoleCountsBg());
Expand All @@ -1134,6 +1163,48 @@ TEST_F(DBBlockCacheTest, CacheEntryRoleStats) {
EXPECT_EQ(expected, GetCacheEntryRoleCountsBg());

cache->Release(h);

// Now we test that the DB mutex is not held during scans, for the ways
// we know how to (possibly) trigger them. Without a better good way to
// check this, we simply inject an acquire & release of the DB mutex
// deep in the stat collection code. If we were already holding the
// mutex, that is UB that would at least be found by TSAN.
int scan_count = 0;
SyncPoint::GetInstance()->SetCallBack(
"CacheEntryStatsCollector::GetStats:AfterApplyToAllEntries",
[this, &scan_count](void*) {
dbfull()->TEST_LockMutex();
dbfull()->TEST_UnlockMutex();
++scan_count;
});
SyncPoint::GetInstance()->EnableProcessing();

// Different things that might trigger a scan, with mock sleeps to
// force a miss.
env_->MockSleepForSeconds(10000);
dbfull()->DumpStats();
ASSERT_EQ(scan_count, 1);

env_->MockSleepForSeconds(10000);
ASSERT_TRUE(
db_->GetMapProperty(DB::Properties::kBlockCacheEntryStats, &values));
ASSERT_EQ(scan_count, 2);

env_->MockSleepForSeconds(10000);
std::string value_str;
ASSERT_TRUE(
db_->GetProperty(DB::Properties::kBlockCacheEntryStats, &value_str));
ASSERT_EQ(scan_count, 3);

env_->MockSleepForSeconds(10000);
ASSERT_TRUE(db_->GetProperty(DB::Properties::kCFStats, &value_str));
// To match historical speed, querying this property no longer triggers
// a scan, even if results are old. But periodic dump stats should keep
// things reasonably updated.
ASSERT_EQ(scan_count, /*unchanged*/ 3);

SyncPoint::GetInstance()->DisableProcessing();
SyncPoint::GetInstance()->ClearAllCallBacks();
}
EXPECT_GE(iterations_tested, 1);
}
Expand Down
47 changes: 36 additions & 11 deletions db/db_impl/db_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,7 @@
#include "memtable/hash_linklist_rep.h"
#include "memtable/hash_skiplist_rep.h"
#include "monitoring/in_memory_stats_history.h"
#include "monitoring/instrumented_mutex.h"
#include "monitoring/iostats_context_imp.h"
#include "monitoring/perf_context_imp.h"
#include "monitoring/persistent_stats_history.h"
Expand Down Expand Up @@ -944,18 +945,31 @@ void DBImpl::DumpStats() {
if (shutdown_initiated_) {
return;
}

TEST_SYNC_POINT("DBImpl::DumpStats:StartRunning");
{
InstrumentedMutexLock l(&mutex_);
for (auto cfd : *versions_->GetColumnFamilySet()) {
if (cfd->initialized()) {
// Release DB mutex for gathering cache entry stats. Pass over all
// column families for this first so that other stats are dumped
// near-atomically.
InstrumentedMutexUnlock u(&mutex_);
cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
}
}

const std::string* property = &DB::Properties::kDBStats;
const DBPropertyInfo* property_info = GetPropertyInfo(*property);
assert(property_info != nullptr);
assert(!property_info->need_out_of_mutex);
default_cf_internal_stats_->GetStringProperty(*property_info, *property,
&stats);

property = &DB::Properties::kCFStatsNoFileHistogram;
property_info = GetPropertyInfo(*property);
assert(property_info != nullptr);
assert(!property_info->need_out_of_mutex);
for (auto cfd : *versions_->GetColumnFamilySet()) {
if (cfd->initialized()) {
cfd->internal_stats()->GetStringProperty(*property_info, *property,
Expand All @@ -966,6 +980,7 @@ void DBImpl::DumpStats() {
property = &DB::Properties::kCFFileHistogram;
property_info = GetPropertyInfo(*property);
assert(property_info != nullptr);
assert(!property_info->need_out_of_mutex);
for (auto cfd : *versions_->GetColumnFamilySet()) {
if (cfd->initialized()) {
cfd->internal_stats()->GetStringProperty(*property_info, *property,
Expand Down Expand Up @@ -3264,16 +3279,21 @@ bool DBImpl::GetProperty(ColumnFamilyHandle* column_family,
}
return ret_value;
} else if (property_info->handle_string) {
InstrumentedMutexLock l(&mutex_);
return cfd->internal_stats()->GetStringProperty(*property_info, property,
value);
if (property_info->need_out_of_mutex) {
return cfd->internal_stats()->GetStringProperty(*property_info, property,
value);
} else {
InstrumentedMutexLock l(&mutex_);
return cfd->internal_stats()->GetStringProperty(*property_info, property,
value);
}
} else if (property_info->handle_string_dbimpl) {
std::string tmp_value;
bool ret_value = (this->*(property_info->handle_string_dbimpl))(&tmp_value);
if (ret_value) {
*value = tmp_value;
if (property_info->need_out_of_mutex) {
return (this->*(property_info->handle_string_dbimpl))(value);
} else {
InstrumentedMutexLock l(&mutex_);
return (this->*(property_info->handle_string_dbimpl))(value);
}
return ret_value;
}
// Shouldn't reach here since exactly one of handle_string and handle_int
// should be non-nullptr.
Expand All @@ -3291,9 +3311,14 @@ bool DBImpl::GetMapProperty(ColumnFamilyHandle* column_family,
if (property_info == nullptr) {
return false;
} else if (property_info->handle_map) {
InstrumentedMutexLock l(&mutex_);
return cfd->internal_stats()->GetMapProperty(*property_info, property,
value);
if (property_info->need_out_of_mutex) {
return cfd->internal_stats()->GetMapProperty(*property_info, property,
value);
} else {
InstrumentedMutexLock l(&mutex_);
return cfd->internal_stats()->GetMapProperty(*property_info, property,
value);
}
}
// If we reach this point it means that handle_map is not provided for the
// requested property
Expand Down
Loading

0 comments on commit df5dc73

Please sign in to comment.