Compare memtable insert and flush count (#8288)
Summary:
When a memtable is flushed, the flush validates the number of entries it reads and compares it with the number of entries that were inserted into the memtable. This serves as one sanity check against memory corruption. The change also allows more counters to be added in the future for better validation.
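A minimal usage sketch of how this check surfaces to applications; the option name `flush_verify_memtable_count` is taken from the flush_job.cc diff below, the rest is the standard RocksDB C++ API, and the path and program are hypothetical:

```cpp
// Sketch (not part of this commit): exercising flush-time entry-count
// verification from application code.
#include <cassert>
#include <iostream>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Default is true: if the number of entries read during flush does not
  // match the number of entries inserted into the memtable, the flush
  // fails with a Corruption status. Setting it to false would instead
  // only log a warning on a mismatch.
  options.flush_verify_memtable_count = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/flush_verify_example", &db);
  assert(s.ok());

  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  assert(s.ok());

  // Force a flush. Under memory corruption the two counts could disagree,
  // and this call would then return a Corruption status.
  s = db->Flush(rocksdb::FlushOptions());
  if (!s.ok()) {
    std::cerr << "Flush failed: " << s.ToString() << std::endl;
  }

  delete db;
  return 0;
}
```

When verification is disabled, the mismatch is still reported via ROCKS_LOG_WARN, as the flush_job.cc change below shows.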

Pull Request resolved: facebook/rocksdb#8288

Test Plan: Pass all existing tests

Reviewed By: ajkr

Differential Revision: D28369194

fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
siying authored and facebook-github-bot committed May 20, 2021
1 parent 94b4faa commit 2f1984d
Showing 14 changed files with 134 additions and 36 deletions.
1 change: 1 addition & 0 deletions HISTORY.md
@@ -18,6 +18,7 @@
* Added more fields to FilterBuildingContext with LSM details, for custom filter policies that vary behavior based on where they are in the LSM-tree.
* Added DB::Properties::kBlockCacheEntryStats for querying statistics on what percentage of block cache is used by various kinds of blocks, etc. using DB::GetProperty and DB::GetMapProperty. The same information is now dumped to info LOG periodically according to `stats_dump_period_sec`.
* Add an experimental Remote Compaction feature, which allows the user to run Compaction on a different host or process. The feature is still under development, currently only works on some basic use cases. The interface will be changed without backward/forward compatibility support.
* During flush, RocksDB now validates the total number of entries read from the memtable against the number of entries inserted into it. If `flush_verify_memtable_count = true` (default), a mismatch causes the flush to fail; otherwise, the mismatch is only logged to the info log.

### Performance Improvements
* BlockPrefetcher is used by iterators to prefetch data if they anticipate more data to be used in future. It is enabled implicitly by rocksdb. Added change to take in account read pattern if reads are sequential. This would disable prefetching for random reads in MultiGet and iterators as readahead_size is increased exponential doing large prefetches.
9 changes: 8 additions & 1 deletion db/builder.cc
@@ -69,7 +69,7 @@ Status BuildTable(
int job_id, const Env::IOPriority io_priority,
TableProperties* table_properties, Env::WriteLifeTimeHint write_hint,
const std::string* full_history_ts_low,
BlobFileCompletionCallback* blob_callback) {
BlobFileCompletionCallback* blob_callback, uint64_t* num_input_entries) {
assert((tboptions.column_family_id ==
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
tboptions.column_family_name.empty());
@@ -88,7 +88,10 @@ Status BuildTable(
std::unique_ptr<CompactionRangeDelAggregator> range_del_agg(
new CompactionRangeDelAggregator(&tboptions.internal_comparator,
snapshots));
uint64_t num_unfragmented_tombstones = 0;
for (auto& range_del_iter : range_del_iters) {
num_unfragmented_tombstones +=
range_del_iter->num_unfragmented_tombstones();
range_del_agg->AddTombstones(std::move(range_del_iter));
}

@@ -231,6 +234,10 @@ Status BuildTable(

TEST_SYNC_POINT("BuildTable:BeforeFinishBuildTable");
const bool empty = builder->IsEmpty();
if (num_input_entries != nullptr) {
*num_input_entries =
c_iter.num_input_entry_scanned() + num_unfragmented_tombstones;
}
if (!s.ok() || empty) {
builder->Abandon();
} else {
3 changes: 2 additions & 1 deletion db/builder.h
@@ -65,6 +65,7 @@ extern Status BuildTable(
TableProperties* table_properties = nullptr,
Env::WriteLifeTimeHint write_hint = Env::WLTH_NOT_SET,
const std::string* full_history_ts_low = nullptr,
BlobFileCompletionCallback* blob_callback = nullptr);
BlobFileCompletionCallback* blob_callback = nullptr,
uint64_t* num_input_entries = nullptr);

} // namespace ROCKSDB_NAMESPACE
52 changes: 27 additions & 25 deletions db/compaction/compaction_iterator.cc
@@ -33,7 +33,6 @@
(snapshot_checker_ == nullptr || LIKELY(IsInEarliestSnapshot(seq))))

namespace ROCKSDB_NAMESPACE {

CompactionIterator::CompactionIterator(
InternalIterator* input, const Comparator* cmp, MergeHelper* merge_helper,
SequenceNumber last_sequence, std::vector<SequenceNumber>* snapshots,
@@ -73,7 +72,10 @@ CompactionIterator::CompactionIterator(
const std::atomic<int>* manual_compaction_paused,
const std::shared_ptr<Logger> info_log,
const std::string* full_history_ts_low)
: input_(input),
: input_(
input, cmp,
compaction ==
nullptr), // Now only need to count number of entries in flush.
cmp_(cmp),
merge_helper_(merge_helper),
snapshots_(snapshots),
@@ -130,13 +132,13 @@ CompactionIterator::CompactionIterator(
assert(timestamp_size_ == 0 || !full_history_ts_low_ ||
timestamp_size_ == full_history_ts_low_->size());
#endif
input_->SetPinnedItersMgr(&pinned_iters_mgr_);
input_.SetPinnedItersMgr(&pinned_iters_mgr_);
TEST_SYNC_POINT_CALLBACK("CompactionIterator:AfterInit", compaction_.get());
}

CompactionIterator::~CompactionIterator() {
// input_ Iterator lifetime is longer than pinned_iters_mgr_ lifetime
input_->SetPinnedItersMgr(nullptr);
input_.SetPinnedItersMgr(nullptr);
}

void CompactionIterator::ResetRecordCounts() {
@@ -189,7 +191,7 @@ void CompactionIterator::Next() {
// Only advance the input iterator if there is no merge output and the
// iterator is not already at the next record.
if (!at_next_) {
input_->Next();
AdvanceInputIter();
}
NextFromInput();
}
@@ -356,10 +358,10 @@ void CompactionIterator::NextFromInput() {
at_next_ = false;
valid_ = false;

while (!valid_ && input_->Valid() && !IsPausingManualCompaction() &&
while (!valid_ && input_.Valid() && !IsPausingManualCompaction() &&
!IsShuttingDown()) {
key_ = input_->key();
value_ = input_->value();
key_ = input_.key();
value_ = input_.value();
iter_stats_.num_input_records++;

Status pik_status = ParseInternalKey(key_, &ikey_, allow_data_in_errors_);
@@ -559,12 +561,12 @@ void CompactionIterator::NextFromInput() {
// The easiest way to process a SingleDelete during iteration is to peek
// ahead at the next key.
ParsedInternalKey next_ikey;
input_->Next();
AdvanceInputIter();

// Check whether the next key exists, is not corrupt, and is the same key
// as the single delete.
if (input_->Valid() &&
ParseInternalKey(input_->key(), &next_ikey, allow_data_in_errors_)
if (input_.Valid() &&
ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok() &&
cmp_->Equal(ikey_.user_key, next_ikey.user_key)) {
// Check whether the next key belongs to the same snapshot as the
@@ -578,7 +580,7 @@
// to handle the second SingleDelete

// First SingleDelete has been skipped since we already called
// input_->Next().
// input_.Next().
++iter_stats_.num_record_drop_obsolete;
++iter_stats_.num_single_del_mismatch;
} else if (has_outputted_key_ ||
@@ -600,9 +602,9 @@

++iter_stats_.num_record_drop_hidden;
++iter_stats_.num_record_drop_obsolete;
// Already called input_->Next() once. Call it a second time to
// Already called input_.Next() once. Call it a second time to
// skip past the second key.
input_->Next();
AdvanceInputIter();
} else {
// Found a matching value, but we cannot drop both keys since
// there is an earlier snapshot and we need to leave behind a record
@@ -670,7 +672,7 @@ void CompactionIterator::NextFromInput() {
}

++iter_stats_.num_record_drop_hidden; // rule (A)
input_->Next();
AdvanceInputIter();
} else if (compaction_ != nullptr &&
(ikey_.type == kTypeDeletion ||
(ikey_.type == kTypeDeletionWithTimestamp &&
@@ -706,7 +708,7 @@ void CompactionIterator::NextFromInput() {
if (!bottommost_level_) {
++iter_stats_.num_optimized_del_drop_obsolete;
}
input_->Next();
AdvanceInputIter();
} else if ((ikey_.type == kTypeDeletion ||
(ikey_.type == kTypeDeletionWithTimestamp &&
cmp_with_history_ts_low_ < 0)) &&
@@ -717,26 +719,26 @@
assert(!compaction_ || compaction_->KeyNotExistsBeyondOutputLevel(
ikey_.user_key, &level_ptrs_));
ParsedInternalKey next_ikey;
input_->Next();
AdvanceInputIter();
// Skip over all versions of this key that happen to occur in the same
// snapshot range as the delete.
//
// Note that a deletion marker of type kTypeDeletionWithTimestamp will be
// considered to have a different user key unless the timestamp is older
// than *full_history_ts_low_.
while (!IsPausingManualCompaction() && !IsShuttingDown() &&
input_->Valid() &&
(ParseInternalKey(input_->key(), &next_ikey, allow_data_in_errors_)
input_.Valid() &&
(ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok()) &&
cmp_->EqualWithoutTimestamp(ikey_.user_key, next_ikey.user_key) &&
(prev_snapshot == 0 ||
DEFINITELY_NOT_IN_SNAPSHOT(next_ikey.sequence, prev_snapshot))) {
input_->Next();
AdvanceInputIter();
}
// If you find you still need to output a row with this key, we need to output the
// delete too
if (input_->Valid() &&
(ParseInternalKey(input_->key(), &next_ikey, allow_data_in_errors_)
if (input_.Valid() &&
(ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok()) &&
cmp_->EqualWithoutTimestamp(ikey_.user_key, next_ikey.user_key)) {
valid_ = true;
@@ -755,7 +757,7 @@
// We encapsulate the merge related state machine in a different
// object to minimize change to the existing flow.
Status s =
merge_helper_->MergeUntil(input_, range_del_agg_, prev_snapshot,
merge_helper_->MergeUntil(&input_, range_del_agg_, prev_snapshot,
bottommost_level_, allow_data_in_errors_);
merge_out_iter_.SeekToFirst();

@@ -799,14 +801,14 @@ void CompactionIterator::NextFromInput() {
if (should_delete) {
++iter_stats_.num_record_drop_hidden;
++iter_stats_.num_record_drop_range_del;
input_->Next();
AdvanceInputIter();
} else {
valid_ = true;
}
}

if (need_skip) {
input_->Seek(skip_until);
SkipUntil(skip_until);
}
}

50 changes: 49 additions & 1 deletion db/compaction/compaction_iterator.h
@@ -24,6 +24,49 @@ namespace ROCKSDB_NAMESPACE {

class BlobFileBuilder;

// A wrapper of internal iterator whose purpose is to count how
// many entries there are in the iterator.
class SequenceIterWrapper : public InternalIterator {
public:
SequenceIterWrapper(InternalIterator* iter, const Comparator* cmp,
bool need_count_entries)
: cmp_(cmp), inner_iter_(iter), need_count_entries_(need_count_entries) {}
bool Valid() const override { return inner_iter_->Valid(); }
Status status() const override { return inner_iter_->status(); }
void Next() override {
num_itered_++;
inner_iter_->Next();
}
void Seek(const Slice& target) override {
if (!need_count_entries_) {
inner_iter_->Seek(target);
} else {
// For flush cases, we need to count total number of entries, so we
// do Next() rather than Seek().
while (inner_iter_->Valid() &&
cmp_->Compare(inner_iter_->key(), target) < 0) {
Next();
}
}
}
Slice key() const override { return inner_iter_->key(); }
Slice value() const override { return inner_iter_->value(); }

// Unused InternalIterator methods
void SeekToFirst() override { assert(false); }
void Prev() override { assert(false); }
void SeekForPrev(const Slice& /* target */) override { assert(false); }
void SeekToLast() override { assert(false); }

uint64_t num_itered() const { return num_itered_; }

private:
const Comparator* cmp_; // not owned
InternalIterator* inner_iter_; // not owned
uint64_t num_itered_ = 0;
bool need_count_entries_;
};

class CompactionIterator {
public:
// A wrapper around Compaction. Has a much smaller interface, only what
@@ -162,6 +205,7 @@ class CompactionIterator {
bool Valid() const { return valid_; }
const Slice& user_key() const { return current_user_key_; }
const CompactionIterationStats& iter_stats() const { return iter_stats_; }
uint64_t num_input_entry_scanned() const { return input_.num_itered(); }

private:
// Processes the input stream to find the next output
@@ -234,7 +278,7 @@ class CompactionIterator {
static uint64_t ComputeBlobGarbageCollectionCutoffFileNumber(
const CompactionProxy* compaction);

InternalIterator* input_;
SequenceIterWrapper input_;
const Comparator* cmp_;
MergeHelper* merge_helper_;
const std::vector<SequenceNumber>* snapshots_;
@@ -342,6 +386,10 @@ class CompactionIterator {

const int level_;

void AdvanceInputIter() { input_.Next(); }

void SkipUntil(const Slice& skip_until) { input_.Seek(skip_until); }

bool IsShuttingDown() {
// This is a best-effort facility, so memory_order_relaxed is sufficient.
return shutting_down_ && shutting_down_->load(std::memory_order_relaxed);
2 changes: 1 addition & 1 deletion db/db_test2.cc
@@ -4988,7 +4988,7 @@ TEST_F(DBTest2, SameSmallestInSameLevel) {
ASSERT_OK(Put("key", "2"));
ASSERT_OK(db_->Merge(WriteOptions(), "key", "3"));
ASSERT_OK(db_->Merge(WriteOptions(), "key", "4"));
Flush();
ASSERT_OK(Flush());
CompactRangeOptions cro;
cro.change_level = true;
cro.target_level = 2;
16 changes: 14 additions & 2 deletions db/flush_job.cc
@@ -166,7 +166,6 @@ void FlushJob::RecordFlushIOStats() {
ThreadStatus::FLUSH_BYTES_WRITTEN, IOSTATS(bytes_written));
IOSTATS_RESET(bytes_written);
}

void FlushJob::PickMemTable() {
db_mutex_->AssertHeld();
assert(!pick_memtable_called);
@@ -403,6 +402,7 @@ Status FlushJob::WriteLevel0Table() {
? current_time
: meta_.oldest_ancester_time;

uint64_t num_input_entries = 0;
IOStatus io_s;
const std::string* const full_history_ts_low =
(full_history_ts_low_.empty()) ? nullptr : &full_history_ts_low_;
@@ -420,10 +420,22 @@
earliest_write_conflict_snapshot_, snapshot_checker_,
mutable_cf_options_.paranoid_file_checks, cfd_->internal_stats(),
&io_s, io_tracer_, event_logger_, job_context_->job_id, Env::IO_HIGH,
&table_properties_, write_hint, full_history_ts_low, blob_callback_);
&table_properties_, write_hint, full_history_ts_low, blob_callback_,
&num_input_entries);
if (!io_s.ok()) {
io_status_ = io_s;
}
if (num_input_entries != total_num_entries && s.ok()) {
std::string msg = "Expected " + ToString(total_num_entries) +
" entries in memtables, but read " +
ToString(num_input_entries);
ROCKS_LOG_WARN(db_options_.info_log, "[%s] [JOB %d] Level-0 flush %s",
cfd_->GetName().c_str(), job_context_->job_id,
msg.c_str());
if (db_options_.flush_verify_memtable_count) {
s = Status::Corruption(msg);
}
}
LogFlush(db_options_.info_log);
}
ROCKS_LOG_INFO(db_options_.info_log,
10 changes: 5 additions & 5 deletions db/range_tombstone_fragmenter.cc
@@ -25,12 +25,12 @@ FragmentedRangeTombstoneList::FragmentedRangeTombstoneList(
return;
}
bool is_sorted = true;
int num_tombstones = 0;
InternalKey pinned_last_start_key;
Slice last_start_key;
num_unfragmented_tombstones_ = 0;
for (unfragmented_tombstones->SeekToFirst(); unfragmented_tombstones->Valid();
unfragmented_tombstones->Next(), num_tombstones++) {
if (num_tombstones > 0 &&
unfragmented_tombstones->Next(), num_unfragmented_tombstones_++) {
if (num_unfragmented_tombstones_ > 0 &&
icmp.Compare(last_start_key, unfragmented_tombstones->key()) > 0) {
is_sorted = false;
break;
@@ -50,8 +50,8 @@

// Sort the tombstones before fragmenting them.
std::vector<std::string> keys, values;
keys.reserve(num_tombstones);
values.reserve(num_tombstones);
keys.reserve(num_unfragmented_tombstones_);
values.reserve(num_unfragmented_tombstones_);
for (unfragmented_tombstones->SeekToFirst(); unfragmented_tombstones->Valid();
unfragmented_tombstones->Next()) {
keys.emplace_back(unfragmented_tombstones->key().data(),