Skip to content

Commit

Permalink
Refactor AddRangeDels() + consider range tombstone during compaction …
Browse files Browse the repository at this point in the history
…file cutting (facebook#11113)

Summary:
A second attempt after facebook#10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (facebook#4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).

The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.

Pull Request resolved: facebook#11113

Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`

Reviewed By: ajkr

Differential Revision: D42655709

Pulled By: cbi42

fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
  • Loading branch information
cbi42 authored and facebook-github-bot committed Feb 22, 2023
1 parent 9fa9bec commit 229297d
Show file tree
Hide file tree
Showing 26 changed files with 1,408 additions and 580 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -838,6 +838,7 @@ set(SOURCES
table/get_context.cc
table/iterator.cc
table/merging_iterator.cc
table/compaction_merging_iterator.cc
table/meta_blocks.cc
table/persistent_cache_helper.cc
table/plain/plain_table_bloom.cc
Expand Down
2 changes: 2 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# Rocksdb Change Log
## Unreleased
### Behavior changes
* Compaction output file cutting logic now considers range tombstone start keys. For example, SST partitioner now may receive ParitionRequest for range tombstone start keys.

## 8.0.0 (02/19/2023)
### Behavior changes
Expand Down
2 changes: 2 additions & 0 deletions TARGETS
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,7 @@ cpp_library_wrapper(name="rocksdb_lib", srcs=[
"table/block_based/reader_common.cc",
"table/block_based/uncompression_dict_reader.cc",
"table/block_fetcher.cc",
"table/compaction_merging_iterator.cc",
"table/cuckoo/cuckoo_table_builder.cc",
"table/cuckoo/cuckoo_table_factory.cc",
"table/cuckoo/cuckoo_table_reader.cc",
Expand Down Expand Up @@ -543,6 +544,7 @@ cpp_library_wrapper(name="rocksdb_whole_archive_lib", srcs=[
"table/block_based/reader_common.cc",
"table/block_based/uncompression_dict_reader.cc",
"table/block_fetcher.cc",
"table/compaction_merging_iterator.cc",
"table/cuckoo/cuckoo_table_builder.cc",
"table/cuckoo/cuckoo_table_factory.cc",
"table/cuckoo/cuckoo_table_reader.cc",
Expand Down
4 changes: 4 additions & 0 deletions db/blob/blob_counting_iterator.h
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,10 @@ class BlobCountingIterator : public InternalIterator {
return iter_->GetProperty(prop_name, prop);
}

bool IsDeleteRangeSentinelKey() const override {
return iter_->IsDeleteRangeSentinelKey();
}

private:
void UpdateAndCountBlobIfNeeded() {
assert(!iter_->Valid() || iter_->status().ok());
Expand Down
5 changes: 5 additions & 0 deletions db/compaction/clipping_iterator.h
Original file line number Diff line number Diff line change
Expand Up @@ -188,6 +188,11 @@ class ClippingIterator : public InternalIterator {
return iter_->GetProperty(prop_name, prop);
}

bool IsDeleteRangeSentinelKey() const override {
assert(valid_);
return iter_->IsDeleteRangeSentinelKey();
}

private:
void UpdateValid() {
assert(!iter_->Valid() || iter_->status().ok());
Expand Down
31 changes: 24 additions & 7 deletions db/compaction/compaction_iterator.cc
Original file line number Diff line number Diff line change
Expand Up @@ -464,6 +464,7 @@ void CompactionIterator::NextFromInput() {
value_ = input_.value();
blob_value_.Reset();
iter_stats_.num_input_records++;
is_range_del_ = input_.IsDeleteRangeSentinelKey();

Status pik_status = ParseInternalKey(key_, &ikey_, allow_data_in_errors_);
if (!pik_status.ok()) {
Expand All @@ -483,7 +484,10 @@ void CompactionIterator::NextFromInput() {
break;
}
TEST_SYNC_POINT_CALLBACK("CompactionIterator:ProcessKV", &ikey_);

if (is_range_del_) {
validity_info_.SetValid(kRangeDeletion);
break;
}
// Update input statistics
if (ikey_.type == kTypeDeletion || ikey_.type == kTypeSingleDeletion ||
ikey_.type == kTypeDeletionWithTimestamp) {
Expand Down Expand Up @@ -705,13 +709,22 @@ void CompactionIterator::NextFromInput() {

ParsedInternalKey next_ikey;
AdvanceInputIter();
while (input_.Valid() && input_.IsDeleteRangeSentinelKey() &&
ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok() &&
cmp_->EqualWithoutTimestamp(ikey_.user_key, next_ikey.user_key)) {
// skip range tombstone start keys with the same user key
// since they are not "real" point keys.
AdvanceInputIter();
}

// Check whether the next key exists, is not corrupt, and is the same key
// as the single delete.
if (input_.Valid() &&
ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok() &&
cmp_->EqualWithoutTimestamp(ikey_.user_key, next_ikey.user_key)) {
assert(!input_.IsDeleteRangeSentinelKey());
#ifndef NDEBUG
const Compaction* c =
compaction_ ? compaction_->real_compaction() : nullptr;
Expand Down Expand Up @@ -936,12 +949,14 @@ void CompactionIterator::NextFromInput() {
// Note that a deletion marker of type kTypeDeletionWithTimestamp will be
// considered to have a different user key unless the timestamp is older
// than *full_history_ts_low_.
//
// Range tombstone start keys are skipped as they are not "real" keys.
while (!IsPausingManualCompaction() && !IsShuttingDown() &&
input_.Valid() &&
(ParseInternalKey(input_.key(), &next_ikey, allow_data_in_errors_)
.ok()) &&
cmp_->EqualWithoutTimestamp(ikey_.user_key, next_ikey.user_key) &&
(prev_snapshot == 0 ||
(prev_snapshot == 0 || input_.IsDeleteRangeSentinelKey() ||
DefinitelyNotInSnapshot(next_ikey.sequence, prev_snapshot))) {
AdvanceInputIter();
}
Expand Down Expand Up @@ -1235,10 +1250,12 @@ void CompactionIterator::DecideOutputLevel() {

void CompactionIterator::PrepareOutput() {
if (Valid()) {
if (ikey_.type == kTypeValue) {
ExtractLargeValueIfNeeded();
} else if (ikey_.type == kTypeBlobIndex) {
GarbageCollectBlobIfNeeded();
if (LIKELY(!is_range_del_)) {
if (ikey_.type == kTypeValue) {
ExtractLargeValueIfNeeded();
} else if (ikey_.type == kTypeBlobIndex) {
GarbageCollectBlobIfNeeded();
}
}

if (compaction_ != nullptr && compaction_->SupportsPerKeyPlacement()) {
Expand All @@ -1261,7 +1278,7 @@ void CompactionIterator::PrepareOutput() {
DefinitelyInSnapshot(ikey_.sequence, earliest_snapshot_) &&
ikey_.type != kTypeMerge && current_key_committed_ &&
!output_to_penultimate_level_ &&
ikey_.sequence < preserve_time_min_seqno_) {
ikey_.sequence < preserve_time_min_seqno_ && !is_range_del_) {
if (ikey_.type == kTypeDeletion ||
(ikey_.type == kTypeSingleDeletion && timestamp_size_ == 0)) {
ROCKS_LOG_FATAL(
Expand Down
18 changes: 17 additions & 1 deletion db/compaction/compaction_iterator.h
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,10 @@ class SequenceIterWrapper : public InternalIterator {
void SeekToLast() override { assert(false); }

uint64_t num_itered() const { return num_itered_; }
bool IsDeleteRangeSentinelKey() const override {
assert(Valid());
return inner_iter_->IsDeleteRangeSentinelKey();
}

private:
InternalKeyComparator icmp_;
Expand Down Expand Up @@ -242,7 +246,12 @@ class CompactionIterator {
const Status& status() const { return status_; }
const ParsedInternalKey& ikey() const { return ikey_; }
inline bool Valid() const { return validity_info_.IsValid(); }
const Slice& user_key() const { return current_user_key_; }
const Slice& user_key() const {
if (UNLIKELY(is_range_del_)) {
return ikey_.user_key;
}
return current_user_key_;
}
const CompactionIterationStats& iter_stats() const { return iter_stats_; }
uint64_t num_input_entry_scanned() const { return input_.num_itered(); }
// If the current key should be placed on penultimate level, only valid if
Expand All @@ -252,6 +261,8 @@ class CompactionIterator {
}
Status InputStatus() const { return input_.status(); }

bool IsDeleteRangeSentinelKey() const { return is_range_del_; }

private:
// Processes the input stream to find the next output
void NextFromInput();
Expand Down Expand Up @@ -385,6 +396,7 @@ class CompactionIterator {
kKeepSD = 8,
kKeepDel = 9,
kNewUserKey = 10,
kRangeDeletion = 11,
};

struct ValidityInfo {
Expand Down Expand Up @@ -493,6 +505,10 @@ class CompactionIterator {
// This is a best-effort facility, so memory_order_relaxed is sufficient.
return manual_compaction_canceled_.load(std::memory_order_relaxed);
}

// Stores whether the current compaction iterator output
// is a range tombstone start key.
bool is_range_del_{false};
};

inline bool CompactionIterator::DefinitelyInSnapshot(SequenceNumber seq,
Expand Down
30 changes: 18 additions & 12 deletions db/compaction/compaction_job.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1118,6 +1118,8 @@ void CompactionJob::ProcessKeyValueCompaction(SubcompactionState* sub_compact) {
IterKey end_ikey;
Slice start_slice;
Slice end_slice;
Slice start_user_key{};
Slice end_user_key{};

static constexpr char kMaxTs[] =
"\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff";
Expand All @@ -1140,6 +1142,7 @@ void CompactionJob::ProcessKeyValueCompaction(SubcompactionState* sub_compact) {
&ts_slice);
}
start_slice = start_ikey.GetInternalKey();
start_user_key = start_ikey.GetUserKey();
}
if (end.has_value()) {
end_ikey.SetInternalKey(end.value(), kMaxSequenceNumber, kValueTypeForSeek);
Expand All @@ -1148,6 +1151,7 @@ void CompactionJob::ProcessKeyValueCompaction(SubcompactionState* sub_compact) {
&ts_slice);
}
end_slice = end_ikey.GetInternalKey();
end_user_key = end_ikey.GetUserKey();
}

std::unique_ptr<InternalIterator> clip;
Expand Down Expand Up @@ -1263,11 +1267,15 @@ void CompactionJob::ProcessKeyValueCompaction(SubcompactionState* sub_compact) {
[this, sub_compact](CompactionOutputs& outputs) {
return this->OpenCompactionOutputFile(sub_compact, outputs);
};

const CompactionFileCloseFunc close_file_func =
[this, sub_compact](CompactionOutputs& outputs, const Status& status,
const Slice& next_table_min_key) {
return this->FinishCompactionOutputFile(status, sub_compact, outputs,
next_table_min_key);
[this, sub_compact, start_user_key, end_user_key](
CompactionOutputs& outputs, const Status& status,
const Slice& next_table_min_key) {
return this->FinishCompactionOutputFile(
status, sub_compact, outputs, next_table_min_key,
sub_compact->start.has_value() ? &start_user_key : nullptr,
sub_compact->end.has_value() ? &end_user_key : nullptr);
};

Status status;
Expand All @@ -1278,7 +1286,6 @@ void CompactionJob::ProcessKeyValueCompaction(SubcompactionState* sub_compact) {
while (status.ok() && !cfd->IsDropped() && c_iter->Valid()) {
// Invariant: c_iter.status() is guaranteed to be OK if c_iter->Valid()
// returns true.

assert(!end.has_value() || cfd->user_comparator()->Compare(
c_iter->user_key(), end.value()) < 0);

Expand Down Expand Up @@ -1458,7 +1465,8 @@ void CompactionJob::RecordDroppedKeys(

Status CompactionJob::FinishCompactionOutputFile(
const Status& input_status, SubcompactionState* sub_compact,
CompactionOutputs& outputs, const Slice& next_table_min_key) {
CompactionOutputs& outputs, const Slice& next_table_min_key,
const Slice* comp_start_user_key, const Slice* comp_end_user_key) {
AutoThreadOperationStageUpdater stage_updater(
ThreadStatus::STAGE_COMPACTION_SYNC_FILE);
assert(sub_compact != nullptr);
Expand Down Expand Up @@ -1488,12 +1496,10 @@ Status CompactionJob::FinishCompactionOutputFile(
// output_to_penultimate_level compaction here, as it's only used to decide
// if range dels could be dropped.
if (outputs.HasRangeDel()) {
s = outputs.AddRangeDels(
sub_compact->start.has_value() ? &(sub_compact->start.value())
: nullptr,
sub_compact->end.has_value() ? &(sub_compact->end.value()) : nullptr,
range_del_out_stats, bottommost_level_, cfd->internal_comparator(),
earliest_snapshot, next_table_min_key, full_history_ts_low_);
s = outputs.AddRangeDels(comp_start_user_key, comp_end_user_key,
range_del_out_stats, bottommost_level_,
cfd->internal_comparator(), earliest_snapshot,
next_table_min_key, full_history_ts_low_);
}
RecordDroppedKeys(range_del_out_stats, &sub_compact->compaction_job_stats);
TEST_SYNC_POINT("CompactionJob::FinishCompactionOutputFile1");
Expand Down
4 changes: 3 additions & 1 deletion db/compaction/compaction_job.h
Original file line number Diff line number Diff line change
Expand Up @@ -256,7 +256,9 @@ class CompactionJob {
Status FinishCompactionOutputFile(const Status& input_status,
SubcompactionState* sub_compact,
CompactionOutputs& outputs,
const Slice& next_table_min_key);
const Slice& next_table_min_key,
const Slice* comp_start_user_key,
const Slice* comp_end_user_key);
Status InstallCompactionResults(const MutableCFOptions& mutable_cf_options);
Status OpenCompactionOutputFile(SubcompactionState* sub_compact,
CompactionOutputs& outputs);
Expand Down
Loading

0 comments on commit 229297d

Please sign in to comment.