Make it possible to force the garbage collection of the oldest blob files (#8994)

Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation where few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space
amplification due to the lack of GC.

In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep the oldest batch of blob files alive if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
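
For reference, here is a minimal sketch of enabling the new behavior when
opening a database. The blob GC option names match the ones added in this
patch; the path, the `enable_blob_files` setting, and the specific values are
just illustrative placeholders:

#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Store large values in blob files and garbage collect them during
  // compaction.
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;

  // The oldest 25% of blob files are eligible for GC based on age...
  options.blob_garbage_collection_age_cutoff = 0.25;

  // ...and targeted compactions get scheduled once the oldest batch of blob
  // files is at least 90% garbage. (The default of 1.0 keeps the feature
  // inactive.)
  options.blob_garbage_collection_force_threshold = 0.9;

  rocksdb::DB* db = nullptr;
  const rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/blob_gc_example", &db);
  assert(s.ok());

  delete db;
  return 0;
}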

These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction, and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need for a "clean cut" boundary, but they never
include any files from any other level.)

This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
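
Since the new option is read via `MutableCFOptions` (see the `version_set.cc`
changes below), it should also be tunable at runtime through `DB::SetOptions()`
without reopening the database. A sketch, assuming the option's string key
matches its C++ field name; both the key and the value here are illustrative:

#include "rocksdb/db.h"

// Lower the force threshold on a live DB to activate the targeted
// compactions for the column family's oldest blob files.
rocksdb::Status ActivateForcedBlobGC(rocksdb::DB* db) {
  return db->SetOptions({{"blob_garbage_collection_force_threshold", "0.9"}});
}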

Pull Request resolved: facebook/rocksdb#8994

Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.

Reviewed By: riversand963

Differential Revision: D31489850

Pulled By: ltamasi

fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
ltamasi authored and facebook-github-bot committed Oct 12, 2021
1 parent a282eff commit 3e1bf77
Showing 25 changed files with 475 additions and 34 deletions.
9 changes: 9 additions & 0 deletions db/c.cc
@@ -2751,6 +2751,15 @@ double rocksdb_options_get_blob_gc_age_cutoff(rocksdb_options_t* opt) {
   return opt->rep.blob_garbage_collection_age_cutoff;
 }
 
+void rocksdb_options_set_blob_gc_force_threshold(rocksdb_options_t* opt,
+                                                 double val) {
+  opt->rep.blob_garbage_collection_force_threshold = val;
+}
+
+double rocksdb_options_get_blob_gc_force_threshold(rocksdb_options_t* opt) {
+  return opt->rep.blob_garbage_collection_force_threshold;
+}
+
 void rocksdb_options_set_num_levels(rocksdb_options_t* opt, int n) {
   opt->rep.num_levels = n;
 }
7 changes: 5 additions & 2 deletions db/c_test.c
@@ -1793,8 +1793,11 @@ int main(int argc, char** argv) {
   rocksdb_options_set_enable_blob_gc(o, 1);
   CheckCondition(1 == rocksdb_options_get_enable_blob_gc(o));
 
-  rocksdb_options_set_blob_gc_age_cutoff(o, 0.75);
-  CheckCondition(0.75 == rocksdb_options_get_blob_gc_age_cutoff(o));
+  rocksdb_options_set_blob_gc_age_cutoff(o, 0.5);
+  CheckCondition(0.5 == rocksdb_options_get_blob_gc_age_cutoff(o));
+
+  rocksdb_options_set_blob_gc_force_threshold(o, 0.75);
+  CheckCondition(0.75 == rocksdb_options_get_blob_gc_force_threshold(o));
 
   // Create a copy that should be equal to the original.
   rocksdb_options_t* copy;
19 changes: 13 additions & 6 deletions db/column_family.cc
@@ -1358,12 +1358,19 @@ Status ColumnFamilyData::ValidateOptions(
     }
   }
 
-  if (cf_options.enable_blob_garbage_collection &&
-      (cf_options.blob_garbage_collection_age_cutoff < 0.0 ||
-       cf_options.blob_garbage_collection_age_cutoff > 1.0)) {
-    return Status::InvalidArgument(
-        "The age cutoff for blob garbage collection should be in the range "
-        "[0.0, 1.0].");
+  if (cf_options.enable_blob_garbage_collection) {
+    if (cf_options.blob_garbage_collection_age_cutoff < 0.0 ||
+        cf_options.blob_garbage_collection_age_cutoff > 1.0) {
+      return Status::InvalidArgument(
+          "The age cutoff for blob garbage collection should be in the range "
+          "[0.0, 1.0].");
+    }
+    if (cf_options.blob_garbage_collection_force_threshold < 0.0 ||
+        cf_options.blob_garbage_collection_force_threshold > 1.0) {
+      return Status::InvalidArgument(
+          "The garbage ratio threshold for forcing blob garbage collection "
+          "should be in the range [0.0, 1.0].");
+    }
   }
 
   if (cf_options.compaction_style == kCompactionStyleFIFO &&
24 changes: 24 additions & 0 deletions db/column_family_test.cc
@@ -3407,6 +3407,30 @@ TEST(ColumnFamilyTest, ValidateBlobGCCutoff) {
                   .IsInvalidArgument());
 }
 
+TEST(ColumnFamilyTest, ValidateBlobGCForceThreshold) {
+  DBOptions db_options;
+
+  ColumnFamilyOptions cf_options;
+  cf_options.enable_blob_garbage_collection = true;
+
+  cf_options.blob_garbage_collection_force_threshold = -0.5;
+  ASSERT_TRUE(ColumnFamilyData::ValidateOptions(db_options, cf_options)
+                  .IsInvalidArgument());
+
+  cf_options.blob_garbage_collection_force_threshold = 0.0;
+  ASSERT_OK(ColumnFamilyData::ValidateOptions(db_options, cf_options));
+
+  cf_options.blob_garbage_collection_force_threshold = 0.5;
+  ASSERT_OK(ColumnFamilyData::ValidateOptions(db_options, cf_options));
+
+  cf_options.blob_garbage_collection_force_threshold = 1.0;
+  ASSERT_OK(ColumnFamilyData::ValidateOptions(db_options, cf_options));
+
+  cf_options.blob_garbage_collection_force_threshold = 1.5;
+  ASSERT_TRUE(ColumnFamilyData::ValidateOptions(db_options, cf_options)
+                  .IsInvalidArgument());
+}
+
 }  // namespace ROCKSDB_NAMESPACE
 
 #ifdef ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS
2 changes: 2 additions & 0 deletions db/compaction/compaction_job.cc
@@ -109,6 +109,8 @@ const char* GetCompactionReasonString(CompactionReason compaction_reason) {
       return "PeriodicCompaction";
     case CompactionReason::kChangeTemperature:
       return "ChangeTemperature";
+    case CompactionReason::kForcedBlobGC:
+      return "ForcedBlobGC";
     case CompactionReason::kNumOfReasons:
       // fall through
     default:
10 changes: 10 additions & 0 deletions db/compaction/compaction_picker_level.cc
@@ -31,6 +31,9 @@ bool LevelCompactionPicker::NeedsCompaction(
   if (!vstorage->FilesMarkedForCompaction().empty()) {
     return true;
   }
+  if (!vstorage->FilesMarkedForForcedBlobGC().empty()) {
+    return true;
+  }
   for (int i = 0; i <= vstorage->MaxInputLevel(); i++) {
     if (vstorage->CompactionScore(i) >= 1) {
       return true;
@@ -248,6 +251,13 @@ void LevelCompactionBuilder::SetupInitialFiles() {
     compaction_reason_ = CompactionReason::kPeriodicCompaction;
     return;
   }
+
+  // Forced blob garbage collection
+  PickFileToCompact(vstorage_->FilesMarkedForForcedBlobGC(), false);
+  if (!start_level_inputs_.empty()) {
+    compaction_reason_ = CompactionReason::kForcedBlobGC;
+    return;
+  }
 }
 
 bool LevelCompactionBuilder::SetupOtherL0FilesIfNeeded() {
109 changes: 109 additions & 0 deletions db/version_set.cc
@@ -2817,6 +2817,15 @@ void VersionStorageInfo::ComputeCompactionScore(
     ComputeFilesMarkedForPeriodicCompaction(
         immutable_options, mutable_cf_options.periodic_compaction_seconds);
   }
 
+  if (mutable_cf_options.enable_blob_garbage_collection &&
+      mutable_cf_options.blob_garbage_collection_age_cutoff > 0.0 &&
+      mutable_cf_options.blob_garbage_collection_force_threshold < 1.0) {
+    ComputeFilesMarkedForForcedBlobGC(
+        mutable_cf_options.blob_garbage_collection_age_cutoff,
+        mutable_cf_options.blob_garbage_collection_force_threshold);
+  }
+
   EstimateCompactionBytesNeeded(mutable_cf_options);
 }
@@ -2926,6 +2935,106 @@ void VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction(
   }
 }
 
+void VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC(
+    double blob_garbage_collection_age_cutoff,
+    double blob_garbage_collection_force_threshold) {
+  files_marked_for_forced_blob_gc_.clear();
+
+  if (blob_files_.empty()) {
+    return;
+  }
+
+  // Number of blob files eligible for GC based on age
+  const size_t cutoff_count = static_cast<size_t>(
+      blob_garbage_collection_age_cutoff * blob_files_.size());
+  if (!cutoff_count) {
+    return;
+  }
+
+  // Compute the sum of total and garbage bytes over the oldest batch of blob
+  // files. The oldest batch is defined as the set of blob files which are
+  // kept alive by the same SSTs as the very oldest one. Here is a toy example.
+  // Let's assume we have three SSTs 1, 2, and 3, and four blob files 10, 11,
+  // 12, and 13. Also, let's say SSTs 1 and 2 both rely on blob file 10 and
+  // potentially some higher-numbered ones, while SST 3 relies on blob file 12
+  // and potentially some higher-numbered ones. Then, the SST to oldest blob
+  // file mapping is as follows:
+  //
+  // SST file number    Oldest blob file number
+  // 1                  10
+  // 2                  10
+  // 3                  12
+  //
+  // This is what the same thing looks like from the blob files' POV. (Note that
+  // the linked SSTs simply denote the inverse mapping of the above.)
+  //
+  // Blob file number   Linked SST set
+  // 10                 {1, 2}
+  // 11                 {}
+  // 12                 {3}
+  // 13                 {}
+  //
+  // Then, the oldest batch of blob files consists of blob files 10 and 11,
+  // and we can get rid of them by forcing the compaction of SSTs 1 and 2.
+  //
+  // Note that the overall ratio of garbage computed for the batch has to exceed
+  // blob_garbage_collection_force_threshold and the entire batch has to be
+  // eligible for GC according to blob_garbage_collection_age_cutoff in order
+  // for us to schedule any compactions.
+  const auto oldest_it = blob_files_.begin();
+
+  const auto& oldest_meta = oldest_it->second;
+  assert(oldest_meta);
+
+  const auto& linked_ssts = oldest_meta->GetLinkedSsts();
+  assert(!linked_ssts.empty());
+
+  size_t count = 1;
+  uint64_t sum_total_blob_bytes = oldest_meta->GetTotalBlobBytes();
+  uint64_t sum_garbage_blob_bytes = oldest_meta->GetGarbageBlobBytes();
+
+  auto it = oldest_it;
+  for (++it; it != blob_files_.end(); ++it) {
+    const auto& meta = it->second;
+    assert(meta);
+
+    if (!meta->GetLinkedSsts().empty()) {
+      break;
+    }
+
+    if (++count > cutoff_count) {
+      return;
+    }
+
+    sum_total_blob_bytes += meta->GetTotalBlobBytes();
+    sum_garbage_blob_bytes += meta->GetGarbageBlobBytes();
+  }
+
+  if (sum_garbage_blob_bytes <
+      blob_garbage_collection_force_threshold * sum_total_blob_bytes) {
+    return;
+  }
+
+  for (uint64_t sst_file_number : linked_ssts) {
+    const FileLocation location = GetFileLocation(sst_file_number);
+    assert(location.IsValid());
+
+    const int level = location.GetLevel();
+    assert(level >= 0);
+
+    const size_t pos = location.GetPosition();
+
+    FileMetaData* const sst_meta = files_[level][pos];
+    assert(sst_meta);
+
+    if (sst_meta->being_compacted) {
+      continue;
+    }
+
+    files_marked_for_forced_blob_gc_.emplace_back(level, sst_meta);
+  }
+}
+
 namespace {
 
 // used to sort files by size
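
To make the marking logic above easier to follow, here is a standalone sketch
(not RocksDB code; the types and names are simplified stand-ins) of the batch
selection and threshold check performed by `ComputeFilesMarkedForForcedBlobGC`:

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the blob file metadata, ordered oldest first.
struct BlobFileInfo {
  uint64_t total_bytes = 0;
  uint64_t garbage_bytes = 0;
  bool has_linked_ssts = false;  // is this the oldest blob file of some SST?
};

// Returns true if the oldest batch of blob files qualifies for forced GC.
bool OldestBatchQualifiesForForcedGC(const std::vector<BlobFileInfo>& files,
                                     double age_cutoff,
                                     double force_threshold) {
  if (files.empty()) {
    return false;
  }

  // Number of blob files eligible for GC based on age.
  const size_t cutoff_count = static_cast<size_t>(age_cutoff * files.size());
  if (cutoff_count == 0) {
    return false;
  }

  // The batch starts with the very oldest blob file...
  size_t count = 1;
  uint64_t sum_total = files[0].total_bytes;
  uint64_t sum_garbage = files[0].garbage_bytes;

  // ...and extends up to (but not including) the next blob file that is
  // itself the oldest blob file of some SST.
  for (size_t i = 1; i < files.size(); ++i) {
    if (files[i].has_linked_ssts) {
      break;
    }
    if (++count > cutoff_count) {
      return false;  // part of the batch lies beyond the age cutoff
    }
    sum_total += files[i].total_bytes;
    sum_garbage += files[i].garbage_bytes;
  }

  // The batch qualifies if its overall garbage ratio meets the threshold.
  return sum_garbage >= force_threshold * sum_total;
}

In the toy example from the comment above, blob files 10 and 11 would form the
batch, so the check boils down to comparing the combined garbage ratio of those
two files against `blob_garbage_collection_force_threshold`.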
18 changes: 18 additions & 0 deletions db/version_set.h
@@ -184,6 +184,14 @@ class VersionStorageInfo {
   // REQUIRES: DB mutex held
   void ComputeBottommostFilesMarkedForCompaction();
 
+  // This computes files_marked_for_forced_blob_gc_ and is called by
+  // ComputeCompactionScore()
+  //
+  // REQUIRES: DB mutex held
+  void ComputeFilesMarkedForForcedBlobGC(
+      double blob_garbage_collection_age_cutoff,
+      double blob_garbage_collection_force_threshold);
+
   // Generate level_files_brief_ from files_
   void GenerateLevelFilesBrief();
   // Sort all files for this version based on their file size and
@@ -404,6 +412,14 @@
     return bottommost_files_marked_for_compaction_;
   }
 
+  // REQUIRES: This version has been saved (see VersionSet::SaveTo)
+  // REQUIRES: DB mutex held during access
+  const autovector<std::pair<int, FileMetaData*>>& FilesMarkedForForcedBlobGC()
+      const {
+    assert(finalized_);
+    return files_marked_for_forced_blob_gc_;
+  }
+
   int base_level() const { return base_level_; }
   double level_multiplier() const { return level_multiplier_; }
 
@@ -586,6 +602,8 @@
   autovector<std::pair<int, FileMetaData*>>
       bottommost_files_marked_for_compaction_;
 
+  autovector<std::pair<int, FileMetaData*>> files_marked_for_forced_blob_gc_;
+
   // Threshold for needing to mark another bottommost file. Maintain it so we
   // can quickly check when releasing a snapshot whether more bottommost files
   // became eligible for compaction. It's defined as the min of the max nonzero