Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail part of an SST file (i.e., the blocks after the data blocks through the end of the file) during each SST file open, in the hope of prefetching everything needed for later reads (e.g., footer, metaindex, filter/index blocks) at once ahead of time. The existing approach estimates the tail size to prefetch through the `TailPrefetchStats` heuristics introduced in #4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetch use a smaller size than it should) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST file instead of relying on heuristics.
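
To illustrate the failure mode, here is a minimal sketch (a hypothetical simplification, not the actual `TailPrefetchStats` implementation): all table opens under one `BlockBasedTableFactory` share the stats object, so one unlucky small observation early on can cap what a concurrent open prefetches.
```
// Sketch only: a shared, history-based tail-size estimator. The name
// SharedTailStats and the max-of-history rule are illustrative; the
// real TailPrefetchStats logic differs in detail.
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

class SharedTailStats {
 public:
  void RecordEffectiveSize(size_t observed_tail_bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    history_.push_back(observed_tail_bytes);
  }

  // Suggest based on what has been observed so far. Early in the process
  // lifetime, when the history holds only thread 1's unluckily small
  // read, thread 2's table open prefetches less of its tail than it
  // should, and the cause is hard to spot after the fact.
  size_t GetSuggestedPrefetchSize() const {
    std::lock_guard<std::mutex> lock(mu_);
    if (history_.empty()) {
      return 4 * 1024;  // conservative default when nothing is recorded
    }
    return *std::max_element(history_.begin(), history_.end());
  }

 private:
  mutable std::mutex mu_;
  std::vector<size_t> history_;
};
```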

**Summary:**
- Obtain the tail size in `BlockBasedTableBuilder::Finish()` and record it in the manifest
   - For backward compatibility, we fall back to `TailPrefetchStats`, and last to a simple heuristic that the tail size is a linear portion of the file size - see the PR conversation for more.
- Make `tail_start_offset` part of the table properties and deduce from it the tail size to record in the manifest for external files (e.g., file ingestion, import CF) and DB repair (which have no access to the original manifest); a simplified sketch of this logic is shown below.
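
A minimal sketch of how these pieces fit together (function names and the last-resort fraction are illustrative assumptions, not the exact RocksDB code):
```
#include <cstdint>

// A recorded tail_size of 0 means "unknown" (e.g., a file written pre-PR).
uint64_t DecideTailPrefetchSize(uint64_t recorded_tail_size,
                                uint64_t stats_suggested_size,
                                uint64_t file_size) {
  if (recorded_tail_size > 0) {
    return recorded_tail_size;  // exact size recorded in the manifest
  }
  if (stats_suggested_size > 0) {
    return stats_suggested_size;  // fall back to TailPrefetchStats
  }
  return file_size / 8;  // last resort: a linear portion of the file size
}

// For ingested/imported files and repair, the tail size is deduced from
// the tail_start_offset table property (see the diffs below):
uint64_t DeduceTailSize(uint64_t file_size, uint64_t tail_start_offset,
                        bool contain_no_data_blocks) {
  if (tail_start_offset > 0 || contain_no_data_blocks) {
    return file_size - tail_start_offset;
  }
  return 0;  // unknown; readers fall back to heuristics
}
```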

Pull Request resolved: #11406

Test Plan:
1. New UT
2. db bench
Note: db_bench on /tmp/, where direct read is supported, is too slow to finish, and the default pinning setting in db_bench is not helpful for profiling the number of SST reads per Get. Therefore I applied the following hack to obtain the comparison below.
```
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, this PR increases the tail prefetch hit count (3395 → 14257) and decreases both the SST read count (999905 → 998547) and read latency at every reported percentile.

3. Test backward compatibility by stepping through reading with post-PR code on a DB generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
hx235 authored and facebook-github-bot committed May 8, 2023
1 parent e1d1c50 commit 8f763bd
Showing 39 changed files with 466 additions and 314 deletions.
3 changes: 3 additions & 0 deletions HISTORY.md
@@ -10,6 +10,9 @@
### Bug Fixes
* Delete an empty WAL file on DB open if the log number is less than the min log number to keep

+### Performance Improvements
+* Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch (#11406).
+
## 8.2.0 (04/24/2023)
### Public API Changes
* `SstFileWriter::DeleteRange()` now returns `Status::InvalidArgument` if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.
1 change: 1 addition & 0 deletions db/builder.cc
@@ -291,6 +291,7 @@ Status BuildTable(
if (s.ok() && !empty) {
uint64_t file_size = builder->FileSize();
meta->fd.file_size = file_size;
+meta->tail_size = builder->GetTailSize();
meta->marked_for_compaction = builder->NeedCompact();
assert(meta->fd.GetFileSize() > 0);
tp = builder
2 changes: 1 addition & 1 deletion db/compaction/compaction_job_test.cc
@@ -386,7 +386,7 @@ class CompactionJobTestBase : public testing::Test {
kUnknownFileCreationTime,
versions_->GetColumnFamilySet()->GetDefault()->NewEpochNumber(),
kUnknownFileChecksum, kUnknownFileChecksumFuncName, kNullUniqueId64x2,
-0);
+0, 0);

mutex_.Lock();
EXPECT_OK(versions_->LogAndApply(
1 change: 1 addition & 0 deletions db/compaction/compaction_outputs.cc
@@ -43,6 +43,7 @@ Status CompactionOutputs::Finish(const Status& intput_status,
const uint64_t current_bytes = builder_->FileSize();
if (s.ok()) {
meta->fd.file_size = current_bytes;
+meta->tail_size = builder_->GetTailSize();
meta->marked_for_compaction = builder_->NeedCompact();
}
current_output().finished = true;
2 changes: 1 addition & 1 deletion db/compaction/compaction_picker_test.cc
@@ -148,7 +148,7 @@ class CompactionPickerTestBase : public testing::Test {
smallest_seq, largest_seq, marked_for_compact, temperature,
kInvalidBlobFileNumber, kUnknownOldestAncesterTime,
kUnknownFileCreationTime, epoch_number, kUnknownFileChecksum,
-kUnknownFileChecksumFuncName, kNullUniqueId64x2, 0);
+kUnknownFileChecksumFuncName, kNullUniqueId64x2, 0, 0);
f->compensated_file_size =
(compensated_file_size != 0) ? compensated_file_size : file_size;
f->oldest_ancester_time = oldest_ancestor_time;
4 changes: 2 additions & 2 deletions db/db_impl/db_impl_compaction_flush.cc
@@ -1758,7 +1758,7 @@ Status DBImpl::ReFitLevel(ColumnFamilyData* cfd, int level, int target_level) {
f->marked_for_compaction, f->temperature, f->oldest_blob_file_number,
f->oldest_ancester_time, f->file_creation_time, f->epoch_number,
f->file_checksum, f->file_checksum_func_name, f->unique_id,
-f->compensated_range_deletion_size);
+f->compensated_range_deletion_size, f->tail_size);
}
ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
"[%s] Apply version edit:\n%s", cfd->GetName().c_str(),
@@ -3457,7 +3457,7 @@ Status DBImpl::BackgroundCompaction(bool* made_progress,
f->oldest_blob_file_number, f->oldest_ancester_time,
f->file_creation_time, f->epoch_number, f->file_checksum,
f->file_checksum_func_name, f->unique_id,
-f->compensated_range_deletion_size);
+f->compensated_range_deletion_size, f->tail_size);

ROCKS_LOG_BUFFER(
log_buffer,
2 changes: 1 addition & 1 deletion db/db_impl/db_impl_experimental.cc
@@ -138,7 +138,7 @@ Status DBImpl::PromoteL0(ColumnFamilyHandle* column_family, int target_level) {
f->oldest_blob_file_number, f->oldest_ancester_time,
f->file_creation_time, f->epoch_number, f->file_checksum,
f->file_checksum_func_name, f->unique_id,
-f->compensated_range_deletion_size);
+f->compensated_range_deletion_size, f->tail_size);
}

status = versions_->LogAndApply(cfd, *cfd->GetLatestMutableCFOptions(),
19 changes: 10 additions & 9 deletions db/db_impl/db_impl_open.cc
@@ -616,7 +616,8 @@ Status DBImpl::Recover(
f->oldest_blob_file_number, f->oldest_ancester_time,
f->file_creation_time, f->epoch_number,
f->file_checksum, f->file_checksum_func_name,
-f->unique_id, f->compensated_range_deletion_size);
+f->unique_id, f->compensated_range_deletion_size,
+f->tail_size);
ROCKS_LOG_WARN(immutable_db_options_.info_log,
"[%s] Moving #%" PRIu64
" from from_level-%d to from_level-%d %" PRIu64
@@ -1673,14 +1674,14 @@ Status DBImpl::WriteLevel0TableForRecovery(int job_id, ColumnFamilyData* cfd,
constexpr int level = 0;

if (s.ok() && has_output) {
-edit->AddFile(level, meta.fd.GetNumber(), meta.fd.GetPathId(),
-meta.fd.GetFileSize(), meta.smallest, meta.largest,
-meta.fd.smallest_seqno, meta.fd.largest_seqno,
-meta.marked_for_compaction, meta.temperature,
-meta.oldest_blob_file_number, meta.oldest_ancester_time,
-meta.file_creation_time, meta.epoch_number,
-meta.file_checksum, meta.file_checksum_func_name,
-meta.unique_id, meta.compensated_range_deletion_size);
+edit->AddFile(
+level, meta.fd.GetNumber(), meta.fd.GetPathId(), meta.fd.GetFileSize(),
+meta.smallest, meta.largest, meta.fd.smallest_seqno,
+meta.fd.largest_seqno, meta.marked_for_compaction, meta.temperature,
+meta.oldest_blob_file_number, meta.oldest_ancester_time,
+meta.file_creation_time, meta.epoch_number, meta.file_checksum,
+meta.file_checksum_func_name, meta.unique_id,
+meta.compensated_range_deletion_size, meta.tail_size);

for (const auto& blob : blob_file_additions) {
edit->AddBlobFile(blob);
82 changes: 0 additions & 82 deletions db/db_test2.cc
@@ -5298,88 +5298,6 @@ TEST_F(DBTest2, DISABLED_IteratorPinnedMemory) {
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
}

-TEST_F(DBTest2, TestBBTTailPrefetch) {
-std::atomic<bool> called(false);
-size_t expected_lower_bound = 512 * 1024;
-size_t expected_higher_bound = 512 * 1024;
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
-"BlockBasedTable::Open::TailPrefetchLen", [&](void* arg) {
-size_t* prefetch_size = static_cast<size_t*>(arg);
-EXPECT_LE(expected_lower_bound, *prefetch_size);
-EXPECT_GE(expected_higher_bound, *prefetch_size);
-called = true;
-});
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
-
-ASSERT_OK(Put("1", "1"));
-ASSERT_OK(Put("9", "1"));
-ASSERT_OK(Flush());
-
-expected_lower_bound = 0;
-expected_higher_bound = 8 * 1024;
-
-ASSERT_OK(Put("1", "1"));
-ASSERT_OK(Put("9", "1"));
-ASSERT_OK(Flush());
-
-ASSERT_OK(Put("1", "1"));
-ASSERT_OK(Put("9", "1"));
-ASSERT_OK(Flush());
-
-// Full compaction to make sure there is no L0 file after the open.
-ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
-
-ASSERT_TRUE(called.load());
-called = false;
-
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-
-std::atomic<bool> first_call(true);
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
-"BlockBasedTable::Open::TailPrefetchLen", [&](void* arg) {
-size_t* prefetch_size = static_cast<size_t*>(arg);
-if (first_call) {
-EXPECT_EQ(4 * 1024, *prefetch_size);
-first_call = false;
-} else {
-EXPECT_GE(4 * 1024, *prefetch_size);
-}
-called = true;
-});
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
-
-Options options = CurrentOptions();
-options.max_file_opening_threads = 1; // one thread
-BlockBasedTableOptions table_options;
-table_options.cache_index_and_filter_blocks = true;
-options.table_factory.reset(NewBlockBasedTableFactory(table_options));
-options.max_open_files = -1;
-Reopen(options);
-
-ASSERT_OK(Put("1", "1"));
-ASSERT_OK(Put("9", "1"));
-ASSERT_OK(Flush());
-
-ASSERT_OK(Put("1", "1"));
-ASSERT_OK(Put("9", "1"));
-ASSERT_OK(Flush());
-
-ASSERT_TRUE(called.load());
-called = false;
-
-// Parallel loading SST files
-options.max_file_opening_threads = 16;
-Reopen(options);
-
-ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
-
-ASSERT_TRUE(called.load());
-
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
-ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-}

TEST_F(DBTest2, TestGetColumnFamilyHandleUnlocked) {
// Setup sync point dependency to reproduce the race condition of
// DBImpl::GetColumnFamilyHandleUnlocked
2 changes: 1 addition & 1 deletion db/experimental.cc
@@ -102,7 +102,7 @@ Status UpdateManifestForFilesState(
lf->oldest_blob_file_number, lf->oldest_ancester_time,
lf->file_creation_time, lf->epoch_number, lf->file_checksum,
lf->file_checksum_func_name, lf->unique_id,
-lf->compensated_range_deletion_size);
+lf->compensated_range_deletion_size, lf->tail_size);
}
}
} else {
12 changes: 11 additions & 1 deletion db/external_sst_file_ingestion_job.cc
@@ -464,6 +464,16 @@ Status ExternalSstFileIngestionJob::Run() {
current_time = oldest_ancester_time =
static_cast<uint64_t>(temp_current_time);
}
+uint64_t tail_size = 0;
+bool contain_no_data_blocks = f.table_properties.num_entries > 0 &&
+(f.table_properties.num_entries ==
+f.table_properties.num_range_deletions);
+if (f.table_properties.tail_start_offset > 0 || contain_no_data_blocks) {
+uint64_t file_size = f.fd.GetFileSize();
+assert(f.table_properties.tail_start_offset <= file_size);
+tail_size = file_size - f.table_properties.tail_start_offset;
+}
+
FileMetaData f_metadata(
f.fd.GetNumber(), f.fd.GetPathId(), f.fd.GetFileSize(),
f.smallest_internal_key, f.largest_internal_key, f.assigned_seqno,
@@ -472,7 +482,7 @@
ingestion_options_.ingest_behind
? kReservedEpochNumberForFileIngestedBehind
: cfd_->NewEpochNumber(),
-f.file_checksum, f.file_checksum_func_name, f.unique_id, 0);
+f.file_checksum, f.file_checksum_func_name, f.unique_id, 0, tail_size);
f_metadata.temperature = f.file_temperature;
edit_.AddFile(f.picked_level, f_metadata);
}
3 changes: 2 additions & 1 deletion db/flush_job.cc
@@ -1007,7 +1007,8 @@ Status FlushJob::WriteLevel0Table() {
meta_.oldest_blob_file_number, meta_.oldest_ancester_time,
meta_.file_creation_time, meta_.epoch_number,
meta_.file_checksum, meta_.file_checksum_func_name,
-meta_.unique_id, meta_.compensated_range_deletion_size);
+meta_.unique_id, meta_.compensated_range_deletion_size,
+meta_.tail_size);
edit_->SetBlobFileAdditions(std::move(blob_file_additions));
}
// Piggyback FlushJobInfo on the first first flushed memtable.
12 changes: 11 additions & 1 deletion db/import_column_family_job.cc
@@ -138,14 +138,24 @@ Status ImportColumnFamilyJob::Run() {
const auto& f = files_to_import_[i];
const auto& file_metadata = metadata_[i];

+uint64_t tail_size = 0;
+bool contain_no_data_blocks = f.table_properties.num_entries > 0 &&
+(f.table_properties.num_entries ==
+f.table_properties.num_range_deletions);
+if (f.table_properties.tail_start_offset > 0 || contain_no_data_blocks) {
+uint64_t file_size = f.fd.GetFileSize();
+assert(f.table_properties.tail_start_offset <= file_size);
+tail_size = file_size - f.table_properties.tail_start_offset;
+}
+
VersionEdit dummy_version_edit;
dummy_version_edit.AddFile(
file_metadata.level, f.fd.GetNumber(), f.fd.GetPathId(),
f.fd.GetFileSize(), f.smallest_internal_key, f.largest_internal_key,
file_metadata.smallest_seqno, file_metadata.largest_seqno, false,
file_metadata.temperature, kInvalidBlobFileNumber, oldest_ancester_time,
current_time, file_metadata.epoch_number, kUnknownFileChecksum,
-kUnknownFileChecksumFuncName, f.unique_id, 0);
+kUnknownFileChecksumFuncName, f.unique_id, 0, tail_size);
s = dummy_version_builder.Apply(&dummy_version_edit);
}
if (s.ok()) {
13 changes: 12 additions & 1 deletion db/repair.cc
@@ -551,6 +551,17 @@
}
t->meta.oldest_ancester_time = props->creation_time;
}
+if (status.ok()) {
+uint64_t tail_size = 0;
+bool contain_no_data_blocks =
+props->num_entries > 0 &&
+(props->num_entries == props->num_range_deletions);
+if (props->tail_start_offset > 0 || contain_no_data_blocks) {
+assert(props->tail_start_offset <= file_size);
+tail_size = file_size - props->tail_start_offset;
+}
+t->meta.tail_size = tail_size;
+}
ColumnFamilyData* cfd = nullptr;
if (status.ok()) {
cfd = vset_.GetColumnFamilySet()->GetColumnFamily(t->column_family_id);
@@ -682,7 +693,7 @@
table->meta.oldest_ancester_time, table->meta.file_creation_time,
table->meta.epoch_number, table->meta.file_checksum,
table->meta.file_checksum_func_name, table->meta.unique_id,
-table->meta.compensated_range_deletion_size);
+table->meta.compensated_range_deletion_size, table->meta.tail_size);
}
s = dummy_version_builder.Apply(&dummy_edit);
if (s.ok()) {
14 changes: 7 additions & 7 deletions db/table_cache.cc
@@ -140,13 +140,13 @@ Status TableCache::GetTableReader(
}
s = ioptions_.table_factory->NewTableReader(
ro,
-TableReaderOptions(ioptions_, prefix_extractor, file_options,
-internal_comparator, block_protection_bytes_per_key,
-skip_filters, immortal_tables_,
-false /* force_direct_prefetch */, level,
-block_cache_tracer_, max_file_size_for_l0_meta_pin,
-db_session_id_, file_meta.fd.GetNumber(),
-expected_unique_id, file_meta.fd.largest_seqno),
+TableReaderOptions(
+ioptions_, prefix_extractor, file_options, internal_comparator,
+block_protection_bytes_per_key, skip_filters, immortal_tables_,
+false /* force_direct_prefetch */, level, block_cache_tracer_,
+max_file_size_for_l0_meta_pin, db_session_id_,
+file_meta.fd.GetNumber(), expected_unique_id,
+file_meta.fd.largest_seqno, file_meta.tail_size),
std::move(file_reader), file_meta.fd.GetFileSize(), table_reader,
prefetch_index_and_filter_in_cache);
TEST_SYNC_POINT("TableCache::GetTableReader:0");