Skip to content

Commit

Permalink
Logically strip timestamp during flush (#11557)
Browse files Browse the repository at this point in the history
Summary:
Logically strip the user-defined timestamp when L0 files are created during flush when `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes.

While working on this, I caught a missing piece on the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in `buffer` can sometimes be smaller than `last_key_size`, leading `std::min(buffer_size, last_key_size)` to truncate the `last_key`. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced.
```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000```
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec
                       PR  vs base:
Round 1: 350652 vs 349055
Round 2: 365733 vs 364308
Round 3: 355681 vs 354475

Pull Request resolved: #11557

Test Plan:
New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`:
`UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp
`UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false.

```
make all check
./db_wal_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
./block_based_table_reader_test
```

Reviewed By: pdillinger

Differential Revision: D47027664

Pulled By: jowlyzhang

fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5
  • Loading branch information
jowlyzhang authored and facebook-github-bot committed Jun 29, 2023
1 parent bfdc910 commit 15053f3
Show file tree
Hide file tree
Showing 10 changed files with 322 additions and 71 deletions.
33 changes: 27 additions & 6 deletions db/builder.cc
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@

#include "db/blob/blob_file_builder.h"
#include "db/compaction/compaction_iterator.h"
#include "db/dbformat.h"
#include "db/event_helpers.h"
#include "db/internal_stats.h"
#include "db/merge_helper.h"
Expand Down Expand Up @@ -205,21 +206,38 @@ Status BuildTable(
/*compaction=*/nullptr, compaction_filter.get(),
/*shutting_down=*/nullptr, db_options.info_log, full_history_ts_low);

const size_t ts_sz = ucmp->timestamp_size();
const bool strip_timestamp =
ts_sz > 0 && !ioptions.persist_user_defined_timestamps;

std::string key_after_flush_buf;
c_iter.SeekToFirst();
for (; c_iter.Valid(); c_iter.Next()) {
const Slice& key = c_iter.key();
const Slice& value = c_iter.value();
const ParsedInternalKey& ikey = c_iter.ikey();
// Generate a rolling 64-bit hash of the key and values
// Note :
// Here "key" integrates 'sequence_number'+'kType'+'user key'.
s = output_validator.Add(key, value);
Slice key_after_flush = key;
// If user defined timestamps will be stripped from user key after flush,
// the in memory version of the key act logically the same as one with a
// minimum timestamp. We update the timestamp here so file boundary and
// output validator, block builder all see the effect of the stripping.
if (strip_timestamp) {
key_after_flush_buf.clear();
ReplaceInternalKeyWithMinTimestamp(&key_after_flush_buf, key, ts_sz);
key_after_flush = key_after_flush_buf;
}

// Generate a rolling 64-bit hash of the key and values
// Note :
// Here "key" integrates 'sequence_number'+'kType'+'user key'.
s = output_validator.Add(key_after_flush, value);
if (!s.ok()) {
break;
}
builder->Add(key, value);
builder->Add(key_after_flush, value);

s = meta->UpdateBoundaries(key, value, ikey.sequence, ikey.type);
s = meta->UpdateBoundaries(key_after_flush, value, ikey.sequence,
ikey.type);
if (!s.ok()) {
break;
}
Expand All @@ -244,6 +262,7 @@ Status BuildTable(
range_del_it->Next()) {
auto tombstone = range_del_it->Tombstone();
auto kv = tombstone.Serialize();
// TODO(yuzhangyu): handle range deletion for UDT in memtables only.
builder->Add(kv.first.Encode(), kv.second);
InternalKey tombstone_end = tombstone.SerializeEndKey();
meta->UpdateBoundariesForRange(kv.first, tombstone_end, tombstone.seq_,
Expand Down Expand Up @@ -354,6 +373,8 @@ Status BuildTable(
s = *io_status;
}

// TODO(yuzhangyu): handle the key copy in the blob when ts should be
// stripped.
if (blob_file_builder) {
if (s.ok()) {
s = blob_file_builder->Finish();
Expand Down
52 changes: 51 additions & 1 deletion db/db_wal_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -311,7 +311,9 @@ TEST_F(DBWALTest, Recover) {
} while (ChangeWalOptions());
}

class DBWALTestWithTimestamp : public DBBasicTestWithTimestampBase {
class DBWALTestWithTimestamp
: public DBBasicTestWithTimestampBase,
public testing::WithParamInterface<test::UserDefinedTimestampTestMode> {
public:
DBWALTestWithTimestamp()
: DBBasicTestWithTimestampBase("db_wal_test_with_timestamp") {}
Expand Down Expand Up @@ -401,6 +403,54 @@ TEST_F(DBWALTestWithTimestamp, RecoverInconsistentTimestamp) {
ReopenColumnFamiliesWithTs({"pikachu"}, ts_options).IsInvalidArgument());
}

TEST_P(DBWALTestWithTimestamp, RecoverAndFlush) {
// Set up the option that enables user defined timestmp size.
std::string min_ts = Timestamp(0, 0);
std::string write_ts = Timestamp(1, 0);
const size_t kTimestampSize = write_ts.size();
TestComparator test_cmp(kTimestampSize);
Options ts_options;
ts_options.create_if_missing = true;
ts_options.comparator = &test_cmp;
bool persist_udt = test::ShouldPersistUDT(GetParam());
ts_options.persist_user_defined_timestamps = persist_udt;

std::string smallest_ukey_without_ts = "baz";
std::string largest_ukey_without_ts = "foo";

ASSERT_OK(CreateAndReopenWithCFWithTs({"pikachu"}, ts_options));
ASSERT_OK(Put(1, largest_ukey_without_ts, write_ts, "v1"));
ASSERT_OK(Put(1, smallest_ukey_without_ts, write_ts, "v5"));

// Very small write buffer size to force flush memtables recovered from WAL.
ts_options.write_buffer_size = 16;
ts_options.arena_block_size = 16;
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options));
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
static_cast<uint64_t>(1));

std::vector<std::vector<FileMetaData>> level_to_files;
dbfull()->TEST_GetFilesMetaData(handles_[1], &level_to_files);
ASSERT_GT(level_to_files.size(), 1);
// L0 only has one SST file.
ASSERT_EQ(level_to_files[0].size(), 1);
auto meta = level_to_files[0][0];
if (persist_udt) {
ASSERT_EQ(smallest_ukey_without_ts + write_ts, meta.smallest.user_key());
ASSERT_EQ(largest_ukey_without_ts + write_ts, meta.largest.user_key());
} else {
ASSERT_EQ(smallest_ukey_without_ts + min_ts, meta.smallest.user_key());
ASSERT_EQ(largest_ukey_without_ts + min_ts, meta.largest.user_key());
}
}

// Param 0: test mode for the user-defined timestamp feature
INSTANTIATE_TEST_CASE_P(
RecoverAndFlush, DBWALTestWithTimestamp,
::testing::Values(
test::UserDefinedTimestampTestMode::kStripUserDefinedTimestamp,
test::UserDefinedTimestampTestMode::kNormal));

TEST_F(DBWALTest, RecoverWithTableHandle) {
do {
Options options = CurrentOptions();
Expand Down
10 changes: 10 additions & 0 deletions db/dbformat.cc
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,16 @@ void StripTimestampFromInternalKey(std::string* result, const Slice& key,
kNumInternalBytes);
}

void ReplaceInternalKeyWithMinTimestamp(std::string* result, const Slice& key,
size_t ts_sz) {
const size_t key_sz = key.size();
assert(key_sz >= ts_sz + kNumInternalBytes);
result->reserve(key_sz);
result->append(key.data(), key_sz - kNumInternalBytes - ts_sz);
result->append(ts_sz, static_cast<unsigned char>(0));
result->append(key.data() + key_sz - kNumInternalBytes, kNumInternalBytes);
}

std::string ParsedInternalKey::DebugString(bool log_err_key, bool hex) const {
std::string result = "'";
if (log_err_key) {
Expand Down
51 changes: 28 additions & 23 deletions db/dbformat.h
Original file line number Diff line number Diff line change
Expand Up @@ -168,51 +168,57 @@ inline void UnPackSequenceAndType(uint64_t packed, uint64_t* seq,
EntryType GetEntryType(ValueType value_type);

// Append the serialization of "key" to *result.
extern void AppendInternalKey(std::string* result,
const ParsedInternalKey& key);
void AppendInternalKey(std::string* result, const ParsedInternalKey& key);

// Append the serialization of "key" to *result, replacing the original
// timestamp with argument ts.
extern void AppendInternalKeyWithDifferentTimestamp(
std::string* result, const ParsedInternalKey& key, const Slice& ts);
void AppendInternalKeyWithDifferentTimestamp(std::string* result,
const ParsedInternalKey& key,
const Slice& ts);

// Serialized internal key consists of user key followed by footer.
// This function appends the footer to *result, assuming that *result already
// contains the user key at the end.
extern void AppendInternalKeyFooter(std::string* result, SequenceNumber s,
ValueType t);
void AppendInternalKeyFooter(std::string* result, SequenceNumber s,
ValueType t);

// Append the key and a minimal timestamp to *result
extern void AppendKeyWithMinTimestamp(std::string* result, const Slice& key,
size_t ts_sz);
void AppendKeyWithMinTimestamp(std::string* result, const Slice& key,
size_t ts_sz);

// Append the key and a maximal timestamp to *result
extern void AppendKeyWithMaxTimestamp(std::string* result, const Slice& key,
size_t ts_sz);
void AppendKeyWithMaxTimestamp(std::string* result, const Slice& key,
size_t ts_sz);

// `key` is a user key with timestamp. Append the user key without timestamp
// and the maximal timestamp to *result.
extern void AppendUserKeyWithMaxTimestamp(std::string* result, const Slice& key,
size_t ts_sz);
void AppendUserKeyWithMaxTimestamp(std::string* result, const Slice& key,
size_t ts_sz);

// `key` is an internal key containing a user key without timestamp. Create a
// new key in *result by padding a min timestamp of size `ts_sz` to the user key
// and copying the remaining internal key bytes.
extern void PadInternalKeyWithMinTimestamp(std::string* result,
const Slice& key, size_t ts_sz);
void PadInternalKeyWithMinTimestamp(std::string* result, const Slice& key,
size_t ts_sz);

// `key` is an internal key containing a user key with timestamp of size
// `ts_sz`. Create a new internal key in *result by stripping the timestamp from
// the user key and copying the remaining internal key bytes.
extern void StripTimestampFromInternalKey(std::string* result, const Slice& key,
size_t ts_sz);
void StripTimestampFromInternalKey(std::string* result, const Slice& key,
size_t ts_sz);

// `key` is an internal key containing a user key with timestamp of size
// `ts_sz`. Create a new internal key in *result while replace the original
// timestamp with min timestamp.
void ReplaceInternalKeyWithMinTimestamp(std::string* result, const Slice& key,
size_t ts_sz);

// Attempt to parse an internal key from "internal_key". On success,
// stores the parsed data in "*result", and returns true.
//
// On error, returns false, leaves "*result" in an undefined state.
extern Status ParseInternalKey(const Slice& internal_key,
ParsedInternalKey* result, bool log_err_key);
Status ParseInternalKey(const Slice& internal_key, ParsedInternalKey* result,
bool log_err_key);

// Returns the user key portion of an internal key.
inline Slice ExtractUserKey(const Slice& internal_key) {
Expand Down Expand Up @@ -783,8 +789,7 @@ class InternalKeySliceTransform : public SliceTransform {
// Read the key of a record from a write batch.
// if this record represent the default column family then cf_record
// must be passed as false, otherwise it must be passed as true.
extern bool ReadKeyFromWriteBatchEntry(Slice* input, Slice* key,
bool cf_record);
bool ReadKeyFromWriteBatchEntry(Slice* input, Slice* key, bool cf_record);

// Read record from a write batch piece from input.
// tag, column_family, key, value and blob are return values. Callers own the
Expand All @@ -793,9 +798,9 @@ extern bool ReadKeyFromWriteBatchEntry(Slice* input, Slice* key,
// input will be advanced to after the record.
// If user-defined timestamp is enabled for a column family, then the `key`
// resulting from this call will include timestamp.
extern Status ReadRecordFromWriteBatch(Slice* input, char* tag,
uint32_t* column_family, Slice* key,
Slice* value, Slice* blob, Slice* xid);
Status ReadRecordFromWriteBatch(Slice* input, char* tag,
uint32_t* column_family, Slice* key,
Slice* value, Slice* blob, Slice* xid);

// When user call DeleteRange() to delete a range of keys,
// we will store a serialized RangeTombstone in MemTable and SST.
Expand Down
21 changes: 21 additions & 0 deletions db/dbformat_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -312,6 +312,27 @@ TEST_F(FormatTest, StripTimestampFromInternalKey) {
ASSERT_EQ(kTypeValue, key_without_timestamp.type);
}

TEST_F(FormatTest, ReplaceInternalKeyWithMinTimestamp) {
std::string orig_user_key = "foo";
size_t ts_sz = 8;
orig_user_key.append(ts_sz, static_cast<unsigned char>(1));
std::string orig_internal_key = IKey(orig_user_key, 100, kTypeValue);

std::string key_buf;
ReplaceInternalKeyWithMinTimestamp(&key_buf, orig_internal_key, ts_sz);
ParsedInternalKey new_key;
Slice in(key_buf);
ASSERT_OK(ParseInternalKey(in, &new_key, true /*log_err_key*/));

std::string min_timestamp(ts_sz, static_cast<unsigned char>(0));
size_t ukey_diff_offset = new_key.user_key.difference_offset(orig_user_key);
ASSERT_EQ(min_timestamp,
Slice(new_key.user_key.data() + ukey_diff_offset, ts_sz));
ASSERT_EQ(orig_user_key.size(), new_key.user_key.size());
ASSERT_EQ(100, new_key.sequence);
ASSERT_EQ(kTypeValue, new_key.type);
}

} // namespace ROCKSDB_NAMESPACE

int main(int argc, char** argv) {
Expand Down
Loading

0 comments on commit 15053f3

Please sign in to comment.