Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logically strip timestamp during flush #11557

Closed
wants to merge 3 commits into from

Conversation

jowlyzhang
Copy link
Contributor

@jowlyzhang jowlyzhang commented Jun 23, 2023

Logically strip the user-defined timestamp when L0 files are created during flush when AdvancedColumnFamilyOptions.persist_user_defined_timestamps is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes.

While working on this, I caught a missing piece on the BlockBuilder level for this feature. The current quick path std::min(buffer_size, last_key_size) needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, buffer is empty and buffer_size is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in buffer can sometimes be smaller than last_key_size, leading std::min(buffer_size, last_key_size) to truncate the last_key. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial std::min operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced.
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec
PR vs base:
Round 1: 350652 vs 349055
Round 2: 365733 vs 364308
Round 3: 355681 vs 354475

Test Plan:
New timestamp specific test added or existing tests augmented, both are parameterized with UserDefinedTimestampTestMode:
UserDefinedTimestampTestMode::kNormal -> UDT feature enabled, write / read with min timestamp
UserDefinedTimestampTestMode::kStripUserDefinedTimestamps -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false.

make all check
./db_wal_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
./block_based_table_reader_test

@jowlyzhang jowlyzhang marked this pull request as draft June 24, 2023 00:01
@jowlyzhang jowlyzhang marked this pull request as ready for review June 26, 2023 18:12
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang jowlyzhang requested a review from pdillinger June 26, 2023 18:12
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that some places talk about stripping or removing the timestamps while some talk about replacing with minimum timestamp. I might expect that if I don't persist timestamps, I could re-open the DB without timestamps configured at all, but I don't think that is true. I think we need to be more clear and careful about describing what the persist_user_defined_timestamps option does.

db/builder.cc Outdated Show resolved Hide resolved
db/builder.cc Outdated Show resolved Hide resolved
db/dbformat.h Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang
Copy link
Contributor Author

I find it confusing that some places talk about stripping or removing the timestamps while some talk about replacing with minimum timestamp. I might expect that if I don't persist timestamps, I could re-open the DB without timestamps configured at all, but I don't think that is true. I think we need to be more clear and careful about describing what the persist_user_defined_timestamps option does.

Thank you for this suggestion! I took a look at the existing comment for persist_user_defined_timestamps flag and realized it indeed needed more clarification. I added some comment there.

Specifically on this flush path, what not persist_user_defined_timestamps does is indeed removing the timestamps and do not persist them. The reason there are some places that logically strip (replace with min timestamp), and some places that actually strip (or remove) is because there are still some key comparisons happening at different levels during flush, for example, in BlockBasedTableBuilder, IndexBuilder etc. To ensure the user keys still have a compatible format as the timestamp aware user comparator, the user-defined timestamp part of the user key is not physically removed until the very last minute before it's building into a block in BlockBuilder, and before that, it's only logically removed.

Because these SST files are created without any user-defined timestamps, users indeed can re-open the DB without timestamps configured at all. At that time, they provide a user comparator that has timestamp_size set to 0, and the user keys in these SST files are instantly compatible with that format.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang jowlyzhang requested a review from pdillinger June 29, 2023 17:07
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang jowlyzhang requested a review from pdillinger June 29, 2023 17:27
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the described hole in testing. Could be fixed in this PR or another.

@pdillinger
Copy link
Contributor

Don't forget to re-import :)

@jowlyzhang
Copy link
Contributor Author

LGTM except for the described hole in testing. Could be fixed in this PR or another.

Thank you for your review! I will add that test coverage in a follow up.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang merged this pull request in 15053f3.

facebook-github-bot pushed a commit that referenced this pull request Jul 7, 2023
Summary:
Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover` that is intended to test recovery from WAL including user-defined timestamps doesn't achieve its promised coverage. Specifically, after #11557, timestamps will be removed during flush, and RocksDB by default flush memtables during recovery with `avoid_flush_during_recovery` defaults to false.  This test didn't fail even if all the timestamps are quickly lost due to the default flush behavior.

This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification.

On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is DB's default behavior. Setting the flags `max_write_buffer` and `arena_block_size` are not really the factors that enforces the flush, so these flags are removed.

Pull Request resolved: #11577

Test Plan: ./db_wal_test

Reviewed By: pdillinger

Differential Revision: D47142892

Pulled By: jowlyzhang

fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d
@jowlyzhang jowlyzhang deleted the flush_strip branch July 11, 2023 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants