-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make mempurge a background process (equivalent to in-memory compaction). #8505
Make mempurge a background process (equivalent to in-memory compaction). #8505
Conversation
…w memtable as the new_mem becomes full.
…h. Make the mempurge happen like an in memory compaction. If potentially interesting add half-filled mempurged memtable back to imm memtable lsit.
…le, but add to imm without triggering flush.
… needs to be done for test 2 and range filters and iterators.
…remodeling test 3.
…gedelete iterotrs and compaction filters.
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
…ushed to storage. Add manual flush to DB close() function for extra safety.
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
… when mempurge is on.
…ndefined behavior.
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
db/db_flush_test.cc
Outdated
const size_t RAND_VALUES_LENGTH = 512; | ||
bool atLeastOneFlush = false; | ||
const size_t NUM_REPEAT = 1000; | ||
const size_t RAND_VALUES_LENGTH = 20480; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option to lower these values (and speed up the test) would be to change the memtable size (right now: uses the default 64MB size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are definitely on the slower end of the spectrum (all 3 tests take about 3 seconds each with a regular "make db_flush_test"). I can look at bringing them under 1sec each.
@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() { | |||
flush_scheduler_.Clear(); | |||
trim_history_scheduler_.Clear(); | |||
|
|||
// For now, simply trigger a manual flush at close time | |||
// on all the column families. | |||
if (immutable_db_options_.experimental_allow_mempurge) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, automagically flush all column families. In the future I can do a more fine-grained flushing by first checking if there is a need for flushing (but need to implement something else than imm()->IsFlushPending() because the output memtables added to imm() dont trigger flushes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge)
code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.
db/flush_job.cc
Outdated
// only if it filled at less than 10% capacity (arbitrary heuristic). | ||
if (new_mem->ApproximateMemoryUsage() < | ||
static_cast<size_t>( | ||
ceil(0.1 * mutable_cf_options_.write_buffer_size))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arbitrary metric: add to imm() only if memtable at less then 10% capacity. Reason for this: minimize Get/Read overheads that come from storing an extra Imm memtable, while giving us a chance to perform in memory compaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, could avoid floating point with
(mutable_cf_options_.write_buffer_size + 9) / 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice trick, I'll go ahead and edit that (even though long term there is a chance we need the flexibility a double would bring).
db/flush_job.cc
Outdated
m->NewRangeTombstoneIterator(ro, kMaxSequenceNumber); | ||
if (range_del_iter != nullptr) { | ||
range_del_iters.emplace_back(range_del_iter); | ||
if (!(m->GetMempurged())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScopedArenaIterator iter( | ||
NewMergingIterator(&cfd_->internal_comparator(), &memtables[0], | ||
NewMergingIterator(&cfd_->internal_comparator(), memtables.data(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use of vector.data() is preferable (allowed in C++11), because &memtables[0] leads to an undefined behavior if memtables is empty (I also added an if-statment, pure paranoia).
@@ -527,7 +528,8 @@ void MemTableList::Add(MemTable* m, autovector<MemTable*>* to_delete) { | |||
current_->Add(m, to_delete); | |||
m->MarkImmutable(); | |||
num_flush_not_started_++; | |||
if (num_flush_not_started_ == 1) { | |||
|
|||
if (num_flush_not_started_ > 0 && trigger_flush) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one thing that could potentially lead to issues, maybe? Before, they stored "flush_needed" as soon as num_flush_not_started reached exactly 1. Why didnt they use num_flush_not_started>0, which should have been strictly equivalent? Either way, here we still need to increment num_flush_not_started but we dont want to trigger flush, otherwise the flush will keep spinning.
db/flush_job.cc
Outdated
// number) needs to be present in the new memtable. | ||
new_mem->SetFirstSequenceNumber(new_first_seqno); | ||
purged_mems.push_back(new_mem); | ||
new_mem = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strategy is also probably something we want to discuss: should we create new memtables, or simply abort if we end up in this situation?
Basically: pros of creating new mems: we dont waste the time (possibly?) spent in mempurge.
Cons: spikes in memory usage by memtables.
@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() { | |||
flush_scheduler_.Clear(); | |||
trim_history_scheduler_.Clear(); | |||
|
|||
// For now, simply trigger a manual flush at close time | |||
// on all the column families. | |||
if (immutable_db_options_.experimental_allow_mempurge) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge)
code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.
db/db_flush_test.cc
Outdated
|
||
const uint32_t mempurge_count_record = mempurge_count; | ||
|
||
// Insertion of of K-V pairs, multiple times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps
// Insertion of K-V pairs, no overwrites
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
db/db_flush_test.cc
Outdated
const size_t RAND_VALUES_LENGTH = 512; | ||
bool atLeastOneFlush = false; | ||
const size_t NUM_REPEAT = 1000; | ||
const size_t RAND_VALUES_LENGTH = 20480; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).
db/flush_job.cc
Outdated
@@ -306,6 +317,272 @@ void FlushJob::Cancel() { | |||
base_->Unref(); | |||
} | |||
|
|||
Status FlushJob::MemPurge(autovector<MemTable*>& purged_mems) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google code style does not allow non-const reference parameters (so that you can tell at the call site without checking parameters types what might be modified).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated - thanks for the note!
db/flush_job.cc
Outdated
// Store the full output memtables in | ||
// autovector "purged_mems". | ||
// autovector<MemTable*> purged_mems = {}; | ||
purged_mems = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for an accumulator output parameter, you have essentially three options:
- assert it's already empty
- just add to what's already there
- blindly replace anything that might be there
The last of these is my least favorite because it seems inherently unsafe from a maintenance perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very true - I just updated the code with with the "assert it's already empty" option.
db/flush_job.cc
Outdated
@@ -297,6 +303,11 @@ Status FlushJob::Run(LogsWithPrepTracker* prep_tracker, | |||
<< (IOSTATS(cpu_read_nanos) - prev_cpu_read_nanos); | |||
} | |||
|
|||
// Clean up mempurge output memtables flushed to SST. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushed to SST? I am confused about what purged_mems
holds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, the name purged_mems
is visibly confusing. I've replaced it with "full_output_mems
". It contains the memtables resulting from the mempurge that are filled up to capacity and need to be flushed to storage.
db/flush_job.cc
Outdated
for (auto newm : purged_mems) { | ||
// Paranoia | ||
if (newm != nullptr) { | ||
delete newm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double delete between here and caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch good catch.
db/flush_job.cc
Outdated
// but write any full output table to level0. | ||
if (s.ok()) { | ||
TEST_SYNC_POINT("DBImpl::FlushJob:MemPurgeSuccessful"); | ||
for (MemTable* m : mems_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we sure that mems_ here is the same as above? Does the flush thread effectively take ownership of mems_? Can we make that an assertion, or simply not rely on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short answer: we are guaranteed that mems_ here is the same as above - no other flush thread can edit these memtables at the same time, and even additional operations like the DB::Close
wait for all the flushes to happen before interfering with any of this.
Long answer: mems_
is a variable that belongs to the FlushJob
object, and the FlushJob
object is created inside each flush thread (details at db_impl_compaction_flush.cc:154). mems_
simply contains the pointer addresses of the imm()
memtables.
mems_
is populated by the column family data object when the db_mutex is held (FlushJob::PickMemTable
). Upon populating mems_
, the cfd
object mark that these memtables are being flushed by a flush thread somewhere ("flush_in_progress_ = true
", memtable_list.cc:357), which guarantees that nothing happens to the memtables contained in the mems_
object while the flush job is handling them.
Therefore we can be sure that there is no race condition thanks to the db_mutex
, and since the memtables are marked with "flush_in_progress=true
", we know that these tables wont be edited while we are working on them in the flush process.
// If mempurge successful, don't write input tables to level0, | ||
// but write any full output table to level0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the track you're taking here is keeping some data in memory only to maximize the chances that it becomes garbage and never has to be flushed. My inclination is (per column family) to flush all or nothing, for the benefit of write amplification in compaction. When we create an L0 file, it should be as big as possible. This should also make it easier to purge obsolete WAL files in due course.
Doesn't necessarily have to be fixed now, but something to be noted for reconsideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, provided we're allowed to go beyond the size of a regular SST file (so that we dont accidentally create 2 sst files, one of them being full, the other one filled at xx% capacity). According to the builder.cc:BuildTable code, it looks like we would indeed create a single large SST file, which would be ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this comment, i actually thought that would lead to the creation of more than 1 sst file because i was convinced the SST file size was enforced. But after code inspection it looks like we would just write a single big SST file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SST file size limit is for breaking up sorted runs, primarily so that leveled compaction in L1 and above can operate on parts of sorted runs with reasonable granularity. L0 is special because (AFAIK) we assume each SST file is its own sorted run. :)
…apacity. Also, speed up unit test by decreasing memtable size. Add defautl value of mempurged_ to RollbackMemtablepurge.
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor suggestions / fixes remain. Otherwise looking good :)
// autovector<MemTable*> purged_mems = {}; | ||
purged_mems = {}; | ||
|
||
MemTable* new_mem = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use .release()
db/flush_job.cc
Outdated
Status mempurge_s = MemPurge(purged_mems); | ||
Status mempurge_s = MemPurge(); | ||
if (!mempurge_s.ok()) { | ||
ROCKS_LOG_INFO(db_options_.info_log, "Mempurge process unsuccessful."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include status details, e.g. from ToString(). I'm sure you can find an example to copy.
It seems like an Aborted status should just be INFO but any other status is maybe WARN.
db/flush_job.cc
Outdated
(cfd_->GetFlushReason() == FlushReason::kWriteBufferFull) && | ||
(!mems_.empty())) { | ||
Status mempurge_s = MemPurge(purged_mems); | ||
} | ||
// This will release and re-acquire the mutex. | ||
Status s = WriteLevel0Table(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A smart person just suggested this "purged" state might be converted to temporary state here in FlushJob::Run ;)
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@bjlemaire merged this pull request in 837705a. |
In #8454, I introduced a new process baptized
MemPurge
(memtable garbage collection). This new PR is built upon this past mempurge prototype.In this PR, I made the
mempurge
process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the
imm()
memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.MemPurge is activated by making the
experimental_allow_mempurge
flagtrue
. When activated, theMemPurge
process will always happen when the flush reason iskWriteBufferFull
.The 3 unit tests confirm that this process supports
Put
,Get
,Delete
,DeleteRange
operators and is compatible withIterators
andCompactionFilters
.