Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make mempurge a background process (equivalent to in-memory compaction). #8505

Closed

Conversation

bjlemaire
Copy link
Contributor

@bjlemaire bjlemaire commented Jul 8, 2021

In #8454, I introduced a new process baptized MemPurge (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the mempurge process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the imm() memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the experimental_allow_mempurge flag true. When activated, the MemPurge process will always happen when the flush reason is kWriteBufferFull.
The 3 unit tests confirm that this process supports Put, Get, Delete, DeleteRange operators and is compatible with Iterators and CompactionFilters.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

…ushed to storage. Add manual flush to DB close() function for extra safety.
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

const size_t RAND_VALUES_LENGTH = 512;
bool atLeastOneFlush = false;
const size_t NUM_REPEAT = 1000;
const size_t RAND_VALUES_LENGTH = 20480;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option to lower these values (and speed up the test) would be to change the memtable size (right now: uses the default 64MB size)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are definitely on the slower end of the spectrum (all 3 tests take about 3 seconds each with a regular "make db_flush_test"). I can look at bringing them under 1sec each.

@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() {
flush_scheduler_.Clear();
trim_history_scheduler_.Clear();

// For now, simply trigger a manual flush at close time
// on all the column families.
if (immutable_db_options_.experimental_allow_mempurge) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, automagically flush all column families. In the future I can do a more fine-grained flushing by first checking if there is a need for flushing (but need to implement something else than imm()->IsFlushPending() because the output memtables added to imm() dont trigger flushes).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge) code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.

db/flush_job.cc Outdated
// only if it filled at less than 10% capacity (arbitrary heuristic).
if (new_mem->ApproximateMemoryUsage() <
static_cast<size_t>(
ceil(0.1 * mutable_cf_options_.write_buffer_size))) {
Copy link
Contributor Author

@bjlemaire bjlemaire Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arbitrary metric: add to imm() only if memtable at less then 10% capacity. Reason for this: minimize Get/Read overheads that come from storing an extra Imm memtable, while giving us a chance to perform in memory compaction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, could avoid floating point with

(mutable_cf_options_.write_buffer_size + 9) / 10

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick, I'll go ahead and edit that (even though long term there is a chance we need the flexibility a double would bring).

db/flush_job.cc Outdated Show resolved Hide resolved
db/flush_job.cc Outdated
m->NewRangeTombstoneIterator(ro, kMaxSequenceNumber);
if (range_del_iter != nullptr) {
range_del_iters.emplace_back(range_del_iter);
if (!(m->GetMempurged())) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScopedArenaIterator iter(
NewMergingIterator(&cfd_->internal_comparator(), &memtables[0],
NewMergingIterator(&cfd_->internal_comparator(), memtables.data(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of vector.data() is preferable (allowed in C++11), because &memtables[0] leads to an undefined behavior if memtables is empty (I also added an if-statment, pure paranoia).

@@ -527,7 +528,8 @@ void MemTableList::Add(MemTable* m, autovector<MemTable*>* to_delete) {
current_->Add(m, to_delete);
m->MarkImmutable();
num_flush_not_started_++;
if (num_flush_not_started_ == 1) {

if (num_flush_not_started_ > 0 && trigger_flush) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one thing that could potentially lead to issues, maybe? Before, they stored "flush_needed" as soon as num_flush_not_started reached exactly 1. Why didnt they use num_flush_not_started>0, which should have been strictly equivalent? Either way, here we still need to increment num_flush_not_started but we dont want to trigger flush, otherwise the flush will keep spinning.

db/flush_job.cc Outdated
// number) needs to be present in the new memtable.
new_mem->SetFirstSequenceNumber(new_first_seqno);
purged_mems.push_back(new_mem);
new_mem =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strategy is also probably something we want to discuss: should we create new memtables, or simply abort if we end up in this situation?
Basically: pros of creating new mems: we dont waste the time (possibly?) spent in mempurge.
Cons: spikes in memory usage by memtables.

@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() {
flush_scheduler_.Clear();
trim_history_scheduler_.Clear();

// For now, simply trigger a manual flush at close time
// on all the column families.
if (immutable_db_options_.experimental_allow_mempurge) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge) code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.


const uint32_t mempurge_count_record = mempurge_count;

// Insertion of of K-V pairs, multiple times.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps

// Insertion of K-V pairs, no overwrites

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

const size_t RAND_VALUES_LENGTH = 512;
bool atLeastOneFlush = false;
const size_t NUM_REPEAT = 1000;
const size_t RAND_VALUES_LENGTH = 20480;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).

db/db_flush_test.cc Outdated Show resolved Hide resolved
db/flush_job.cc Outdated
@@ -306,6 +317,272 @@ void FlushJob::Cancel() {
base_->Unref();
}

Status FlushJob::MemPurge(autovector<MemTable*>& purged_mems) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google code style does not allow non-const reference parameters (so that you can tell at the call site without checking parameters types what might be modified).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated - thanks for the note!

db/flush_job.cc Outdated
// Store the full output memtables in
// autovector "purged_mems".
// autovector<MemTable*> purged_mems = {};
purged_mems = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for an accumulator output parameter, you have essentially three options:

  • assert it's already empty
  • just add to what's already there
  • blindly replace anything that might be there

The last of these is my least favorite because it seems inherently unsafe from a maintenance perspective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very true - I just updated the code with with the "assert it's already empty" option.

db/flush_job.cc Outdated
@@ -297,6 +303,11 @@ Status FlushJob::Run(LogsWithPrepTracker* prep_tracker,
<< (IOSTATS(cpu_read_nanos) - prev_cpu_read_nanos);
}

// Clean up mempurge output memtables flushed to SST.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushed to SST? I am confused about what purged_mems holds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, the name purged_mems is visibly confusing. I've replaced it with "full_output_mems". It contains the memtables resulting from the mempurge that are filled up to capacity and need to be flushed to storage.

db/flush_job.cc Outdated
for (auto newm : purged_mems) {
// Paranoia
if (newm != nullptr) {
delete newm;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double delete between here and caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch good catch.

db/flush_job.cc Outdated
// but write any full output table to level0.
if (s.ok()) {
TEST_SYNC_POINT("DBImpl::FlushJob:MemPurgeSuccessful");
for (MemTable* m : mems_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we sure that mems_ here is the same as above? Does the flush thread effectively take ownership of mems_? Can we make that an assertion, or simply not rely on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: we are guaranteed that mems_ here is the same as above - no other flush thread can edit these memtables at the same time, and even additional operations like the DB::Close wait for all the flushes to happen before interfering with any of this.
Long answer: mems_ is a variable that belongs to the FlushJob object, and the FlushJob object is created inside each flush thread (details at db_impl_compaction_flush.cc:154). mems_ simply contains the pointer addresses of the imm() memtables.
mems_ is populated by the column family data object when the db_mutex is held (FlushJob::PickMemTable). Upon populating mems_, the cfd object mark that these memtables are being flushed by a flush thread somewhere ("flush_in_progress_ = true", memtable_list.cc:357), which guarantees that nothing happens to the memtables contained in the mems_ object while the flush job is handling them.
Therefore we can be sure that there is no race condition thanks to the db_mutex, and since the memtables are marked with "flush_in_progress=true", we know that these tables wont be edited while we are working on them in the flush process.

Comment on lines +555 to +556
// If mempurge successful, don't write input tables to level0,
// but write any full output table to level0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the track you're taking here is keeping some data in memory only to maximize the chances that it becomes garbage and never has to be flushed. My inclination is (per column family) to flush all or nothing, for the benefit of write amplification in compaction. When we create an L0 file, it should be as big as possible. This should also make it easier to purge obsolete WAL files in due course.

Doesn't necessarily have to be fixed now, but something to be noted for reconsideration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, provided we're allowed to go beyond the size of a regular SST file (so that we dont accidentally create 2 sst files, one of them being full, the other one filled at xx% capacity). According to the builder.cc:BuildTable code, it looks like we would indeed create a single large SST file, which would be ideal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this comment, i actually thought that would lead to the creation of more than 1 sst file because i was convinced the SST file size was enforced. But after code inspection it looks like we would just write a single big SST file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SST file size limit is for breaking up sorted runs, primarily so that leveled compaction in L1 and above can operate on parts of sorted runs with reasonable granularity. L0 is special because (AFAIK) we assume each SST file is its own sorted run. :)

…apacity. Also, speed up unit test by decreasing memtable size. Add defautl value of mempurged_ to RollbackMemtablepurge.
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions / fixes remain. Otherwise looking good :)

// autovector<MemTable*> purged_mems = {};
purged_mems = {};

MemTable* new_mem = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use .release()

db/flush_job.cc Outdated
Status mempurge_s = MemPurge(purged_mems);
Status mempurge_s = MemPurge();
if (!mempurge_s.ok()) {
ROCKS_LOG_INFO(db_options_.info_log, "Mempurge process unsuccessful.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include status details, e.g. from ToString(). I'm sure you can find an example to copy.

It seems like an Aborted status should just be INFO but any other status is maybe WARN.

db/flush_job.cc Outdated Show resolved Hide resolved
db/flush_job.cc Outdated
(cfd_->GetFlushReason() == FlushReason::kWriteBufferFull) &&
(!mems_.empty())) {
Status mempurge_s = MemPurge(purged_mems);
}
// This will release and re-acquire the mutex.
Status s = WriteLevel0Table();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A smart person just suggested this "purged" state might be converted to temporary state here in FlushJob::Run ;)

db/flush_job.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@bjlemaire merged this pull request in 837705a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants