Skip to content

Perform post-flush updates of memtable list in a callback#6069

Closed
riversand963 wants to merge 3 commits intofacebook:masterfrom
riversand963:followup_6064
Closed

Perform post-flush updates of memtable list in a callback#6069
riversand963 wants to merge 3 commits intofacebook:masterfrom
riversand963:followup_6064

Conversation

@riversand963
Copy link
Contributor

@riversand963 riversand963 commented Nov 22, 2019

Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.

  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush 
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V

The reason behind is that: one bg flush thread's installing new Version can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after LogAndApply returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to LogAndApply. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)

$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check

@riversand963 riversand963 changed the title [WIP] Investigate a bug reported in #6064 [WIP] Investigate a bug related to superversion Nov 22, 2019
@riversand963 riversand963 added the WIP Work in progress label Nov 22, 2019
@riversand963
Copy link
Contributor Author

Cleaned up from #6064

@riversand963 riversand963 force-pushed the followup_6064 branch 2 times, most recently from c4b9a57 to 14be29e Compare November 24, 2019 07:32
@mm304321141
Copy link
Contributor

mm304321141 commented Nov 25, 2019

@riversand963
riversand963#3

We wrote some code illustrating this SuperVersion bug.
By the way, we do have a trivial fix of the bug.
Since it's not elegant, we are expecting your recommendations about this bug.

create a pull to show our trivial fix ...
it works ...

@riversand963
Copy link
Contributor Author

@riversand963
riversand963#3

We wrote some code illustrating this SuperVersion bug.
By the way, we do have a trivial fix of the bug.
Since it's not elegant, we are expecting your recommendations about this bug.

create a pull to show our trivial fix ...
it works ...

I am hesitant about adding a member variable, 'callback' of type std::function to VersionEdit. It is true that we can distinguish data that should be serialized and persisted from volatile data, but not sure if it is ideal.

@mm304321141
Copy link
Contributor

@riversand963
riversand963#3

We wrote some code illustrating this SuperVersion bug.
By the way, we do have a trivial fix of the bug.
Since it's not elegant, we are expecting your recommendations about this bug.

create a pull to show our trivial fix ...
it works ...

I am hesitant about adding a member variable, 'callback' of type std::function to VersionEdit. It is true that we can distinguish data that should be serialized and persisted from volatile data, but not sure if it is ideal.

Our online service needs to be fixed right away :(
The fix is rude and I don't prefer it ...
So we just report this issue to you (official) -.-

Adding a callback to LogAndApply is the same workaround ...
That I think adding callback to VersionEdit is more adaptable (in the future?).

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr
Copy link
Contributor

ajkr commented Dec 7, 2019

Can this bug also cause issue for SingleDelete()? For example, if a SINGLEDEL is in the mutable memtable, and the one PUT it covers is visible in both an immutable memtable and an L0 file, then I'd expect Get() to return the PUT from the L0 file.

@riversand963
Copy link
Contributor Author

riversand963 commented Dec 19, 2019

Can this bug also cause issue for SingleDelete()? For example, if a SINGLEDEL is in the mutable memtable, and the one PUT it covers is visible in both an immutable memtable and an L0 file, then I'd expect Get() to return the PUT from the L0 file.

I think it's possible, though I have not used sync points to verify.

Update:
@ajkr
According to https://github.com/facebook/rocksdb/blob/master/db/memtable.cc#L717:L730, looks like if Get() sees a single delete in mutable memtable, it will not search in imm or ssts. I tried the following test similar to the merge test in the PR, but haven't reproduced.

TEST_F(DBTest2, InstallSuperVersionForGroupCommittedFlush) {
  Options options = CurrentOptions();
  options.max_write_buffer_number = 3;
  DestroyAndReopen(options);
  ASSERT_OK(Put("bar", "value"));
  ASSERT_OK(Flush());
  ASSERT_OK(Put("bar", "value2"));
  ASSERT_OK(Flush());

  std::atomic<int> tid{0};
  const auto get_thread_id = [&tid]() {
    thread_local int thread_id{tid++};
    return thread_id;
  };
  SyncPoint::GetInstance()->DisableProcessing();
  SyncPoint::GetInstance()->ClearAllCallBacks();
  SyncPoint::GetInstance()->SetCallBack(
      "VersionSet::LogAndApply:BeforeWriterWaiting", [&](void* /*arg*/) {
        int thread_id = get_thread_id();
        if (1 == thread_id) {
          TEST_SYNC_POINT(
              "InstallSuperVersionWhenGroupWritingManifest::bg_compact_thr:0");
        } else if (2 == thread_id) {
          TEST_SYNC_POINT(
              "InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:0");
        }
      });
  SyncPoint::GetInstance()->SetCallBack(
      "VersionSet::LogAndApply:WriteManifest", [&](void* /*arg*/) {
        int thread_id = get_thread_id();
        if (0 == thread_id) {
          TEST_SYNC_POINT(
              "InstallSuperVersionWhenGroupWritingManifest::set_opts_thr:0");
          TEST_SYNC_POINT(
              "InstallSuperVersionWhenGroupWritingManifest::set_opts_thr:1");
        }
      });
  SyncPoint::GetInstance()->SetCallBack(
      "VersionSet::LogAndApply:WakeUpAndDone", [&](void* arg) {
        auto* mutex = reinterpret_cast<InstrumentedMutex*>(arg);
        mutex->AssertHeld();
        int thread_id = get_thread_id();
        ASSERT_EQ(2, thread_id);
        mutex->Unlock();
        TEST_SYNC_POINT(
            "InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:1");
        TEST_SYNC_POINT(
            "InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:2");
        mutex->Lock();
      });
  SyncPoint::GetInstance()->LoadDependency({
      {"InstallSuperVersionWhenGroupWritingManifest::set_opts_thr:0",
       "BackgroundCallCompaction:0"},
      {"InstallSuperVersionWhenGroupWritingManifest::bg_compact_thr:0",
       "DBImpl::BackgroundCallFlush:start"},
      {"InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:0",
       "InstallSuperVersionWhenGroupWritingManifest::set_opts_thr:1"},
      {"InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:1",
       "InstallSuperVersionWhenGroupWritingManifest::BeforeGet"},
      {"InstallSuperVersionWhenGroupWritingManifest::AfterGet",
       "InstallSuperVersionWhenGroupWritingManifest::bg_flush_thr:2"},
  });
  SyncPoint::GetInstance()->EnableProcessing();
  port::Thread set_opts_thread([&]() {
    ASSERT_OK(db_->SetOptions({{"max_sequential_skip_in_iterations", "4"}}));
  });
  port::Thread compact_thread([&]() {
    // Choose key range so as not to overlap with "foo". Otherwise,
    // CompactRange() will cause a flush, breaking the sync point dependencies.
    Slice begin_key("bar");
    Slice end_key("bbr");
    ASSERT_OK(db_->CompactRange(CompactRangeOptions(),
                                db_->DefaultColumnFamily(), &begin_key,
                                &end_key));
  });
  ASSERT_OK(Put("foo", "value0"));
  ASSERT_OK(dbfull()->TEST_SwitchMemtable(
      static_cast<ColumnFamilyHandleImpl*>(db_->DefaultColumnFamily())->cfd()));
  FlushOptions flush_opts;
  flush_opts.wait = false;
  ASSERT_OK(db_->Flush(flush_opts));

  ASSERT_OK(SingleDelete("foo"));
  std::string value;
  TEST_SYNC_POINT("InstallSuperVersionWhenGroupWritingManifest::BeforeGet");
  Status s = db_->Get(ReadOptions(), "foo", &value);
  TEST_SYNC_POINT("InstallSuperVersionWhenGroupWritingManifest::AfterGet");
  ASSERT_TRUE(s.IsNotFound());

  set_opts_thread.join();
  compact_thread.join();
}

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@riversand963 riversand963 removed the WIP Work in progress label Dec 19, 2019
@riversand963 riversand963 changed the title [WIP] Investigate a bug related to superversion Fix a bug related to superversion Dec 19, 2019
@riversand963 riversand963 requested a review from siying December 20, 2019 01:32
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963 riversand963 requested a review from a user August 28, 2020 01:27
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read every line of this method, but I assume it just wraps the deleted contents of this file.

db/merge_test.cc Outdated
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug may also affect delete? Can we add a test case for delete too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not. Can you elaborate on how delete can be affected?

@riversand963
Copy link
Contributor Author

Thanks @Cheng-Chang for the review!

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@riversand963 riversand963 changed the title Fix a bug related to superversion Perform post-flush updates of memtable list in a callback Oct 26, 2020
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963 riversand963 deleted the followup_6064 branch October 27, 2020 02:09
@facebook-github-bot
Copy link
Contributor

@riversand963 merged this pull request in 6134ce6.

* Since 6.14, fix false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
* Since 6.14, fix a bug that could cause a stalled write to crash with mixed of slowdown and no_slowdown writes (`WriteOptions.no_slowdown=true`).
* Fixed a bug which causes hang in closing DB when refit level is set in opt build. It was because ContinueBackgroundWork() was called in assert statement which is a no op. It was introduced in 6.14.
* Perform post-flush updates of in-memory data structures of memtable list in a callback that will be called by the thread writing to MANIFEST file before signaling other threads and releasing db mutex.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking at the release notes, I wasn't able to think of the user-visible bug that this referred to. Only by looking at the PR I learned it's a Get() correctness issue. Is it possible to write the note more in terms of what a user would observe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will submit a PR to address your comment.

codingrhythm pushed a commit to SafetyCulture/rocksdb that referenced this pull request Mar 5, 2021
)

Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.

```
  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V
```

The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```

Pull Request resolved: facebook#6069

Reviewed By: cheng-chang

Differential Revision: D18790894

Pulled By: riversand963

fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants