
Wraparound NextCompactionIndex in LevelCompactionBuilder #10355

Closed

mpoeter wants to merge 4 commits into facebook:main from mpoeter:wraparound-NextCompactionIndex

Conversation

@mpoeter
Contributor

@mpoeter mpoeter commented Jul 13, 2022

There can be situations where the picker is asked for a compaction twice for the same version. In such a scenario the first call might not be able to find any eligible files and as a result set NextCompactionIndex to the total number of files. In this case the second call could return null even though we could now compact one of the files we previously had to skip. This could then result in a deadlock as explained in more detail here: #10257 (comment)

This PR adds a unit test for LevelCompactionBuilder that simulates this situation and previously failed. As a fix I chose to keep NextCompactionIndex but simply wrap around when we reach the end of the file list (sketched below), so in the worst case we now recheck all files. This was the easiest fix while only slightly changing the previous behavior. However, there are other options as well:

  • remove NextCompactionIndex completely and always start from zero
  • reorder the files while iterating over them, moving files that are being compacted to the beginning, and store the index of the first file that is not being compacted as NextCompactionIndex; that way we can easily skip the files marked as being compacted (we never need to reconsider those)
  • you can probably think of more...

I would like to get your feedback on this simple fix and to hear whether you would prefer a different approach.
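
For illustration, here is a minimal, self-contained C++ sketch of the wraparound approach. It is not the actual PR diff: `FileMeta`, `PickFileWithWraparound`, and the `being_compacted`-only eligibility check are stand-ins for what the real LevelCompactionBuilder does.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a level's file metadata; the real picker also checks
// overlaps with the output level, compaction priority, etc.
struct FileMeta {
  bool being_compacted = false;
};

// Returns the index of the first eligible file, starting at
// next_compaction_index and wrapping around so that files skipped by an
// earlier call are reconsidered. Returns files.size() if nothing is eligible.
size_t PickFileWithWraparound(const std::vector<FileMeta>& files,
                              size_t& next_compaction_index) {
  const size_t n = files.size();
  if (n == 0) {
    return 0;
  }
  for (size_t probed = 0; probed < n; ++probed) {
    // Wrap around instead of stopping at the end of the file list.
    const size_t idx = (next_compaction_index + probed) % n;
    if (!files[idx].being_compacted) {
      // Remember where the next call should resume.
      next_compaction_index = (idx + 1) % n;
      return idx;
    }
  }
  return n;  // every file is currently being compacted; keep the index as-is
}
```

In the worst case (no eligible file) this scans the whole level once per call, which is the "recheck all files" cost mentioned above.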

@jsteemann
Contributor

With the currently released RocksDB versions, we keep running into the issue fixed by this PR. We keep seeing the write group leader hanging in a full write stop due to too many pending compaction bytes, but no further compactions being triggered.
That same deadlock situation often appears in our large scale ingestion workloads, and then renders the entire database process unusable.
We would be very interested in getting an "official" statement about the bugfix proposed in this PR, or about alternative solutions if you would prefer the fix to go in a different direction. Thanks!

@jsteemann
Contributor

Hello! It would be really awesome if this PR could get some attention, as the underlying problem is a serious issue (endless write stop, no further compactions) which affects our production usage of RocksDB.
Thanks!

@jsteemann
Contributor

Any idea if/when this PR can be considered for review? Thanks!

@neunhoef
Contributor

Is there any progress on this? This bug is really annoying for our use of RocksDB in ArangoDB, since it can lead to a customer database being completely deadlocked.

@jsteemann
Contributor

Any update on whether this PR can be accepted?
Without the changes from the PR in place, we have seen RocksDB go into an endless write stop sporadically.
Thank you for any update on this.

@jsteemann
Contributor

According to our tests, this PR fixes an actual bug in the compaction picker, which can lead to an endless write stop with no further compactions being triggered. We have observed the issue in the wild sporadically, starting about one year ago.
It would be good to get any feedback on this PR and include it upstream, so we do not have to maintain the change in our RocksDB fork.
Thanks!

@cbi42
Contributor

cbi42 commented Aug 18, 2023

Hi - thanks for debugging this issue and submitting a fix. I think I understand the cause of the "deadlock" situation: it is caused by the group commit of manifest writes/version creation combined with how the level compaction picker uses NextCompactionIndex. A repro timeline:

Suppose that L0 is also eligible for compaction and conflicts with L1 files that are being compacted by compact_thr2.
| time | compact_thr1 | compact_thr2 |
|------|--------------|--------------|
| 0 | compact L5->L6 | compact L1->L2 |
| 1 | LogAndApply | |
| 2 | | LogAndApply |
| 3 | compact_thr1 will write to manifest and create version for both threads | |
| 4 | write to manifest | |
| 5 | create a new version V | |
| 6 | release compaction | |
| 7 | schedule new compaction, releases mutex | |
| 8 | (some other compaction thread) pick compaction based on V, cannot pick L0 -> L1 since it overlaps with the L1->L2 compaction | |
| 9 | | wake up |
| 10 | | release compaction |
| 11 | | schedule new compaction, releases mutex |
| 12 | | (some other compaction thread) pick compaction based on V, cannot pick L0 -> L1 since NextCompactionIndex() was set to total num of files |
| 13 | Since writes are stopped, no more compaction is scheduled. | |

For potential fixes, I think the solution in your PR should work. Alternatively, we could avoid advancing NextCompactionIndex() past the first file that cannot be compacted due to a conflict with an ongoing compaction (sketched below).
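
A rough sketch of that alternative, using the same simplified single-flag eligibility model as the wraparound sketch earlier in this thread (illustrative only, not the real picker code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct FileMeta {
  bool being_compacted = false;  // simplified eligibility check
};

// Scan forward for an eligible file, but never persist the stored index past
// the first file skipped due to a conflict, so a later call on the same
// version reconsiders that file once its compaction has been released.
size_t PickFileClampedIndex(const std::vector<FileMeta>& files,
                            size_t& next_compaction_index) {
  const size_t n = files.size();
  size_t first_conflict = n;  // position of the first skipped file, if any
  for (size_t idx = next_compaction_index; idx < n; ++idx) {
    if (files[idx].being_compacted) {
      if (first_conflict == n) {
        first_conflict = idx;
      }
      continue;
    }
    next_compaction_index = std::min(first_conflict, idx + 1);
    return idx;
  }
  next_compaction_index = std::min(first_conflict, n);
  return n;  // nothing eligible right now
}
```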

I think a better solution, similar to #6069, can keep the optimization of advancing NextCompactionIndex(). Specifically, we can release compaction files in a manifest_write_cb that is passed to LogAndApply(). Then, by the time the new Version is visible to a compaction picker, the relevant compactions have already been released.
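
A very simplified sketch of that idea; the class, names, and locking below are illustrative, not RocksDB's real LogAndApply() interface. The key point is that the callback runs in the same critical section that installs the new Version, so any picker that can see the new Version also sees the just-released compaction files.

```cpp
#include <functional>
#include <mutex>

// Illustrative stand-in for the version set, not RocksDB's API.
class VersionSetLike {
 public:
  // manifest_write_cb runs under mu_ immediately before the new Version
  // becomes visible to compaction pickers.
  void LogAndApply(const std::function<void()>& manifest_write_cb) {
    WriteToManifest();  // manifest I/O, kept outside the critical section here
    std::lock_guard<std::mutex> lock(mu_);
    if (manifest_write_cb) {
      // e.g. clear the "being compacted" marks on the finished compaction's
      // input files, i.e. release the compaction.
      manifest_write_cb();
    }
    InstallNewVersion();  // only now can a picker observe the new Version
  }

 private:
  void WriteToManifest() {}
  void InstallNewVersion() {}
  std::mutex mu_;
};
```

This is roughly the shape the eventual fix in #11764 took: the compaction files are released via a callback passed to LogAndApply(), rather than only after LogAndApply() returns.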

@jsteemann
Contributor

@cbi42: thanks for following up on this. Since you are suggesting a potentially even better solution for the problem, would it be possible for you or someone else from the RocksDB team to take over the fix in place of this PR?
Thanks!

@ajkr
Contributor

ajkr commented Aug 23, 2023

Yes I talked to @cbi42 and believe he will try out the fix based on manifest_write_cb releasing files from compaction. Thanks very much for bringing this issue to our attention.

facebook-github-bot pushed a commit that referenced this pull request Sep 18, 2023
Summary:
Fixes #10257 (also see [here](#10355 (comment))) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](https://github.com/facebook/rocksdb/blob/ba597514309b686d8addb59616f067d5522186b7/db/version_set.h#L1199) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: #11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
@jsteemann
Contributor

@mpoeter : This PR can potentially be closed after the merge of #11764.

@mpoeter mpoeter closed this Sep 19, 2023
rockeet pushed a commit to topling/toplingdb that referenced this pull request Dec 19, 2023
rockeet pushed a commit to topling/toplingdb that referenced this pull request Sep 1, 2024