
Wraparound NextCompactionIndex in LevelCompactionBuilder #10355

Closed

mpoeter wants to merge 4 commits into facebook:main from mpoeter:wraparound-NextCompactionIndex

Conversation

@mpoeter
Contributor

@mpoeter mpoeter commented Jul 13, 2022

There can be situations where the picker is asked for a compaction twice for the same version. In such a scenario the first call might not be able to find any eligible files and as a result set NextCompactionIndex to the total number of files. In this case the second call could return null even though we could now compact one of the files we previously had to skip. This could then result in a deadlock as explained in more detail here: #10257 (comment)

This PR adds a unit test for LevelCompactionBuilder that simulates this situation and previously failed. As a fix I chose to keep NextCompactionIndex but simply wrap around when we reach the end of the file list (sketched below), so in the worst case we now recheck all files. This was the easiest fix while only slightly changing the previous behavior. However, there are other options as well:

  • remove NextCompactionIndex completely and always start from zero
  • reorder the files while iterating over them, moving files that are being compacted to the beginning, and store the index of the first file that is not being compacted as NextCompactionIndex; that way we can easily skip the files marked as being compacted (we never need to reconsider those)
  • you can probably think of more...

I would like to get your feedback on this simple fix and to hear whether you would prefer a different approach.
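
For illustration, here is a minimal, self-contained C++ sketch of the wraparound approach. It is not the actual PR diff: `FileMeta`, `PickFileWithWraparound`, and the `being_compacted`-only eligibility check are stand-ins for what the real LevelCompactionBuilder does.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a level's file metadata; the real picker also checks
// overlaps with the output level, compaction priority, etc.
struct FileMeta {
  bool being_compacted = false;
};

// Returns the index of the first eligible file, starting at
// next_compaction_index and wrapping around so that files skipped by an
// earlier call are reconsidered. Returns files.size() if nothing is eligible.
size_t PickFileWithWraparound(const std::vector<FileMeta>& files,
                              size_t& next_compaction_index) {
  const size_t n = files.size();
  if (n == 0) {
    return 0;
  }
  for (size_t probed = 0; probed < n; ++probed) {
    // Wrap around instead of stopping at the end of the file list.
    const size_t idx = (next_compaction_index + probed) % n;
    if (!files[idx].being_compacted) {
      // Remember where the next call should resume.
      next_compaction_index = (idx + 1) % n;
      return idx;
    }
  }
  return n;  // every file is currently being compacted; keep the index as-is
}
```

In the worst case (no eligible file) this scans the whole level once per call, which is the "recheck all files" cost mentioned above.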

@jsteemann
Contributor

With the currently released RocksDB versions, we keep running into the issue fixed by this PR. We keep seeing the write group leader hanging in a full write stop due to too many pending compaction bytes, but no further compactions being triggered.
That same deadlock situation often appears in our large scale ingestion workloads, and then renders the entire database process unusable.
We would be very interested in getting an "official" statement about the bugfix proposed in this PR, or about alternative solutions if you would prefer the fix to go in a different direction. Thanks!

@jsteemann
Contributor

Hello! It would be really awesome if this PR could get some attention, as the underlying problem is a serious issue (endless write stop, no further compactions) which affects our production usage of RocksDB.
Thanks!

@jsteemann
Contributor

Any idea if/when this PR can be considered for review? Thanks!

@neunhoef
Contributor

Is there any progress on this? This bug is really annoying for our use of RocksDB in ArangoDB, since it can lead to a customer database being completely deadlocked.

@jsteemann
Contributor

Any update on whether this PR can be accepted?
Without the changes from the PR in place, we have seen RocksDB go into an endless write stop sporadically.
Thank you for any update on this.

@jsteemann
Contributor

According to our tests, this PR fixes an actual bug in the compaction picker, which can lead to an endless write stop with no further compactions being triggered. We have observed the issue in the wild sporadically, starting about one year ago.
It would be good to get any feedback on this PR and include it upstream, so we do not have to maintain the change in our RocksDB fork.
Thanks!

@cbi42
Contributor

cbi42 commented Aug 18, 2023

Hi - thanks for debugging this issue and submitting a fix. I think I understand the cause of the "deadlock" situation: it is caused by the group commit of manifest writes/version creation combined with how the level compaction picker uses NextCompactionIndex. A repro timeline:

Suppose that L0 is also eligible for compaction and conflicts with L1 files that are being compacted by compact_thr2.
| time | compact_thr1 | compact_thr2 |
|------|--------------|--------------|
| 0 | compact L5->L6 | compact L1->L2 |
| 1 | LogAndApply | |
| 2 | | LogAndApply |
| 3 | compact_thr1 will write to manifest and create version for both threads | |
| 4 | write to manifest | |
| 5 | create a new version V | |
| 6 | release compaction | |
| 7 | schedule new compaction, releases mutex | |
| 8 | (some other compaction thread) pick compaction based on V, cannot pick L0 -> L1 since it overlaps with the L1->L2 compaction | |
| 9 | | wake up |
| 10 | | release compaction |
| 11 | | schedule new compaction, releases mutex |
| 12 | | (some other compaction thread) pick compaction based on V, cannot pick L0 -> L1 since NextCompactionIndex() was set to total num of files |
| 13 | Since writes are stopped, no more compaction is scheduled. | |

For potential fixes, I think the solution in your PR should work. Alternatively, we could avoid advancing NextCompactionIndex() past the first file that cannot be compacted due to a conflict with an ongoing compaction (sketched below).
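
A rough sketch of that alternative, using the same simplified single-flag eligibility model as the wraparound sketch earlier in this thread (illustrative only, not the real picker code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct FileMeta {
  bool being_compacted = false;  // simplified eligibility check
};

// Scan forward for an eligible file, but never persist the stored index past
// the first file skipped due to a conflict, so a later call on the same
// version reconsiders that file once its compaction has been released.
size_t PickFileClampedIndex(const std::vector<FileMeta>& files,
                            size_t& next_compaction_index) {
  const size_t n = files.size();
  size_t first_conflict = n;  // position of the first skipped file, if any
  for (size_t idx = next_compaction_index; idx < n; ++idx) {
    if (files[idx].being_compacted) {
      if (first_conflict == n) {
        first_conflict = idx;
      }
      continue;
    }
    next_compaction_index = std::min(first_conflict, idx + 1);
    return idx;
  }
  next_compaction_index = std::min(first_conflict, n);
  return n;  // nothing eligible right now
}
```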

I think a better solution, similar to #6069, can keep the optimization of advancing NextCompactionIndex(). Specifically, we can release compaction files in a manifest_write_cb that is passed to LogAndApply(). Then, by the time the new Version is visible to a compaction picker, the relevant compactions have already been released.
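
A very simplified sketch of that idea; the class, names, and locking below are illustrative, not RocksDB's real LogAndApply() interface. The key point is that the callback runs in the same critical section that installs the new Version, so any picker that can see the new Version also sees the just-released compaction files.

```cpp
#include <functional>
#include <mutex>

// Illustrative stand-in for the version set, not RocksDB's API.
class VersionSetLike {
 public:
  // manifest_write_cb runs under mu_ immediately before the new Version
  // becomes visible to compaction pickers.
  void LogAndApply(const std::function<void()>& manifest_write_cb) {
    WriteToManifest();  // manifest I/O, kept outside the critical section here
    std::lock_guard<std::mutex> lock(mu_);
    if (manifest_write_cb) {
      // e.g. clear the "being compacted" marks on the finished compaction's
      // input files, i.e. release the compaction.
      manifest_write_cb();
    }
    InstallNewVersion();  // only now can a picker observe the new Version
  }

 private:
  void WriteToManifest() {}
  void InstallNewVersion() {}
  std::mutex mu_;
};
```

This is roughly the shape the eventual fix in #11764 took: the compaction files are released via a callback passed to LogAndApply(), rather than only after LogAndApply() returns.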

@jsteemann
Contributor

@cbi42: thanks for following up on this. Since you are suggesting a potentially even better solution for the problem, would it be possible for you or someone else from the RocksDB team to take over the fix in place of this PR?
Thanks!

@ajkr
Contributor

ajkr commented Aug 23, 2023

Yes I talked to @cbi42 and believe he will try out the fix based on manifest_write_cb releasing files from compaction. Thanks very much for bringing this issue to our attention.

facebook-github-bot pushed a commit that referenced this pull request Sep 18, 2023
Summary:
Fixes #10257 (also see [here](#10355 (comment))) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](https://github.com/facebook/rocksdb/blob/ba597514309b686d8addb59616f067d5522186b7/db/version_set.h#L1199) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: #11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
@jsteemann
Contributor

@mpoeter : This PR can potentially be closed after the merge of #11764.

@mpoeter mpoeter closed this Sep 19, 2023
rockeet pushed a commit to topling/toplingdb that referenced this pull request Dec 19, 2023
rockeet pushed a commit to topling/toplingdb that referenced this pull request Sep 1, 2024