Wraparound NextCompactionIndex in LevelCompactionBuilder #10355
mpoeter wants to merge 4 commits into facebook:main
Conversation
There can be situations where the picker is asked for a compaction twice for the same version. In such a scenario the first call might not be able to find any eligible files and as a result set NextCompactionIndex to the total number of files. In this case the second call could return null even though we could now compact one of the files we previously had to skip. This could then result in a deadlock as explained in more detail here: facebook#10257 (comment)
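For illustration, here is a minimal C++ sketch (toy types, not the actual LevelCompactionBuilder code) of how a per-version cursor like NextCompactionIndex can cause this: the first call skips every file that is part of an ongoing compaction and leaves the cursor at the end of the list, so a second call against the same version finds nothing even after those compactions have finished.

```cpp
// Toy model of the per-level compaction cursor (illustrative only).
#include <cstddef>
#include <vector>

struct FileMeta {
  bool being_compacted = false;  // file is an input of an ongoing compaction
};

struct LevelState {
  std::vector<FileMeta> files;
  size_t next_compaction_index = 0;  // cursor persisted per version
};

// Returns the index of a file to compact, or files.size() if none was found.
size_t PickFile(LevelState& level) {
  size_t i = level.next_compaction_index;
  for (; i < level.files.size(); ++i) {
    if (!level.files[i].being_compacted) {
      break;  // found an eligible file
    }
  }
  // The cursor is advanced past every skipped file. If all remaining files
  // were busy, it ends up at files.size(), so a second call on the same
  // version starts there and returns "nothing" even if the skipped files
  // have been released in the meantime.
  level.next_compaction_index = (i < level.files.size()) ? i + 1 : i;
  return i;
}
```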
With the currently released RocksDB versions, we keep running into the issue fixed by this PR: we see the write group leader hanging in a full write stop due to too many pending compaction bytes, but no further compactions being triggered.
Hello! It would be really awesome if this PR could get some attention, as the underlying problem is a serious issue (endless write stop, no further compactions) which affects our production usage of RocksDB.
Any idea if/when this PR can be considered for review? Thanks!
Is there any progress on this? This bug is really annoying for our use of RocksDB in ArangoDB, since it can lead to a customer database being completely deadlocked.
Any update on whether this PR can be accepted?
According to our tests, this PR fixes an actual bug in the compaction picker, one that can lead to an endless write stop with no further compactions being triggered. We observed the issue in the wild sporadically about a year ago.
Hi, thanks for debugging this issue and submitting a fix. I think I understand the cause of the "deadlock" situation: it is caused by the group commit of manifest writes/version creation combined with how the level compaction picker uses NextCompactionIndex. A repro timeline:

For potential fixes, I think the solution in your PR should work. We could also simply not advance NextCompactionIndex() past the first file that cannot be compacted due to a conflict with an ongoing compaction. However, I think a better solution, similar to #6069, can be made that keeps the optimization of advancing NextCompactionIndex(): specifically, we can release compaction files in a callback invoked from LogAndApply(), in the same critical section where the new Version is installed.
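A minimal sketch of the first alternative mentioned here, reusing the toy LevelState from the sketch above (illustrative only, not the actual picker code): keep scanning within one call, but never persist the cursor past the first file that was skipped because of a conflict, so a later call on the same version rechecks it.

```cpp
#include <algorithm>  // std::min

// Alternative sketch: do not advance the persisted cursor past the first
// file skipped due to a conflict with an ongoing compaction. This keeps
// correctness across repeated calls on the same version, at the cost of
// re-scanning that file on every later call.
size_t PickFileConservative(LevelState& level) {
  size_t first_conflict = level.files.size();
  for (size_t i = level.next_compaction_index; i < level.files.size(); ++i) {
    if (level.files[i].being_compacted) {
      first_conflict = std::min(first_conflict, i);  // remember, keep scanning
      continue;
    }
    // Persist the cursor no further than the first conflicting file.
    level.next_compaction_index = std::min(first_conflict, i + 1);
    return i;
  }
  level.next_compaction_index = std::min(first_conflict, level.files.size());
  return level.files.size();  // nothing eligible in this call
}
```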
@cbi42: thanks for following up on this. Since you are suggesting a potentially even better solution to the problem, would it be possible for you or someone else from the RocksDB team to take over the fix in place of this PR?
Yes, I talked to @cbi42 and believe he will try out the fix based on the approach he suggested above.
Summary: Fixes #10257 (also see [here](#10355 (comment))) by releasing compaction files earlier when writing to the manifest in LogAndApply(). This is done by passing a [callback](https://github.com/facebook/rocksdb/blob/ba597514309b686d8addb59616f067d5522186b7/db/version_set.h#L1199) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When the compaction picker picks a compaction based on the new version, these compaction files will already have been released.

Pull Request resolved: #11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check that compaction files are released

Reviewed By: ajkr
Differential Revision: D48742152
Pulled By: cbi42
fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
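A very rough sketch of the idea behind that fix (names and signatures are illustrative, not RocksDB's actual VersionSet API): the compaction's input files are released by a callback that runs in the same critical section in which the new Version is installed, so any picker that sees the new Version also sees the released files.

```cpp
#include <functional>
#include <memory>

struct ToyVersion {};

class ToyVersionSet {
 public:
  // In RocksDB the DB mutex already guards this critical section; the
  // locking is omitted here for brevity.
  void LogAndApply(const std::function<void()>& manifest_write_cb) {
    WriteEditToManifest();  // possibly group-committed with other edits
    if (manifest_write_cb) {
      manifest_write_cb();  // e.g. mark the compaction's input files as free
    }
    current_ = std::make_unique<ToyVersion>();  // install the new Version
    // Any compaction picker that runs from here on sees both the new
    // Version and the already-released input files, so it cannot skip them.
  }

 private:
  void WriteEditToManifest() { /* write the VersionEdit to the MANIFEST */ }
  std::unique_ptr<ToyVersion> current_;
};
```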
This PR adds a unit test for LevelCompactionBuilder that simulates the situation described above and previously failed. As a solution I chose to keep NextCompactionIndex but simply wrap around when we reach the end of the file list, so in the worst case we will now recheck all files. This was the easiest fix and at the same time only slightly changes the previous behavior. However, there are other options as well.
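A sketch of the wraparound proposed here, again using the toy LevelState from above (illustrative only, not the actual LevelCompactionBuilder change): when the cursor reaches the end of the file list, wrap around and recheck the files that were skipped earlier, so in the worst case one call scans every file in the level once.

```cpp
// Wraparound sketch: start at the persisted cursor, scan at most
// files.size() entries, and wrap past the end of the list so previously
// skipped files are rechecked on a later call against the same version.
size_t PickFileWraparound(LevelState& level) {
  const size_t n = level.files.size();
  if (n == 0) {
    return n;
  }
  const size_t start = level.next_compaction_index % n;
  for (size_t k = 0; k < n; ++k) {
    const size_t i = (start + k) % n;
    if (level.files[i].being_compacted) {
      continue;  // still busy, check the next file (possibly wrapping)
    }
    level.next_compaction_index = (i + 1) % n;
    return i;
  }
  return n;  // every file is still part of an ongoing compaction
}
```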
I would like to get your feedback on this simple fix and whether you would prefer a different approach.