-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DB::GetSortedWalFiles() to ensure file deletion is disabled #8591
Conversation
db/db_impl/db_impl_files.cc
Outdated
s = DisableFileDeletionsWithLock(); | ||
my_disable_delete_obsolete_files = disable_delete_obsolete_files_; | ||
} | ||
if (my_disable_delete_obsolete_files == 1) { | ||
ROCKS_LOG_INFO(immutable_db_options_.info_log, "File Deletions Disabled"); | ||
} else { | ||
ROCKS_LOG_WARN(immutable_db_options_.info_log, | ||
"File Deletions Disabled, but already disabled. Counter: %d", | ||
disable_delete_obsolete_files_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be my_disable_delete_obsolete_files
(field is mutex guarded)
db/db_filesnapshot.cc
Outdated
@@ -126,8 +127,23 @@ Status DBImpl::GetSortedWalFiles(VectorLogPtr& files) { | |||
(pending_purge_obsolete_files_ > 0 || bg_purge_scheduled_ > 0)) { | |||
bg_cv_.Wait(); | |||
} | |||
|
|||
// If the called didn't disable deletion, we should do it to avoid the case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that another thread might have disabled file deletion and re-enable while this thread is expecting them to be disabled. For safety, I don't think it's a big deal to always disable (increment) and re-enable (decrement) considering how expensive GetSortedWalFiles is already (e.g. iterates over all SST files in the DB dir in most cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping to minimize behavior change for normal path to minimize the risk. Now changed to the way you suggested.
I have a different patch to address the problem #8573 , which just ignores the missing WAL, seems there's already logic to ignore deleted WAL in archived folder, I think it should also be safe do that for Live folder. |
@pdillinger doesn't like this approach. I don't have strong opinion which way to go. Would be OK to go either way after you decide. |
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@siying has updated the pull request. You must reimport the pull request before landing. |
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Summary: If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Test Plan: Run all existing tests
623d529
to
415b225
Compare
@siying has updated the pull request. You must reimport the pull request before landing. |
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
1 similar comment
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
This pull request has been merged in e8f218c. |
Hm, seems to me this approach makes what we're trying to avoid more likely not less. |
This fix is to try to avoid failure of DB::GetSortedWalFiles(). That is, returning an IO Error. It is not to prevent the case where the file returned doesn't exist anymore. What you said might be correct, as it is more likely that returned log files don't exist if a user doesn't disable file deletion by themselves. But this isn't violate anything, as if they don't prevent file deletion, this deletion could happen between function returns and they consume the data anyway. I think in this case, fixing the unit test might be the better way to do. I can go ahead and disable file deletion around the logic in DBTestXactLogIterator.TransactionLogIteratorCorruptedLog if you think that's the right thing to do. Thoughts? |
Yeah, what I meant is that while prior to this change, when there was a background job doing a purge competing with a BTW, I'm already working on deflaking |
Would disabling file deletion before GetSortedWalFiles() and re-enabling after truncating the file work? |
Actually. that might work; it would change the test in the sense that we might now be truncating a not-yet-archived file but that's probably fine. I'll give it a shot. |
Summary: Saw error like this: `Backup failed -- IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or directory` Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.) relies on no file deletions happening while its operating, which means not only disabling (more) deletions, but ensuring any pending deletions are completed. Two fixes related to this: * There was a gap in several places between decrementing pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where the db mutex would be released and GetSortedWalFiles (and others) could get false information that no deletions are pending. * The fix to facebook#8591 (disabling deletions in GetSortedWalFiles) seems incomplete because it doesn't prevent pending deletions from occuring during the operation (if deletions not already disabled, the case that was to be fixed by the change). Test Plan: existing tests (it's hard to write a test for interleavings that are now excluded - this is what stress test is for)
Summary: Saw error like this: `Backup failed -- IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or directory` Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.) relies on no file deletions happening while its operating, which means not only disabling (more) deletions, but ensuring any pending deletions are completed. Two fixes related to this: * There was a gap in several places between decrementing pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where the db mutex would be released and GetSortedWalFiles (and others) could get false information that no deletions are pending. * The fix to #8591 (disabling deletions in GetSortedWalFiles) seems incomplete because it doesn't prevent pending deletions from occuring during the operation (if deletions not already disabled, the case that was to be fixed by the change). Pull Request resolved: #9208 Test Plan: existing tests (it's hard to write a test for interleavings that are now excluded - this is what stress test is for) Reviewed By: ajkr Differential Revision: D32630675 Pulled By: pdillinger fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Summary: Saw error like this: `Backup failed -- IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or directory` Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.) relies on no file deletions happening while its operating, which means not only disabling (more) deletions, but ensuring any pending deletions are completed. Two fixes related to this: * There was a gap in several places between decrementing pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where the db mutex would be released and GetSortedWalFiles (and others) could get false information that no deletions are pending. * The fix to facebook#8591 (disabling deletions in GetSortedWalFiles) seems incomplete because it doesn't prevent pending deletions from occuring during the operation (if deletions not already disabled, the case that was to be fixed by the change). Pull Request resolved: facebook#9208 Test Plan: existing tests (it's hard to write a test for interleavings that are now excluded - this is what stress test is for) Reviewed By: ajkr Differential Revision: D32630675 Pulled By: pdillinger fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.
Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.
Test Plan: Run all existing tests