Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB::GetSortedWalFiles() to ensure file deletion is disabled #8591

Closed
wants to merge 7 commits into from

Conversation

siying
Copy link
Contributor

@siying siying commented Jul 28, 2021

Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.

Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.

Test Plan: Run all existing tests

s = DisableFileDeletionsWithLock();
my_disable_delete_obsolete_files = disable_delete_obsolete_files_;
}
if (my_disable_delete_obsolete_files == 1) {
ROCKS_LOG_INFO(immutable_db_options_.info_log, "File Deletions Disabled");
} else {
ROCKS_LOG_WARN(immutable_db_options_.info_log,
"File Deletions Disabled, but already disabled. Counter: %d",
disable_delete_obsolete_files_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be my_disable_delete_obsolete_files (field is mutex guarded)

@@ -126,8 +127,23 @@ Status DBImpl::GetSortedWalFiles(VectorLogPtr& files) {
(pending_purge_obsolete_files_ > 0 || bg_purge_scheduled_ > 0)) {
bg_cv_.Wait();
}

// If the called didn't disable deletion, we should do it to avoid the case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that another thread might have disabled file deletion and re-enable while this thread is expecting them to be disabled. For safety, I don't think it's a big deal to always disable (increment) and re-enable (decrement) considering how expensive GetSortedWalFiles is already (e.g. iterates over all SST files in the DB dir in most cases).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to minimize behavior change for normal path to minimize the risk. Now changed to the way you suggested.

@jay-zhuang
Copy link
Contributor

I have a different patch to address the problem #8573 , which just ignores the missing WAL, seems there's already logic to ignore deleted WAL in archived folder, I think it should also be safe do that for Live folder.

@siying
Copy link
Contributor Author

siying commented Jul 28, 2021

I have a different patch to address the problem #8573 , which just ignores the missing WAL, seems there's already logic to ignore deleted WAL in archived folder, I think it should also be safe do that for Live folder.

@pdillinger doesn't like this approach. I don't have strong opinion which way to go. Would be OK to go either way after you decide.

@facebook-github-bot
Copy link
Contributor

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@facebook-github-bot
Copy link
Contributor

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying added 7 commits July 28, 2021 15:31
Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.

Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.

Test Plan: Run all existing tests
@siying siying force-pushed the wal_list_withoutlock branch from 623d529 to 415b225 Compare July 28, 2021 22:31
@facebook-github-bot
Copy link
Contributor

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in e8f218c.

@ltamasi
Copy link
Contributor

ltamasi commented Aug 9, 2021

Hm, seems to me this approach makes what we're trying to avoid more likely not less. EnableFileDeletions calls FindObsoleteFiles / PurgeObsoleteFiles, so technically GetSortedWalFiles may now invalidate its own results before returning. I suspect this is behind the frequent DBTestXactLogIterator.TransactionLogIteratorCorruptedLog failures we've been seeing recently.

@siying
Copy link
Contributor Author

siying commented Aug 9, 2021

Hm, seems to me this approach makes what we're trying to avoid more likely not less. EnableFileDeletions calls FindObsoleteFiles / PurgeObsoleteFiles, so technically GetSortedWalFiles may now invalidate its own results before returning. I suspect this is behind the frequent DBTestXactLogIterator.TransactionLogIteratorCorruptedLog failures we've been seeing recently.

This fix is to try to avoid failure of DB::GetSortedWalFiles(). That is, returning an IO Error. It is not to prevent the case where the file returned doesn't exist anymore. What you said might be correct, as it is more likely that returned log files don't exist if a user doesn't disable file deletion by themselves. But this isn't violate anything, as if they don't prevent file deletion, this deletion could happen between function returns and they consume the data anyway. I think in this case, fixing the unit test might be the better way to do. I can go ahead and disable file deletion around the logic in DBTestXactLogIterator.TransactionLogIteratorCorruptedLog if you think that's the right thing to do. Thoughts?

@ltamasi
Copy link
Contributor

ltamasi commented Aug 9, 2021

Yeah, what I meant is that while prior to this change, when there was a background job doing a purge competing with a GetSortedWalFiles call, you could get lucky and see the updated state but now we're guaranteed to get unlucky. I agree that this is not an API violation though.

BTW, I'm already working on deflaking DBTestXactLogIterator.TransactionLogIteratorCorruptedLog; disabling deletions is probably not an option in that case though because the test relies on an archived WAL file (which is also produced by the purge logic).

@siying
Copy link
Contributor Author

siying commented Aug 9, 2021

Yeah, what I meant is that while prior to this change, when there was a background job doing a purge competing with a GetSortedWalFiles call, you could get lucky and see the updated state but now we're guaranteed to get _un_lucky. I agree that this is not an API violation though.

BTW, I'm already working on deflaking DBTestXactLogIterator.TransactionLogIteratorCorruptedLog; disabling deletions is probably not an option in that case though because the test relies on an archived WAL file (which is also produced by the purge logic).

Would disabling file deletion before GetSortedWalFiles() and re-enabling after truncating the file work?

@ltamasi
Copy link
Contributor

ltamasi commented Aug 9, 2021

Would disabling file deletion before GetSortedWalFiles() and re-enabling after truncating the file work?

Actually. that might work; it would change the test in the sense that we might now be truncating a not-yet-archived file but that's probably fine. I'll give it a shot.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 23, 2021
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`

Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:

* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.

* The fix to facebook#8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).

Test Plan: existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)
facebook-github-bot pushed a commit that referenced this pull request Nov 24, 2021
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`

Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:

* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.

* The fix to #8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).

Pull Request resolved: #9208

Test Plan:
existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)

Reviewed By: ajkr

Differential Revision: D32630675

Pulled By: pdillinger

fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 29, 2021
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`

Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:

* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.

* The fix to facebook#8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).

Pull Request resolved: facebook#9208

Test Plan:
existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)

Reviewed By: ajkr

Differential Revision: D32630675

Pulled By: pdillinger

fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants