-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quarantine files in a limbo state after a manifest error #12030
Conversation
dc25872
to
8305ff5
Compare
bc887a9
to
0d8e211
Compare
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
0d8e211
to
b2e79a3
Compare
@jowlyzhang has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
db/version_set.cc
Outdated
std::vector<const std::vector<uint64_t>*> files_to_quarantine_if_commit_fail; | ||
std::vector<uint64_t> limbo_descriptor_log_file_number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does autovector
work for these types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autovector
works well. Thank you for the suggestion!
db/version_set.cc
Outdated
// TODO: when CURRENT file rename fails, should the in memory state | ||
// continue to be updated to use the new descriptor, should we do: | ||
// manifest_io_status = io_s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. The CURRENT rename might have succeeded and returned an error, in which case a restart will look for the new MANIFEST. But if we set manifest_io_status
to non-OK it looks like it cleans up the new MANIFEST. I think keeping both like you're doing below is safest although if there's a cleaner way to do it correctly that'd be a welcome improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed context. I agree that keeping both is safest. I removed this TODO since there isn't too much of a follow up needed here.
b2e79a3
to
c5fc849
Compare
@jowlyzhang has updated the pull request. You must reimport the pull request before landing. |
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Thank you for help review the PR. |
@jowlyzhang merged this pull request in 509947c. |
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2)
EnableFileDeletion
has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted inErrorHandler
.These are the types of files being actively tracked in quarantine in this PR:
Note that WAL logs are not part of the quarantine because
min_log_number_to_keep
is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.We track these files' file numbers because they share the same file number space.
Test plan:
Modified existing unit tests