Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quarantine files in a limbo state after a manifest error #12030

Closed
wants to merge 3 commits into from

Conversation

jowlyzhang
Copy link
Contributor

Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) EnableFileDeletion has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in ErrorHandler.

These are the types of files being actively tracked in quarantine in this PR:

  1. new table files and blob files from a background job
  2. old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.

Note that WAL logs are not part of the quarantine because min_log_number_to_keep is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.

We track these files' file numbers because they share the same file number space.

Test plan:
Modified existing unit tests

@jowlyzhang jowlyzhang force-pushed the quarantine_limbo_files branch 2 times, most recently from dc25872 to 8305ff5 Compare October 30, 2023 18:34
@jowlyzhang jowlyzhang marked this pull request as draft October 30, 2023 18:46
@jowlyzhang jowlyzhang marked this pull request as ready for review October 31, 2023 18:26
@jowlyzhang jowlyzhang force-pushed the quarantine_limbo_files branch 2 times, most recently from bc887a9 to 0d8e211 Compare October 31, 2023 20:31
@jowlyzhang jowlyzhang requested a review from ajkr October 31, 2023 22:21
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang jowlyzhang force-pushed the quarantine_limbo_files branch from 0d8e211 to b2e79a3 Compare November 9, 2023 00:59
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Comment on lines 5191 to 5192
std::vector<const std::vector<uint64_t>*> files_to_quarantine_if_commit_fail;
std::vector<uint64_t> limbo_descriptor_log_file_number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does autovector work for these types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autovector works well. Thank you for the suggestion!

Comment on lines 5527 to 5529
// TODO: when CURRENT file rename fails, should the in memory state
// continue to be updated to use the new descriptor, should we do:
// manifest_io_status = io_s;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. The CURRENT rename might have succeeded and returned an error, in which case a restart will look for the new MANIFEST. But if we set manifest_io_status to non-OK it looks like it cleans up the new MANIFEST. I think keeping both like you're doing below is safest although if there's a cleaner way to do it correctly that'd be a welcome improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed context. I agree that keeping both is safest. I removed this TODO since there isn't too much of a follow up needed here.

@jowlyzhang jowlyzhang force-pushed the quarantine_limbo_files branch from b2e79a3 to c5fc849 Compare November 11, 2023 05:56
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang
Copy link
Contributor Author

LGTM, thanks!

Thank you for help review the PR.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang merged this pull request in 509947c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants