Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand auto recovery to background read errors #9679

Closed
wants to merge 9 commits into from

Conversation

anand1976
Copy link
Contributor

Fix and enhance the background error recovery logic to handle the
following situations -

  1. Background read errors during flush/compaction (previously was
    resulting in unrecoverable state)
  2. Fix auto recovery failure on read/write errors during atomic flush.
    It was failing due to a bug in setting the resuming_from_bg_err variable
    in AtomicFlushMemTablesToOutputFiles.

Test plan:
Add new unit tests in error_handler_fs_test

@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anand1976 for the PR. Left a few comments.

db/db_impl/db_impl_compaction_flush.cc Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Show resolved Hide resolved
@@ -316,24 +309,18 @@ Status DBImpl::FlushMemTableToOutputFile(
// error), all the Manifest write will be map to soft error.
// TODO: kManifestWriteNoWAL and kFlushNoWAL are misleading. Refactor is
// needed.
error_handler_.SetBGError(io_s,
error_handler_.SetBGError(s,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe s can be non-ok and log_io_s is ok, while WAL has actually happened. Therefore, the ErrorReason here is incorrect? Or is the name kManifestWriteNoWAL confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks so according to the comment in line 310, and we can address this in future PRs. We can put some comments to explain the logic error handling based on reason, status, etc. in an easy-to-find place, if feasible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This could happen WAL or no WAL. Probably need a new reason, or maybe a structure with more context about the error.

db/db_impl/db_impl_compaction_flush.cc Show resolved Hide resolved
db/error_handler_fs_test.cc Outdated Show resolved Hide resolved
db/error_handler_fs_test.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

anand76 added 9 commits March 15, 2022 14:12
Summary:
Fix and enhance the background error recovery logic to handle the
following situations -
1. Background read errors during flush/compaction (previously was
resulting in unrecoverable state)
2. Fix auto recovery failure on read/write errors during atomic flush.
It was failing due to a bug in setting the resuming_from_bg_err variable
in AtomicFlushMemTablesToOutputFiles.
@facebook-github-bot
Copy link
Contributor

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants