Fix bug where retries within RemoteStoreRefreshListener cause infos/checkpoint mismatch #10655
Merged
mch2 requested review from reta, anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, gbbafna, setiah, kartg, kotwanikunal, nknize, owaiskazi19, peternied, Rishikesh1159, ryanbogan, saratvemulapalli, shwetathareja, dreamer-89, VachaShah, dbwiddis, sachinpkale, sohami and msfroh as code owners October 16, 2023 22:58
github-actions bot added labels Oct 16, 2023:
>test-failure: Test failure from CI, local build, etc.
bug: Something isn't working
flaky-test: Random test failure that succeeds on second run
Indexing: Indexing, Bulk Indexing and anything related to indexing
Indexing:Replication: Issues and PRs related to core replication framework eg segrep
Storage: Issues and PRs relating to data and metadata storage
Gradle Check (Jenkins) Run Completed with:
#10730 again
ashking94
approved these changes
Oct 19, 2023
lgtm
mixed cluster bwc tests failing, believe this will pass once #10754 is merged and rebased
andrross
reviewed
Oct 19, 2023
…h between ReplicationCheckpoint and uploaded SegmentInfos.

Retries within RemoteStoreRefreshListener run outside of the refresh thread. This means that concurrent refreshes may occur during syncSegments execution updating the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update refresh listener to recompute checkpoint from latest infos snapshot.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix broken test case by comparing segments gen

Signed-off-by: Marc Handalian <handalm@amazon.com>

spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix RemoteStoreRefreshListener tests

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
opensearch-trigger-bot bot
pushed a commit
that referenced
this pull request
Oct 19, 2023
…heckpoint mismatch (#10655) * Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos. Retries within RemoteStoreRefreshListener run outside of the refresh thread. This means that concurrent refreshes may occur during syncSegments execution updating the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata. Signed-off-by: Marc Handalian <handalm@amazon.com> Update refresh listener to recompute checkpoint from latest infos snapshot. Signed-off-by: Marc Handalian <handalm@amazon.com> Fix broken test case by comparing segments gen Signed-off-by: Marc Handalian <handalm@amazon.com> spotless Signed-off-by: Marc Handalian <handalm@amazon.com> Fix RemoteStoreRefreshListener tests Signed-off-by: Marc Handalian <handalm@amazon.com> * add extra log Signed-off-by: Marc Handalian <handalm@amazon.com> --------- Signed-off-by: Marc Handalian <handalm@amazon.com> (cherry picked from commit e389a09) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2
pushed a commit
that referenced
this pull request
Oct 19, 2023
…heckpoint mismatch (#10655) (#10760) * Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos. Retries within RemoteStoreRefreshListener run outside of the refresh thread. This means that concurrent refreshes may occur during syncSegments execution updating the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata. Update refresh listener to recompute checkpoint from latest infos snapshot. Fix broken test case by comparing segments gen spotless Fix RemoteStoreRefreshListener tests * add extra log --------- (cherry picked from commit e389a09) Signed-off-by: Marc Handalian <handalm@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee
pushed a commit
to austintlee/OpenSearch
that referenced
this pull request
Oct 23, 2023
…heckpoint mismatch (opensearch-project#10655) * Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos. Retries within RemoteStoreRefreshListener run outside of the refresh thread. This means that concurrent refreshes may occur during syncSegments execution updating the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata. Signed-off-by: Marc Handalian <handalm@amazon.com> Update refresh listener to recompute checkpoint from latest infos snapshot. Signed-off-by: Marc Handalian <handalm@amazon.com> Fix broken test case by comparing segments gen Signed-off-by: Marc Handalian <handalm@amazon.com> spotless Signed-off-by: Marc Handalian <handalm@amazon.com> Fix RemoteStoreRefreshListener tests Signed-off-by: Marc Handalian <handalm@amazon.com> * add extra log Signed-off-by: Marc Handalian <handalm@amazon.com> --------- Signed-off-by: Marc Handalian <handalm@amazon.com>
shiv0408
pushed a commit
to Gaurav614/OpenSearch
that referenced
this pull request
Apr 25, 2024
…heckpoint mismatch (opensearch-project#10655) * Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos. Retries within RemoteStoreRefreshListener run outside of the refresh thread. This means that concurrent refreshes may occur during syncSegments execution updating the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata. Signed-off-by: Marc Handalian <handalm@amazon.com> Update refresh listener to recompute checkpoint from latest infos snapshot. Signed-off-by: Marc Handalian <handalm@amazon.com> Fix broken test case by comparing segments gen Signed-off-by: Marc Handalian <handalm@amazon.com> spotless Signed-off-by: Marc Handalian <handalm@amazon.com> Fix RemoteStoreRefreshListener tests Signed-off-by: Marc Handalian <handalm@amazon.com> * add extra log Signed-off-by: Marc Handalian <handalm@amazon.com> --------- Signed-off-by: Marc Handalian <handalm@amazon.com> Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x
Backport to 2.x branch
bug
Something isn't working
flaky-test
Random test failure that succeeds on second run
Indexing:Replication
Issues and PRs related to core replication framework eg segrep
Indexing
Indexing, Bulk Indexing and anything related to indexing
skip-changelog
Storage
Issues and PRs relating to data and metadata storage
>test-failure
Test failure from CI, local build, etc.
Description
This bug was found while analyzing flaky integration tests (ITs) that use remote store, in which replicas did not catch up with the primary. Retries within RemoteStoreRefreshListener run outside of the refresh thread, so concurrent refreshes may occur during syncSegments execution and update the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but that listener is not guaranteed to have run before the retry fetches the infos and the checkpoint independently. This fix makes the listener recompute the checkpoint while fetching the SegmentInfos, but only when necessary.
I believe this fixes all of the flaky SegmentReplicationUsingRemoteStore ITs where the replica count is not high enough.
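For readers who want the race in concrete terms, here is a minimal, self-contained sketch. It is not the code from this PR: Infos, Checkpoint, Upload, readerInfos, latestCheckpoint, computeCheckpoint, prepareRacy and prepareFromSnapshot are all hypothetical stand-ins. Comparing generation and version to decide whether a recompute is needed is an assumption made for illustration. The idea shown is only that the checkpoint should be derived from the same infos snapshot that gets uploaded, and recomputed only when the cached one does not describe that snapshot.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CheckpointSyncSketch {

    /** Hypothetical stand-in for SegmentInfos; only generation and version are modeled. */
    static final class Infos {
        final long generation;
        final long version;

        Infos(long generation, long version) {
            this.generation = generation;
            this.version = version;
        }
    }

    /** Hypothetical stand-in for ReplicationCheckpoint. */
    static final class Checkpoint {
        final long generation;
        final long version;

        Checkpoint(long generation, long version) {
            this.generation = generation;
            this.version = version;
        }

        boolean matches(Infos infos) {
            return generation == infos.generation && version == infos.version;
        }
    }

    /** What one sync attempt would upload: an infos snapshot and the checkpoint published with it. */
    static final class Upload {
        final Infos infos;
        final Checkpoint checkpoint;

        Upload(Infos infos, Checkpoint checkpoint) {
            this.infos = infos;
            this.checkpoint = checkpoint;
        }

        boolean isConsistent() {
            return checkpoint.matches(infos);
        }
    }

    // Updated by every refresh.
    static final AtomicReference<Infos> readerInfos = new AtomicReference<>(new Infos(1, 1));
    // Updated afterwards by a refresh listener, so it can lag behind readerInfos.
    static final AtomicReference<Checkpoint> latestCheckpoint = new AtomicReference<>(new Checkpoint(1, 1));
    // Counts how often the (assumed expensive) checkpoint computation runs.
    static int recomputations = 0;

    static Checkpoint computeCheckpoint(Infos infos) {
        recomputations++;
        return new Checkpoint(infos.generation, infos.version);
    }

    /** Racy variant: infos and checkpoint are read independently and may disagree. */
    static Upload prepareRacy() {
        return new Upload(readerInfos.get(), latestCheckpoint.get());
    }

    /** Safer variant: take one infos snapshot and recompute the checkpoint only if it is stale. */
    static Upload prepareFromSnapshot() {
        Infos snapshot = readerInfos.get();
        Checkpoint checkpoint = latestCheckpoint.get();
        if (!checkpoint.matches(snapshot)) {
            checkpoint = computeCheckpoint(snapshot);
        }
        return new Upload(snapshot, checkpoint);
    }

    public static void main(String[] args) {
        // Simulate a refresh that lands before the checkpoint listener has run.
        readerInfos.set(new Infos(2, 7));
        System.out.println("racy consistent: " + prepareRacy().isConsistent());
        System.out.println("snapshot consistent: " + prepareFromSnapshot().isConsistent());
    }
}
```

In this sketch the recompute is skipped whenever the cached checkpoint already describes the snapshot, which mirrors the "only if necessary" intent above, since computing StoreFileMetadata has an IO cost.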
Related Issues
Resolves #9712
Resolves #10025
Resolves #10026
Resolves #8762
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.