Fix bug where retries within RemoteStoreRefreshListener cause infos/checkpoint mismatch #10655

Merged on Oct 19, 2023 (2 commits)

@mch2 mch2 commented Oct 16, 2023

Description

This bug was found while analyzing flaky remote store integration tests (ITs) in which replicas did not catch up with the primary. Retries within RemoteStoreRefreshListener run outside of the refresh thread, so concurrent refreshes may occur during syncSegments execution and update the on-reader SegmentInfos. A shard's latest ReplicationCheckpoint is computed and set in a refresh listener, but there is no guarantee that this listener has run before the retry fetches the infos and the checkpoint independently, so the two can be mismatched. This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos, but only when necessary, because computing StoreFileMetadata has an IO cost.

I believe this fixes all of the flaky SegmentReplicationUsingRemoteStore ITs in which the replica count is not high enough.
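The race and the shape of the fix described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Infos`, `Checkpoint`, `Shard`, `fetchIndependently`, and `fetchTogether` are hypothetical stand-ins, and the real SegmentInfos and ReplicationCheckpoint are reduced to a generation and a version. It only aims to show why reading the two values independently can produce a mismatch, and how recomputing the checkpoint alongside the infos, only when they disagree, avoids it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model, not OpenSearch code: all names here are assumptions for illustration.
final class CheckpointSketch {

    // Stand-in for SegmentInfos: only generation and version are modeled.
    static final class Infos {
        final long generation;
        final long version;
        Infos(long generation, long version) { this.generation = generation; this.version = version; }
    }

    // Stand-in for ReplicationCheckpoint.
    static final class Checkpoint {
        final long generation;
        final long version;
        Checkpoint(long generation, long version) { this.generation = generation; this.version = version; }
    }

    // The pair a retry would upload and publish together.
    static final class Snapshot {
        final Infos infos;
        final Checkpoint checkpoint;
        Snapshot(Infos infos, Checkpoint checkpoint) { this.infos = infos; this.checkpoint = checkpoint; }
    }

    static final class Shard {
        private final AtomicReference<Infos> readerInfos = new AtomicReference<>(new Infos(1, 1));
        private volatile Checkpoint latestCheckpoint = new Checkpoint(1, 1);
        // Counts checkpoint computations, standing in for the IO cost mentioned in the description.
        final AtomicInteger recomputations = new AtomicInteger();

        // A refresh installs new infos; the checkpoint listener may not have run yet.
        void refresh() {
            readerInfos.updateAndGet(i -> new Infos(i.generation, i.version + 1));
        }

        // The refresh listener that normally publishes the latest checkpoint.
        void runCheckpointListener() {
            latestCheckpoint = compute(readerInfos.get());
        }

        private Checkpoint compute(Infos infos) {
            recomputations.incrementAndGet();
            return new Checkpoint(infos.generation, infos.version);
        }

        // Racy variant: infos and checkpoint are read independently and may disagree.
        Snapshot fetchIndependently() {
            return new Snapshot(readerInfos.get(), latestCheckpoint);
        }

        // Fixed variant: derive the checkpoint from the same infos, recomputing only on mismatch.
        Snapshot fetchTogether() {
            Infos infos = readerInfos.get();
            Checkpoint cp = latestCheckpoint;
            if (cp.generation != infos.generation || cp.version != infos.version) {
                cp = compute(infos);
            }
            return new Snapshot(infos, cp);
        }
    }

    public static void main(String[] args) {
        Shard shard = new Shard();
        shard.refresh(); // concurrent refresh, listener has not run yet
        Snapshot racy = shard.fetchIndependently();
        Snapshot fixed = shard.fetchTogether();
        System.out.println("racy matches:  " + (racy.infos.version == racy.checkpoint.version));
        System.out.println("fixed matches: " + (fixed.infos.version == fixed.checkpoint.version));
    }
}
```

In this sketch the recomputation is skipped whenever the published checkpoint already matches the fetched infos, which mirrors the "only if necessary" condition in the description.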

Related Issues

Resolves #9712
Resolves #10025
Resolves #10026
Resolves #8762

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing Indexing, Bulk Indexing and anything related to indexing Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage Issues and PRs relating to data and metadata storage labels Oct 16, 2023

mch2 commented Oct 19, 2023

Gradle Check (Jenkins) Run Completed with:

#10730

mch2 commented Oct 19, 2023

Gradle Check (Jenkins) Run Completed with:

#10730 again

@ashking94 ashking94 left a comment

lgtm

gbbafna commented Oct 19, 2023


* What went wrong:
Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
> Building 2.11.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.11/distribution/archives/linux-tar/build/distributions/opensearch-min-2.11.0-SNAPSHOT-linux-x64.tar.gz

mch2 commented Oct 19, 2023

Gradle Check (Jenkins) Run Completed with:

#10154

mch2 commented Oct 19, 2023

Gradle Check (Jenkins) Run Completed with:

mixed cluster bwc tests failing, believe this will pass once #10754 is merged and rebased

…h between ReplicationCheckpoint and uploaded SegmentInfos.

Retries within RemoteStoreRefreshListener run outside of the refresh thread.  This means that concurrent refreshes
may occur during syncSegments execution updating the on-reader SegmentInfos.  A shard's latest ReplicationCheckpoint
is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently.
This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also
ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update refresh listener to recompute checkpoint from latest infos snapshot.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix broken test case by comparing segments gen

Signed-off-by: Marc Handalian <handalm@amazon.com>

spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix RemoteStoreRefreshListener tests

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>

@mch2 mch2 added the backport 2.x Backport to 2.x branch label Oct 19, 2023
@mch2 mch2 merged commit e389a09 into opensearch-project:main Oct 19, 2023
17 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 19, 2023
…heckpoint mismatch (#10655)

* Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos.

Retries within RemoteStoreRefreshListener run outside of the refresh thread.  This means that concurrent refreshes
may occur during syncSegments execution updating the on-reader SegmentInfos.  A shard's latest ReplicationCheckpoint
is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently.
This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also
ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update refresh listener to recompute checkpoint from latest infos snapshot.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix broken test case by comparing segments gen

Signed-off-by: Marc Handalian <handalm@amazon.com>

spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix RemoteStoreRefreshListener tests

Signed-off-by: Marc Handalian <handalm@amazon.com>

* add extra log

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit e389a09)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@mch2 mch2 deleted the retryfix branch October 19, 2023 21:40
mch2 pushed a commit that referenced this pull request Oct 19, 2023
…heckpoint mismatch (#10655) (#10760)

(cherry picked from commit e389a09)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
…heckpoint mismatch (opensearch-project#10655)

shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…heckpoint mismatch (opensearch-project#10655)

Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x Backport to 2.x branch bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep Indexing Indexing, Bulk Indexing and anything related to indexing skip-changelog Storage Issues and PRs relating to data and metadata storage >test-failure Test failure from CI, local build, etc.
6 participants