-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Fix flaky integration test of ShardsLimitAllocationDeciderRemoteStoreEnabledIT #18236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Divyansh Pandey <dpaandey@amazon.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #18236 +/- ##
============================================
- Coverage 72.56% 72.46% -0.11%
+ Complexity 67261 67213 -48
============================================
Files 5476 5476
Lines 310478 310487 +9
Branches 45133 45133
============================================
- Hits 225313 225005 -308
- Misses 66840 67189 +349
+ Partials 18325 18293 -32 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
❌ Gradle check result for 6de3633: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @pandeydivyansh1803!
...rch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderRemoteStoreEnabledIT.java
Outdated
Show resolved
Hide resolved
...rch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderRemoteStoreEnabledIT.java
Outdated
Show resolved
Hide resolved
|
❌ Gradle check result for 6de3633: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Divyansh Pandey <dpaandey@amazon.com>
|
❕ Gradle check result for 18d8862: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
…EnabledIT (opensearch-project#18236) Signed-off-by: Divyansh Pandey <dpaandey@amazon.com> Co-authored-by: Divyansh Pandey <dpaandey@amazon.com>
…EnabledIT (opensearch-project#18236) Signed-off-by: Divyansh Pandey <dpaandey@amazon.com> Co-authored-by: Divyansh Pandey <dpaandey@amazon.com>
…EnabledIT (opensearch-project#18236) Signed-off-by: Divyansh Pandey <dpaandey@amazon.com> Co-authored-by: Divyansh Pandey <dpaandey@amazon.com>
…EnabledIT (opensearch-project#18236) Signed-off-by: Divyansh Pandey <dpaandey@amazon.com> Co-authored-by: Divyansh Pandey <dpaandey@amazon.com>Signed-off-by: TJ Neuenfeldt <tjneu@amazon.com>
…EnabledIT (opensearch-project#18236) Signed-off-by: Divyansh Pandey <dpaandey@amazon.com> Co-authored-by: Divyansh Pandey <dpaandey@amazon.com>
Description
Fix flaky ShardsLimitAllocationDeciderRemoteStoreEnabledIT by avoiding race condition for remote backed clusters.
Problem:
Integration test for ShardsLimitAllocationDeciderRemoteStoreEnabledIT was failing intermittently due to an
AlreadyClosedExceptionin multiple threads. The exception occurred because the Directory was being closed while there were still ongoing segment upload operations to the remote store. Specifically, during the tearDown process, when the test was attempting to clean up indices, some threads were still trying to access the Directory to upload segments, but the Directory had already been closed. This race condition manifested as anAlreadyClosedExceptionwith the message "this Directory is closed" being thrown from theBaseDirectory.ensureOpen()method. The issue occurred in theRemoteDirectory.uploadBloboperation, which was attempting to read from a closed Directory while syncing segments to the remote store.Solution:
The solution is implemented through a robust cleanup mechanism in the form of a
cleanUpmethod. This method ensures proper synchronization of remote store operations before test teardown by executing a forced flush operation on all indices created during the test. The implementation uses a FlushRequest with force(true) to guarantee flush execution even without pending changes, andwaitIfOngoing(true) to ensure completion of any concurrent flush operations. This synchronization step is crucial as it allows all segment uploads to complete before the Directory is closed, effectively preventing theAlreadyClosedExceptionthat was previously occurring due to race conditions. This change has proven stable across 2000+ test iterations.Related Issues
Resolves #17931
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.