Fix flaky test RecoveryFromGatewayIT.testMultipleReplicaShardAssignme… #14424

Merged

Conversation

@SwethaGuptha SwethaGuptha commented Jun 18, 2024

Description

This test case validates shard allocation behavior for unassigned shards in batch mode with delayed shard assignment. The unassigned status of the shards should be ALLOCATION_DELAYED if any node leaves the cluster, for the duration configured in the setting EXISTING_SHARDS_ALLOCATOR_BATCH_MODE.

Flaky test error:

java.lang.AssertionError: expected:<allocation_delayed> but was:<awaiting_info>
at __randomizedtesting.SeedInfo.seed([F6064F03567011EA:1E932282AEA80D9A]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.gateway.RecoveryFromGatewayIT.testMultipleReplicaShardAssignmentWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode(RecoveryFromGatewayIT.java:921)

The test case is currently flaky for two reasons, both in the following code block:

https://github.com/opensearch-project/OpenSearch/blob/3a0c0c0b38c0b42bc519c3673d5cd4a1e3379550/server/src/internalClusterTest/java/org/opensearch/gateway/RecoveryFromGatewayIT.java#L905-L924C9

  • Main reason for flakiness: in a few runs, the unassigned reason for the shard in the allocation explain API response is AWAITING_INFO when the expected status is ALLOCATION_DELAYED. AWAITING_INFO is also a valid status while the master node is still fetching data. Hence the fix adds a waitUntil loop for ALLOCATION_DELAYED with a timeout of 2 minutes.
  • The test case can also become flaky at L#906 because it might take some time for the unassigned shards of the restarted node to be started, hence a wait was added there too.
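
For illustration, the wait-until-expected-status idea above can be sketched in plain Java. This is only a sketch and not the code merged in this PR: the class `DelayedAllocationWaitSketch`, the `Reason` enum, and the helper `waitUntilEquals` are hypothetical names, and the supplier in `main` merely simulates an allocation explain response that reports AWAITING_INFO for the first two polls.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the OpenSearch code base.
class DelayedAllocationWaitSketch {

    // Stand-in for the two unassigned statuses discussed above.
    enum Reason { AWAITING_INFO, ALLOCATION_DELAYED }

    /**
     * Polls the supplier until it returns the expected value or the timeout
     * elapses. Returns true if the expected value was observed in time.
     */
    static <T> boolean waitUntilEquals(Supplier<T> actual, T expected, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        long sleepMillis = 1;
        while (true) {
            if (expected.equals(actual.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected value
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // back off, capped at 100 ms
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Simulated status source: AWAITING_INFO twice, then ALLOCATION_DELAYED.
        Supplier<Reason> explain = () ->
            polls.incrementAndGet() < 3 ? Reason.AWAITING_INFO : Reason.ALLOCATION_DELAYED;

        boolean reached = waitUntilEquals(explain, Reason.ALLOCATION_DELAYED, 2, TimeUnit.MINUTES);
        System.out.println("reached ALLOCATION_DELAYED: " + reached + ", polls: " + polls.get());
    }
}
```

Asserting only after the wait returns lets the test tolerate the transient AWAITING_INFO state, while a shard that never reaches ALLOCATION_DELAYED still fails once the timeout expires.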

Related Issues

#14304

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


❌ Gradle check result for 796e5a8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@SwethaGuptha SwethaGuptha force-pushed the opensearch-bug-fix-branch branch from 796e5a8 to bee6a29 Compare June 18, 2024 15:26

❌ Gradle check result for bee6a29: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
@SwethaGuptha SwethaGuptha force-pushed the opensearch-bug-fix-branch branch from bee6a29 to 10e4489 Compare June 18, 2024 16:21

✅ Gradle check result for 10e4489: SUCCESS

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.63%. Comparing base (b15cb0c) to head (10e4489).
Report is 448 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14424      +/-   ##
============================================
+ Coverage     71.42%   71.63%   +0.21%     
- Complexity    59978    62052    +2074     
============================================
  Files          4985     5118     +133     
  Lines        282275   291833    +9558     
  Branches      40946    42180    +1234     
============================================
+ Hits         201603   209069    +7466     
- Misses        63999    65524    +1525     
- Partials      16673    17240     +567     


@andrross andrross added the backport 2.x Backport to 2.x branch label Jun 18, 2024
@andrross andrross merged commit 802f2e6 into opensearch-project:main Jun 18, 2024
36 of 37 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 18, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (#14424)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
(cherry picked from commit 802f2e6)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dblock pushed a commit that referenced this pull request Jun 18, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (#14424) (#14432)

(cherry picked from commit 802f2e6)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
SwethaGuptha added a commit to SwethaGuptha/OpenSearch that referenced this pull request Jun 20, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (opensearch-project#14424)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this pull request Jul 12, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (opensearch-project#14424)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
kkewwei pushed a commit to kkewwei/OpenSearch that referenced this pull request Jul 24, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (opensearch-project#14424) (opensearch-project#14432)

(cherry picked from commit 802f2e6)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
Signed-off-by: kkewwei <kkewwei@163.com>
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
…ntWithDelayedAllocationAndDifferentNodeStartTimeInBatchMode (opensearch-project#14424)

Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
Labels
backport 2.x Backport to 2.x branch skip-changelog