Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix flaky test org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections #4212

Closed
reta opened this issue Aug 15, 2022 · 8 comments · Fixed by #5439
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run

Comments

@reta
Copy link
Collaborator

reta commented Aug 15, 2022

Describe the bug
New flaky test org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections , spotted in [1]

java.lang.AssertionError: expected null, but was:<org.opensearch.index.stats.IndexingPressurePerShardStats@5eae0090>
	at __randomizedtesting.SeedInfo.seed([9EFE9529CF42C133:9166A1D8EFC3EFA6]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotNull(Assert.java:756)
	at org.junit.Assert.assertNull(Assert.java:738)
	at org.junit.Assert.assertNull(Assert.java:748)
	at org.opensearch.index.ShardIndexingPressureConcurrentE

[1] https://build.ci.opensearch.org/job/gradle-check/1711/testReport/junit/org.opensearch.index/ShardIndexingPressureConcurrentExecutionTests/testReplicaThreadedUpdateToShardLimitsAndRejections/

To Reproduce
Steps to reproduce the behavior:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections" -Dtests.seed=9EFE9529CF42C133 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr-LU -Dtests.timezone=America/Goose_Bay -Druntime.java=17

Expected behavior
Test should pass

Plugins
Standard

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: CI
  • Version main

Additional context
Add any other context about the problem here.

@reta reta added bug Something isn't working flaky-test Random test failure that succeeds on second run labels Aug 15, 2022
@dblock
Copy link
Member

dblock commented Nov 8, 2022

#3633

@dblock
Copy link
Member

dblock commented Nov 18, 2022

#5283 (comment)

@dblock
Copy link
Member

dblock commented Nov 21, 2022

#5069 (comment)

@dblock
Copy link
Member

dblock commented Nov 21, 2022

This was added in #1336, @getsaurabh02 care to take a look pls?

@Rishikesh1159
Copy link
Member

Rishikesh1159 commented Nov 23, 2022

Looked into this test a bit. The test fails only when rejection count if full (this happens randomly) i.e rejection count is equal to NUM_THREADS. Two possible workaround/Questions:
-> Can rejection count be allowed to be equal to NUM_THREADS? is this something we should allow happening in test? or
-> Should we update logic of asserting to null when all the releasable's are null(happens when rejection count is equal to NUM_THREADS).

Updating IntegTest after line with below code, fixes the issue. But not sure if this is right way forward. @getsaurabh02 can you please give your thoughts on this.

if(rejectionCount.get() == NUM_THREADS){
            assertEquals(15, shardStoreStats.getCurrentReplicaLimits());
        }
        else{
            assertNull(shardStoreStats);
        }

@dblock
Copy link
Member

dblock commented Dec 1, 2022

#4587 (comment)

@getsaurabh02
Copy link
Member

Thanks for sharing the info @Rishikesh1159. I agree with the adding the optional null check as above, since the shardStoreStats objects move from hot to cold store in case they are not updated in a while.

Here, if rejection count equals to the NUM_THREADS which means rejections happened until the last request, the shardIndexingPressure.shardStats().getIndexingPressureShardStats(shardId1) will return the shardStoreStats which was updated on the last request. In other cases, the shardStoreStats simply moves to the cold store and is returned null.

@Rishikesh1159
Copy link
Member

Thanks @getsaurabh02 for your detail explanation. What you said makes sense. I have put out a small PR to fix this behavior of test by adding the optional null check as we discussed. Can you please review the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run
Projects
None yet
4 participants