
[BUG] AwarenessAllocationIT fail due to inconsistent shard count in response #7401

Closed
peternied opened this issue May 3, 2023 · 6 comments

@peternied
Member

Describe the bug
org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

java.lang.AssertionError: 
Expected: <120>
     but: was <119>
	at __randomizedtesting.SeedInfo.seed([8EE4D794D6758BA2:A5DDBA6063373C3B]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:964)
	at org.junit.Assert.assertThat(Assert.java:930)
	at org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness(AwarenessAllocationIT.java:502)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1589)
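
For context, the check that fails at AwarenessAllocationIT.java:502 asserts on the number of started shard copies across the cluster. Below is a minimal sketch of that style of check; the variable names and the way the count is gathered are assumptions, not the exact test code:

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.ShardRouting;

// Hypothetical sketch: after creating indices with one replica spread across
// three zones, the test expects every shard copy (primaries + replicas) to be
// STARTED, 120 in total. The flaky run saw 119 because one copy stayed unassigned.
ClusterState state = client().admin().cluster().prepareState().get().getState();
int startedShards = 0;
for (ShardRouting shard : state.getRoutingTable().allShards()) {
    if (shard.started()) {
        startedShards++;
    }
}
assertThat(startedShards, equalTo(120));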
@peternied peternied added bug Something isn't working untriaged flaky-test Random test failure that succeeds on second run labels May 3, 2023
@peternied peternied self-assigned this May 3, 2023
@peternied
Member Author

peternied commented May 3, 2023

I'll pick this up and either resolve it by the end of next week or unassign myself.

@imRishN
Member

imRishN commented Jan 5, 2024

Investigated this issue. OpenSearch follows a greedy approach when allocating shards and does not compute an optimal allocation across all shards that need to be placed. Instead, a set of filters and rules (the allocation deciders) controls which nodes a shard may be assigned to.

The unassigned shard that causes the test failure is a consequence of this: the node the shard would otherwise be assigned to conflicts with the awareness allocation decider, so allocation gets stuck waiting for capacity because the only node with free space is not an allowed target. This is more likely to happen in this particular test because it creates a 15-node cluster with 120 shards, which increases the probability of ending up in such a state; a smaller cluster with fewer shards would be less likely to hit it.

Scrolling through open issues in Elasticsearch/OpenSearch, this also appears to be a known issue.
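
For reference, the forced zone awareness settings that the decider enforces look roughly like the sketch below; the exact keys and values used by the test may differ, so treat this as illustrative only:

import org.opensearch.common.settings.Settings;

// Sketch of a forced zone awareness configuration (values are illustrative).
// With forced awareness the decider refuses to place a shard copy in a zone
// that would break the balance across the forced values, even when that zone
// holds the only node with free capacity, which is how a copy can stay unassigned.
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.awareness.attributes", "zone")
    .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
    .build();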

@imRishN
Member

imRishN commented Jan 8, 2024

Closing the issue; the test has been muted and will be re-enabled once the allocator fix lands.

@imRishN imRishN closed this as completed Jan 8, 2024
@peternied
Member Author

This issue isn't fixed - the test is disabled. If we were to delete the test I'd be happy to close this issue; however, I suspect we want to fix the underlying issue. If there is another issue tracking that underlying problem, please link it here.

@peternied peternied reopened this Jan 8, 2024
@imRishN
Member

imRishN commented Jan 8, 2024

@peternied, the merged PR that mutes the test links it to the actual underlying issue. The PR muting the test is #11767, and it links the test to the underlying issue, #5908.

Feel free to close this issue if that suffices; otherwise we can keep it open.
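
For anyone following along, muting a test in OpenSearch is usually done with Lucene's @AwaitsFix annotation pointing at the tracking issue. A sketch of that pattern, which may not match the exact change made in #11767:

import org.apache.lucene.tests.util.LuceneTestCase.AwaitsFix;

// Sketch of the usual muting pattern; bugUrl points at the tracking issue.
@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/5908")
public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() {
    // test body unchanged
}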

@andrross
Member

Resolving, as the test has been muted and #5908 has been identified as the underlying cause that needs to be fixed.
