Description
Describe the bug
Test org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock is flaky.
To Reproduce
Sep 11, 2023 1:44:23 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#339,opensearch[node_t2][clusterApplierService#updateTask][T#1],5,TGRP-MinimumClusterManagerNodesIT]
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][2], node[IADuWGkCTpuWEnWUFcbkSQ], [P], s[STARTED], a[id=oar4Dv6STMWSzO-FDH4bMA]
at __randomizedtesting.SeedInfo.seed([7E7C985F304948B0]:0)
at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:752)
at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710)
at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650)
at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1623)
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.seed=7E7C985F304948B0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SD -Dtests.timezone=Europe/Lisbon -Druntime.java=20
NOTE: leaving temporary files on disk at: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.cluster.MinimumClusterManagerNodesIT_7E7C985F304948B0-001
NOTE: test params are: codec=Asserting(Lucene95), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=ar-SD, timezone=Europe/Lisbon
NOTE: Linux 5.15.0-1039-aws amd64/Eclipse Adoptium 20.0.2 (64-bit)/cpus=32,threads=1,free=204825744,total=536870912
NOTE: All tests run in this JVM: [PendingTasksBlocksIT, GetIndexIT, ActiveShardsObserverIT, MinimumClusterManagerNodesIT]
Expected behavior
The test should pass on every run.
Plugins
Standard
Screenshots
Host/Environment (please complete the following information):
https://build.ci.opensearch.org/job/gradle-check/25287/testReport/junit/org.opensearch.cluster/MinimumClusterManagerNodesIT/testThreeNodesNoClusterManagerBlock/
Additional context
https://build.ci.opensearch.org/job/gradle-check/25287/
I (@andrross) am adding the content from this comment to the description here because it has now been buried in the comment stream:
I believe I have traced this back to the commit that introduced the flakiness: 9119b6d (#9105)
The following command reliably reproduces the failure for me:
./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100
If I check out the commit immediately preceding 9119b6d, the failure does not reproduce.
This is concerning because the commit in question relates to the remote store feature, yet MinimumClusterManagerNodesIT does not exercise remote store at all, so there may be a significant regression here.
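For anyone who wants to repeat the search for the offending commit, the approach above can be automated with `git bisect`. This is only a sketch, not necessarily how the commit was originally found: the function name `bisect_flaky_test` and its arguments are hypothetical, and it assumes that 100 iterations are enough to trigger the failure on every bad commit (a lucky pass would mislead the bisection).

```shell
#!/bin/sh
# Sketch: bisect for the first commit on which the flaky test fails.
# Assumption: run from the root of an OpenSearch checkout.

TEST="org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock"

# bisect_flaky_test <known-good-ref> [<known-bad-ref>]
# (hypothetical helper; the bad ref defaults to HEAD)
bisect_flaky_test() {
  good_ref="$1"
  bad_ref="${2:-HEAD}"

  # Mark the endpoints: the bad revision first, then the good one.
  git bisect start "$bad_ref" "$good_ref"

  # git bisect run treats exit status 0 as "good" and 1-127 (except 125) as "bad".
  # Gradle exits non-zero when any of the 100 iterations fails.
  git bisect run ./gradlew ':server:internalClusterTest' --tests "$TEST" -Dtests.iters=100

  # Return the working tree to the revision it was on before bisecting.
  git bisect reset
}
```

After sourcing the script, call `bisect_flaky_test <known-good-ref>` with a revision on which the test is known to pass; git prints the first bad commit once the run finishes.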