Description
This problem occurred while running indices.update_aliases/10_basic/Basic test for multiple aliases in the qa/smoke-test-multinode directory. However, I do not think that particular test is to blame for the failure - it seems more like a very rare problem with index management in multi-node clusters.
The build where this problem was seen was https://gradle-enterprise.elastic.co/s/fknt4cque3p4e
The client-side problem that made the test fail was "index already exists":
1> [2019-08-15T18:29:39,851][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] before test
1> [2019-08-15T18:30:09,895][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
1> "stash" : {
1> "body" : {
1> "error" : {
1> "root_cause" : [
1> {
1> "type" : "resource_already_exists_exception",
1> "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
1> "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
1> "index" : "test_index",
1> "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
1> at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
1> at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
1> at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
1> at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
1> at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
1> at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
1> at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
1> at java.lang.Thread.run(Thread.java:834)
1> "
1> }
1> ],
1> "type" : "resource_already_exists_exception",
1> "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
1> "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
1> "index" : "test_index",
1> "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
1> at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
1> at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
1> at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
1> at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
1> at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
1> at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
1> at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
1> at java.lang.Thread.run(Thread.java:834)
1> "
1> },
1> "status" : 400
1> }
1> }
1> }]
1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery].
1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] after test
The underlying reason, from the server-side logs, was that a previous test's index of the same name could not be deleted:
[2019-08-15T08:28:37,221][WARN ][o.e.i.IndicesService ] [integTest-0] [test_index/THgD8CcvQyGEf_ADVnVoCw] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [test_index][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-08-15T08:28:37,240][WARN ][o.e.i.s.RetentionLeaseSyncAction] [integTest-0] unexpected error during the primary phase for action [indices:admin/seq_no/retention_lease_sync], request [RetentionLeaseSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=2, leases={peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA=RetentionLease{id='peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA', retainingSequenceNumber=0, timestamp=1565857716949, source='peer recovery'}, peer_recovery/bnbl33BbQveq_mL2_DzTcg=RetentionLease{id='peer_recovery/bnbl33BbQveq_mL2_DzTcg', retainingSequenceNumber=0, timestamp=1565857717149, source='peer recovery'}}}, shardId=[test_index][0], timeout=1m, index='test_index', waitForActiveShards=0}]
org.elasticsearch.index.IndexNotFoundException: no such index [test_index]
at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:190) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:116) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteSingleIndex(IndexNameExpressionResolver.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.action.support.replication.TransportReplicationAction.concreteIndex(TransportReplicationAction.java:234) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:651) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:795) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.postAdded(ClusterStateObserver.java:220) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService$1.run(ClusterApplierService.java:299) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
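For context on the ShardLockObtainFailedException above: deleting an index's on-disk directory requires obtaining every shard's lock first, and in this log the acquisition used a 0ms timeout while the lock was apparently still held by shard creation/recovery, so the delete gave up immediately. The snippet below is a minimal, self-contained sketch of that locking pattern using plain java.util.concurrent; the class and method names are made up for illustration and this is not the actual NodeEnvironment code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative only: a simplified model of the per-shard locking implied by the
 * stack trace above, showing how "obtaining shard lock timed out after 0ms" can
 * happen when index-directory deletion races with shard creation/recovery.
 */
class ShardLockSketch {

    // one lock per shard, keyed by "index][shardId" purely for readability
    private final Map<String, ReentrantLock> shardLocks = new ConcurrentHashMap<>();

    ReentrantLock lockFor(String shardKey) {
        return shardLocks.computeIfAbsent(shardKey, k -> new ReentrantLock());
    }

    void deleteIndexDirectory(String index, int shardCount) throws InterruptedException {
        List<ReentrantLock> acquired = new ArrayList<>();
        try {
            for (int shard = 0; shard < shardCount; shard++) {
                ReentrantLock lock = lockFor(index + "][" + shard);
                // 0ms timeout: if shard creation or recovery still holds the lock,
                // fail immediately - the analogue of ShardLockObtainFailedException
                if (lock.tryLock(0, TimeUnit.MILLISECONDS) == false) {
                    throw new IllegalStateException(
                        "[" + index + "][" + shard + "]: obtaining shard lock timed out after 0ms");
                }
                acquired.add(lock);
            }
            // all shard locks held: safe to delete the index directory here
        } finally {
            acquired.forEach(ReentrantLock::unlock);
        }
    }
}
```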
The repro command is:
./gradlew :qa:smoke-test-multinode:integTestRunner --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" \
-Dtests.method="test {yaml=indices.update_aliases/10_basic/Basic test for multiple aliases}" \
-Dtests.seed=8C4DDB5E411196EE \
-Dtests.security.manager=true \
-Dtests.locale=fr-ML \
-Dtests.timezone=Australia/ACT \
-Dcompiler.java=12 \
-Druntime.java=11
This did not reproduce for me.
The other thing that might be relevant is that, around the time this was failing, a completely different test against a different test cluster failed on the same CI worker. It was:
./gradlew :docs:integTestRunner --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/docs/update-by-query/line_253}" \
-Dtests.seed=8C4DDB5E411196EE \
-Dtests.security.manager=true \
-Dtests.locale=dyo \
-Dtests.timezone=Asia/Jerusalem \
-Dcompiler.java=12 \
-Druntime.java=11
The client-side error for that was:
1> [2019-08-15T11:29:39,823][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] before test
1> [2019-08-15T11:30:09,868][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] Stash dump on test failure [{
1> "stash" : {
1> "body" : {
1> "error" : {
1> "root_cause" : [
1> {
1> "type" : "resource_already_exists_exception",
1> "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
1> "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
1> "index" : "twitter",
1> "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
1> at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
1> at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
1> at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
1> at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
1> at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
1> at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
1> at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
1> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
1> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
1> at java.base/java.lang.Thread.run(Thread.java:834)
1> "
1> }
1> ],
1> "type" : "resource_already_exists_exception",
1> "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
1> "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
1> "index" : "twitter",
1> "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
1> at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
1> at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
1> at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
1> at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
1> at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
1> at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
1> at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
1> at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
1> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
1> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
1> at java.base/java.lang.Thread.run(Thread.java:834)
1> "
1> },
1> "status" : 400
1> }
1> }
1> }]
1> [2019-08-15T11:30:36,334][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] after test
The server-side logs for that docs test failure show nothing apart from a big gap:
[2019-08-15T08:29:39,832][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [twitter] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-08-15T08:30:24,229][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [twitter/I-JDSTD4S6KmKfAEkI95iA] deleting index
These two test clusters were different - it's not that the docs tests and the multi-node smoke tests were simultaneously talking to the same test cluster. But they were running at the same time and suffered failures around the same time.
It may be that these failures happened because the CI worker was completely overloaded by multiple test clusters running simultaneously around the time of the failures. But maybe there is something interesting in the ShardLockObtainFailedException that would be worth considering in case users see it on machines that are under heavy load.
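If this really is just a load problem, the user-visible symptom is an index that still "already exists" for a short window after it was deleted. The sketch below is purely illustrative - createIndex and AlreadyExistsException are hypothetical placeholders rather than a real Elasticsearch client API - but it shows the shape of a bounded, backed-off retry around index creation that would ride out such a window.

```java
import java.util.concurrent.TimeUnit;

/**
 * Purely illustrative: a bounded, backed-off retry around index creation, for the
 * case where a just-deleted index still "already exists" for a short window on an
 * overloaded node. createIndex and AlreadyExistsException are hypothetical
 * placeholders, not a real Elasticsearch client API.
 */
final class CreateIndexWithRetry {

    static class AlreadyExistsException extends RuntimeException {}

    static void createWithRetry(Runnable createIndex, int maxAttempts, long backoffMillis)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                createIndex.run();   // e.g. issue PUT /<index> with whatever client is in use
                return;
            } catch (AlreadyExistsException e) {
                if (attempt >= maxAttempts) {
                    throw e;         // still conflicting after all attempts - give up
                }
                // linear backoff to give a pending delete time to be applied
                TimeUnit.MILLISECONDS.sleep(backoffMillis * attempt);
            }
        }
    }
}
```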