
[CI] resource_already_exists_exception failures #45605

Closed
@droberts195

Description


This problem occurred while running indices.update_aliases/10_basic/Basic test for multiple aliases in the qa/smoke-test-multinode directory. However, I do not think that particular test is to blame for the failure - it seems more like a very rare problem with index management in multi-node clusters.

The build where this problem was seen was https://gradle-enterprise.elastic.co/s/fknt4cque3p4e

The client-side problem that made the test fail was "index already exists":

  1> [2019-08-15T18:29:39,851][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] before test
  1> [2019-08-15T18:30:09,895][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>             "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>             "index" : "test_index",
  1>             "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>         "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>         "index" : "test_index",
  1>         "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery].
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] after test
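
For context, the YAML test runner's fixture handling is effectively "delete any leftover index, then create a fresh one" before each test. A minimal sketch of that pattern with the low-level Java REST client (the host and the helper are hypothetical; only the index name comes from the logs above) shows how a leftover index that was not fully deleted server-side surfaces as this 400:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

// Hypothetical sketch of per-test fixture setup: delete leftovers, then
// create the index the test expects to be fresh.
public final class FixtureSetupSketch {

    static void recreateIndex(RestClient client, String index) throws Exception {
        try {
            // Clean up anything a previous test left behind.
            client.performRequest(new Request("DELETE", "/" + index));
        } catch (ResponseException e) {
            // A 404 just means there was nothing to delete.
            if (e.getResponse().getStatusLine().getStatusCode() != 404) {
                throw e;
            }
        }
        // If the earlier delete did not fully take effect on the server,
        // this create fails with the 400 resource_already_exists_exception
        // seen in the stash dump above.
        client.performRequest(new Request("PUT", "/" + index));
    }

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            recreateIndex(client, "test_index");
        }
    }
}
```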

The underlying reason, from the server-side logs, was that a previous test's index of the same name could not be deleted:

[2019-08-15T08:28:37,221][WARN ][o.e.i.IndicesService     ] [integTest-0] [test_index/THgD8CcvQyGEf_ADVnVoCw] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [test_index][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-08-15T08:28:37,240][WARN ][o.e.i.s.RetentionLeaseSyncAction] [integTest-0] unexpected error during the primary phase for action [indices:admin/seq_no/retention_lease_sync], request [RetentionLeaseSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=2, leases={peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA=RetentionLease{id='peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA', retainingSequenceNumber=0, timestamp=1565857716949, source='peer recovery'}, peer_recovery/bnbl33BbQveq_mL2_DzTcg=RetentionLease{id='peer_recovery/bnbl33BbQveq_mL2_DzTcg', retainingSequenceNumber=0, timestamp=1565857717149, source='peer recovery'}}}, shardId=[test_index][0], timeout=1m, index='test_index', waitForActiveShards=0}]
org.elasticsearch.index.IndexNotFoundException: no such index [test_index]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:190) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:116) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteSingleIndex(IndexNameExpressionResolver.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.concreteIndex(TransportReplicationAction.java:234) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:651) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:795) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.postAdded(ClusterStateObserver.java:220) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$1.run(ClusterApplierService.java:299) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
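
The "timed out after 0ms" part means the deletion path attempted to take the per-shard lock without waiting while another operation still held it ([shard creation], per the lock details). A simplified model of that try-lock pattern, using plain java.util.concurrent primitives rather than Elasticsearch's actual NodeEnvironment internals:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of per-shard locking: every operation that touches a
// shard's on-disk directory must hold that shard's lock first.
public final class ShardLockModel {

    private final ReentrantLock shardLock = new ReentrantLock();

    // With timeoutMs == 0 this fails immediately whenever any other
    // operation (e.g. shard creation during recovery) holds the lock,
    // which is the situation the warning above reports.
    void deleteShardDirectory(long timeoutMs) throws InterruptedException {
        if (!shardLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException(
                    "obtaining shard lock timed out after " + timeoutMs + "ms");
        }
        try {
            // ... delete the shard's directory ...
        } finally {
            shardLock.unlock();
        }
    }
}
```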

The repro command is:

./gradlew :qa:smoke-test-multinode:integTestRunner --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" \
  -Dtests.method="test {yaml=indices.update_aliases/10_basic/Basic test for multiple aliases}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-ML \
  -Dtests.timezone=Australia/ACT \
  -Dcompiler.java=12 \
  -Druntime.java=11

This did not reproduce for me.

The other thing that might be relevant is that, around the time this test was failing, a completely different test against a different test cluster failed on the same CI worker. It was:

./gradlew :docs:integTestRunner --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/docs/update-by-query/line_253}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=dyo \
  -Dtests.timezone=Asia/Jerusalem \
  -Dcompiler.java=12 \
  -Druntime.java=11

The client-side error for that test was:

  1> [2019-08-15T11:29:39,823][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] before test
  1> [2019-08-15T11:30:09,868][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>             "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>             "index" : "twitter",
  1>             "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>         "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>         "index" : "twitter",
  1>         "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T11:30:36,334][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] after test

The server-side logs for that docs test failure show nothing apart from a big gap:

[2019-08-15T08:29:39,832][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [twitter] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-08-15T08:30:24,229][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [twitter/I-JDSTD4S6KmKfAEkI95iA] deleting index

These two test clusters were different - it's not that the docs tests and the multi-node smoke tests were simultaneously talking to the same test cluster. But they were running at the same time and suffered failures around the same time.

It may be that these failures were caused by the CI worker being completely overloaded by multiple test clusters running simultaneously.

But maybe there's something interesting in the ShardLockObtainFailedException that is worth considering, in case users see it on machines that are under heavy load.
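
If users do see this on heavily loaded machines, the usual client-side mitigation is to retry the failing operation with backoff rather than treating the first failure as fatal. A hedged sketch (the retry budget and delay are arbitrary illustrative choices, not recommended values):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

// Illustrative retry-with-backoff wrapper for requests that can fail
// transiently while the cluster is under heavy load.
public final class RetrySketch {

    static Response performWithRetry(RestClient client, Request request) throws Exception {
        int attemptsLeft = 5;  // arbitrary retry budget
        long delayMs = 500;    // arbitrary initial backoff
        while (true) {
            try {
                return client.performRequest(request);
            } catch (ResponseException e) {
                int status = e.getResponse().getStatusLine().getStatusCode();
                // Only retry server-side (5xx) errors; a 4xx such as the
                // resource_already_exists_exception above will not fix itself.
                if (status < 500 || --attemptsLeft == 0) {
                    throw e;
                }
                Thread.sleep(delayMs);
                delayMs *= 2;  // exponential backoff
            }
        }
    }
}
```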

Metadata

Labels

:Delivery/Build, :Distributed Indexing/Distributed, >test-failure, Team:Delivery
