Concurrent Snapshot Create+Delete + Index Deletion for a Partial Snapshot Can Lead to Deadlock in Corner Case #61762

Closed
@original-brownbear

Description

There appears to be a corner case where running concurrent snapshot create and delete operations for partial snapshots, while also concurrently removing indices used by those snapshots from the cluster, can set a null value for a certain shard's snapshot status. This leads to the following exception:

instance-xxxxx] [found-snapshots] failed to delete snapshot [cloud-snapshot-xxxx/xxxxx] for retention
org.elasticsearch.repositories.RepositoryException: [_all] Failed to update cluster state during repository operation
	at org.elasticsearch.snapshots.SnapshotsService.failAllListenersOnMasterFailOver(SnapshotsService.java:2046) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:120) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.onFailure(SnapshotsService.java:2105) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:513) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$publishingFailed$0(MasterService.java:417) [elasticsearch-7.9.0.jar:7.9.0]
	at java.util.ArrayList.forEach(ArrayList.java:1510) [?:?]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.publishingFailed(MasterService.java:417) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService.onPublicationFailed(MasterService.java:301) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:275) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.0.jar:7.9.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publishing failed
	at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1119) ~[elasticsearch-7.9.0.jar:7.9.0]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) ~[elasticsearch-7.9.0.jar:7.9.0]
	... 11 more

I know how this comes about and will open a fix PR shortly. It's a very unlikely corner case, but it seems to have hit Cloud in at least one instance.
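
To illustrate the general failure mode described above, here is a minimal sketch of how a null status stored in a per-shard map can surface later as a failure when the state is serialized. Everything in it is hypothetical: `ShardStatusSketch`, `ShardState`, `serialize`, and `putStatus` are illustrative names, not Elasticsearch classes, and the assumption that the null surfaces during serialization for publication is not stated in this issue. It is not the actual fix.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; none of these names come from the Elasticsearch code base.
final class ShardStatusSketch {

    enum ShardState { INIT, SUCCESS, MISSING }

    // Simulates serializing the per-shard status map before publishing it.
    // A null value fails here, far away from the place where it was stored.
    static String serialize(Map<String, ShardState> shards) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, ShardState> e : shards.entrySet()) {
            out.append(e.getKey()).append('=').append(e.getValue().name()).append(';');
        }
        return out.toString();
    }

    // Defensive variant (an assumption, not the real fix): record an explicit
    // MISSING state instead of storing null for a shard whose index is gone.
    static void putStatus(Map<String, ShardState> shards, String shardId, ShardState status) {
        shards.put(shardId, status == null ? ShardState.MISSING : status);
    }

    public static void main(String[] args) {
        Map<String, ShardState> shards = new HashMap<>();
        shards.put("[index-a][0]", ShardState.SUCCESS);
        shards.put("[index-b][0]", null); // status lost, e.g. index-b deleted mid-snapshot

        try {
            serialize(shards);
        } catch (NullPointerException e) {
            System.out.println("serialization failed: null shard status");
        }

        putStatus(shards, "[index-b][0]", null); // overwrite the null with MISSING
        System.out.println(serialize(shards).contains("[index-b][0]=MISSING"));
    }
}
```

The point of the sketch is only that the write of the null and the resulting failure are separated in time, which is consistent with the exception above being reported from a later, unrelated-looking cluster state update.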

Metadata

Labels

:Distributed Coordination/Snapshot/Restore: Anything directly related to the `_snapshot/*` APIs
>bug
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
needs:triage: Requires assignment of a team area label