Description
There appears to be a corner case where running concurrent snapshot create and delete operations for partial snapshots, while also removing indices used by those snapshots from the cluster, can lead to a null value being set for a shard's snapshot status. This results in the following exception:
instance-xxxxx] [found-snapshots] failed to delete snapshot [cloud-snapshot-xxxx/xxxxx] for retention
org.elasticsearch.repositories.RepositoryException: [_all] Failed to update cluster state during repository operation
at org.elasticsearch.snapshots.SnapshotsService.failAllListenersOnMasterFailOver(SnapshotsService.java:2046) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:120) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.onFailure(SnapshotsService.java:2105) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:513) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$publishingFailed$0(MasterService.java:417) [elasticsearch-7.9.0.jar:7.9.0]
at java.util.ArrayList.forEach(ArrayList.java:1510) [?:?]
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.publishingFailed(MasterService.java:417) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService.onPublicationFailed(MasterService.java:301) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:275) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.0.jar:7.9.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publishing failed
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1119) ~[elasticsearch-7.9.0.jar:7.9.0]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) ~[elasticsearch-7.9.0.jar:7.9.0]
... 11 more
I know how this is coming about and will open a fix PR shortly. It's a very unlikely corner case, but it seems to have hit Cloud in at least one instance.