Make snapshots more resilient to node departure #71333

Open
@DaveCTurner

If a node holding a primary shard leaves the cluster, one of the replica shards is immediately promoted to primary to replace the failed copy. Today, if a snapshot is in progress when the promotion happens, the corresponding shard-level snapshot fails and the overall snapshot status is at best PARTIAL. This is a problem for graceful shutdowns (#70338), which ideally would not result in any such failures. When a replica is promoted to replace a failed primary, it would be better to retry the shard-level snapshot on the new primary instead.
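To make the proposed behaviour concrete, here is a minimal toy model in Python. It is not Elasticsearch code: `ShardSnapshot`, `on_node_left`, `overall_status` and the `retry` flag are hypothetical names invented for illustration, and the status logic is deliberately simplified. It only contrasts failing the shard-level snapshot when the primary's node departs with restarting it on the promoted replica.

```python
from dataclasses import dataclass
from enum import Enum


class ShardState(Enum):
    INIT = "INIT"        # shard-level snapshot still running
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


@dataclass
class ShardSnapshot:
    shard_id: int
    node: str                          # node holding the primary being snapshotted
    state: ShardState = ShardState.INIT


def on_node_left(shards, departed, promoted, retry):
    """Update in-progress shard snapshots after `departed` leaves the cluster.

    `promoted` maps shard_id -> node that now holds the promoted primary.
    With retry=False the shard snapshot fails (the behaviour the issue
    describes); with retry=True it is reassigned to the new primary
    (the behaviour the issue proposes). Both are hypothetical simplifications.
    """
    for shard in shards:
        if shard.node != departed or shard.state is not ShardState.INIT:
            continue  # unaffected: different node, or already finished
        new_node = promoted.get(shard.shard_id)
        if retry and new_node is not None:
            shard.node = new_node      # start over on the new primary
        else:
            shard.state = ShardState.FAILED


def overall_status(shards):
    """Simplified overall status: any failed shard makes the snapshot PARTIAL."""
    if any(s.state is ShardState.INIT for s in shards):
        return "IN_PROGRESS"
    if any(s.state is ShardState.FAILED for s in shards):
        return "PARTIAL"
    return "SUCCESS"


def run(retry):
    """Snapshot two shards while node-a departs and node-c takes over shard 0."""
    shards = [ShardSnapshot(0, "node-a"), ShardSnapshot(1, "node-b")]
    on_node_left(shards, departed="node-a", promoted={0: "node-c"}, retry=retry)
    for shard in shards:
        if shard.state is ShardState.INIT:
            shard.state = ShardState.SUCCESS  # remaining shard snapshots complete
    return overall_status(shards)
```

In this toy model `run(retry=False)` returns `"PARTIAL"` and `run(retry=True)` returns `"SUCCESS"`. A real implementation would also have to handle cases the sketch ignores, such as a shard with no replica available to promote.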

This isn't the first time this idea has come up:
