Skip to content

Fix Concurrent Snapshot Create+Delete + Delete Index #61770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 1, 2020

Conversation

original-brownbear
Copy link
Contributor

We had a bug here that is new to 7.9 were we put a null value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes #61762

We had a bug here were we put a `null` value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes elastic#61762
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 1, 2020
@@ -1715,8 +1723,7 @@ public static ClusterState updateWithSnapshots(ClusterState state,
IndexMetadata indexMetadata = metadata.index(indexName);
if (indexMetadata == null) {
// The index was deleted before we managed to start the snapshot - mark it as missing.
builder.put(new ShardId(indexName, IndexMetadata.INDEX_UUID_NA_VALUE, 0),
new SnapshotsInProgress.ShardSnapshotStatus(null, ShardState.MISSING, "missing index", null));
builder.put(new ShardId(indexName, IndexMetadata.INDEX_UUID_NA_VALUE, 0), ShardSnapshotStatus.MISSING);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before concurrent snapshots this spot would cover all possible scenarios because we'd only be dealing with shard ids for indices that still exist in the repo ever beyond this point. If an index was deleted after assignment then it would just fail in the SnapshotShardsService and things would work out that way.
But with concurrent snapshots where we could have indices deleted from under a queued up shard snapshot we have to explicitly deal with this situation.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit ca00a6f into elastic:master Sep 1, 2020
@original-brownbear original-brownbear deleted the 61762 branch September 1, 2020 10:28
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 1, 2020
We had a bug here were we put a `null` value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes elastic#61762
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 1, 2020
We had a bug here were we put a `null` value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes elastic#61762
original-brownbear added a commit that referenced this pull request Sep 1, 2020
We had a bug here were we put a `null` value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes #61762
original-brownbear added a commit that referenced this pull request Sep 1, 2020
We had a bug here were we put a `null` value into the shard
assignment mapping when reassigning work after a snapshot delete
had gone through. This only affects partial snaphots but essentially
dead-locks the snapshot process.

Closes #61762
@original-brownbear original-brownbear restored the 61762 branch December 6, 2020 18:53
@DaveCTurner
Copy link
Contributor

Linking this to #56911 so that I can find it again in future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.9.1 v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrent Snapshot Create+Delete+ Index Deletion for a Partial Snapshot can Lead to Deadlock in Corner Case
5 participants