Enable Fully Concurrent Snapshot Operations (#56911) #59578

Merged

merged 3 commits into elastic:7.x from 56911-7.x on Jul 15, 2020

Conversation

original-brownbear
Contributor

Enables fully concurrent snapshot operations:

  • Snapshot create and delete operations can be started in any order
  • Delete operations wait for snapshot finalization to finish, are batched as much as possible to improve efficiency, and, once enqueued in the cluster state, prevent new snapshots from starting on data nodes until they have executed
    • We could be even more concurrent here in a follow-up by interleaving deletes and snapshots on a per-shard level. I decided not to do this for now since it did not yet seem worth the added complexity. Thanks to batching and deduplication of deletes, the pain of having a delete stuck behind a long-running snapshot seemed manageable: dropped client connections and the resulting retries don't cause issues because delete jobs are deduplicated, and batching allows enqueuing more and more deletes even while a snapshot blocks for a long time, all of which then execute in essentially constant time (thanks to bulk snapshot deletion, deleting multiple snapshots is about as fast as deleting a single one). A rough sketch of the batching and deduplication idea follows this list.
  • Snapshot creation is completely concurrent across shards, but per-shard snapshots are linearized for each repository, as are snapshot finalizations
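
For illustration only, here is a minimal, self-contained sketch of that batching/deduplication idea. The class and method names are hypothetical and this is not the actual Elasticsearch `SnapshotsService` code; it only shows how repeated delete requests for one repository can collapse into a single bulk delete:

```java
// Hypothetical sketch, not Elasticsearch code: queued snapshot deletes are
// grouped per repository and deduplicated, so many enqueued delete requests
// drain into one bulk delete and execute in roughly constant time.
import java.util.*;

class QueuedDeletes {

    // repository name -> set of snapshot names pending deletion
    private final Map<String, Set<String>> pending = new HashMap<>();

    // Enqueue a delete; repeated requests for the same snapshots are merged.
    synchronized void enqueue(String repository, Collection<String> snapshots) {
        pending.computeIfAbsent(repository, r -> new LinkedHashSet<>()).addAll(snapshots);
    }

    // Drain everything queued for one repository into a single bulk delete.
    synchronized Set<String> drain(String repository) {
        Set<String> batch = pending.remove(repository);
        return batch == null ? Collections.emptySet() : batch;
    }
}
```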

See the updated JavaDoc and the added test cases for more details and an illustration of the functionality.

Some notes:

The queuing of snapshot finalizations and deletes and the related locking/synchronization is a little awkward in this version but can be simplified considerably with some refactoring. The problem is that snapshot finalizations resolve their listeners on the `SNAPSHOT` pool while deletes resolve theirs on the master update thread. With some refactoring, both of these could be moved to the master update thread, effectively removing the need for any synchronization around the `SnapshotService` state. I didn't do that refactoring here because it is a fairly large change that isn't necessary for the functionality, but I plan to do it in a follow-up.
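
As a toy illustration of what per-repository linearization of finalizations can look like, here is a simplified queue that runs one finalization at a time and starts the next only when the previous one has finished. All names are made up and this is not the code in this PR; it just shows the queue-and-run-one-at-a-time pattern:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executor;

// Hypothetical sketch, not the actual SnapshotsService implementation:
// finalizations for a single repository are queued and executed one at a
// time, so they are linearized even though shard snapshots run concurrently.
class RepositoryFinalizationQueue {

    private final Deque<Runnable> queue = new ArrayDeque<>();
    private final Executor executor; // stands in for the SNAPSHOT thread pool
    private boolean running;

    RepositoryFinalizationQueue(Executor executor) {
        this.executor = executor;
    }

    // Enqueue a finalization; kick off processing if none is running.
    synchronized void submit(Runnable finalization) {
        queue.addLast(finalization);
        if (running == false) {
            running = true;
            runNext();
        }
    }

    private void runNext() {
        final Runnable next;
        synchronized (this) {
            next = queue.pollFirst();
            if (next == null) {
                running = false;
                return;
            }
        }
        executor.execute(() -> {
            try {
                next.run();
            } finally {
                runNext(); // linearization: the next finalization starts only now
            }
        });
    }
}
```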

This change makes it possible to remove all of the trickery around synchronizing deletes and snapshots from SLM and completely does away with SLM errors caused by collisions between deletes and snapshots.

Snapshotting a single index in parallel with a long-running full backup will execute without having to wait for the long-running backup, as required by the ILM/SLM use case of moving indices to the "snapshot tier". Finalizations are linearized, but they are ordered according to which snapshot saw all of its shards complete first.

Backport of #56911

@original-brownbear original-brownbear added the :Distributed Coordination/Snapshot/Restore and backport labels Jul 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Jul 14, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 15, 2020
Mute BwC tests so that elastic#59578 can be merged.
original-brownbear added a commit that referenced this pull request Jul 15, 2020
Mute BwC tests so that #59578 can be merged.
@original-brownbear original-brownbear merged commit 2dd0864 into elastic:7.x Jul 15, 2020
@original-brownbear original-brownbear deleted the 56911-7.x branch July 15, 2020 01:42
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 15, 2020
Reenabling BwC Tests now that elastic#59578 is merged.
original-brownbear added a commit that referenced this pull request Jul 15, 2020
Reenabling BwC Tests now that #59578 is merged.