
Stop Setting "Completed" States on Snapshots in CS #54433


Conversation

original-brownbear (Contributor)

`SnapshotsService` will end snapshots once all their shards are in a final state, regardless of the state stored in the `SnapshotsInProgress.Entry`.
There is no point in setting `SUCCESS` when that happens. All that does is create
the strange situation where a snapshot shows as `SUCCESS` in the snapshot status APIs
when it's not yet done, and it loses the information about whether or not the snapshot
was aborted.

This change to the state machine is fully BwC and enables smarter snapshot abort logic in a follow-up that does not need to finalize an aborted snapshot only to then delete it again (keeping this in a separate PR since this change is BwC while the smarter abort logic wouldn't be).
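To make the masking problem concrete, here is a minimal, self-contained sketch (the enum and variable names are illustrative, not the real `SnapshotsInProgress` types): overwriting the entry-level state with `SUCCESS` once all shards finish both reports success before finalization has actually run and erases a prior `ABORTED` state.

```java
// Illustrative sketch only; names do not match the real Elasticsearch types.
enum State { STARTED, ABORTED, SUCCESS, FAILED }

public class EntryStateDemo {
    public static void main(String[] args) {
        State entryState = State.ABORTED; // the user aborted this snapshot
        boolean allShardsDone = true;     // its shards have since reached final states

        // Old behaviour: once all shards are done, unconditionally flip to SUCCESS.
        if (allShardsDone) {
            entryState = State.SUCCESS;   // the abort information is lost here
        }

        // Status APIs reading this state now report SUCCESS even though
        // finalization (writing the repository metadata) has not happened yet.
        System.out.println(entryState);   // prints SUCCESS, not ABORTED
    }
}
```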

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear changed the title Stop Setting completed States on Snapshots in CS Stop Setting "Completed" States on Snapshots in CS Mar 30, 2020
@original-brownbear (Contributor, Author)

Jenkins test this

@original-brownbear (Contributor, Author)

Jenkins run elasticsearch-ci/2 (unrelated security fail)

@ywelsch (Contributor) commented Apr 1, 2020

I wonder if we should introduce a new state called `FINALIZE` which shows that all shard-level actions have completed (and the snapshot was not aborted; otherwise we would keep the `ABORTED` state) and that only finalization is needed. This would make it easier to track the transitions at a high level. I'm also wondering why this change works, as `endSnapshot` is triggered by checking `entry.state().completed()`. Doing the full scan of `completed(entry.shards().values())` in `SnapshotsService.applyClusterState` is much more expensive, especially if it needs to be done on every CS update that happens while a snapshot is pending completion.
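A self-contained sketch of the cost concern raised here (the types and names are illustrative, not the actual `SnapshotsService` code): deriving completion purely from the shard states means re-scanning every shard entry of every in-progress snapshot on each cluster-state update.

```java
import java.util.List;
import java.util.Map;

// Illustrative model only; not the real SnapshotsService implementation.
public class FinalizationScanDemo {
    enum ShardState { STARTED, SUCCESS, FAILED }

    record Entry(String snapshot, Map<String, ShardState> shards) {}

    // Called for every cluster-state update while snapshots are in progress.
    static void applyClusterState(List<Entry> inProgress) {
        for (Entry entry : inProgress) {
            // Full O(#shards) scan on every update, even if nothing changed.
            boolean allShardsCompleted = entry.shards().values().stream()
                .allMatch(s -> s != ShardState.STARTED);
            if (allShardsCompleted) {
                endSnapshot(entry); // kick off finalization
            }
        }
    }

    static void endSnapshot(Entry entry) {
        System.out.println("finalizing " + entry.snapshot());
    }

    public static void main(String[] args) {
        applyClusterState(List.of(
            new Entry("snap-1", Map.of("shard-0", ShardState.SUCCESS, "shard-1", ShardState.STARTED)),
            new Entry("snap-2", Map.of("shard-0", ShardState.SUCCESS))
        ));
        // Only snap-2 is finalized; snap-1 still has a running shard.
    }
}
```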

@original-brownbear (Contributor, Author)

@ywelsch

> Doing the full scan of `completed(entry.shards().values())` in `SnapshotsService.applyClusterState` is much more expensive, especially if this needs to be done on every CS update

This already happens anyway. The only situation that gets more expensive is when the snapshot has actually finished, because that short-circuits the condition we use for finalization:

```java
entry.state().completed()
    || initializingSnapshots.contains(entry.snapshot()) == false
        && (entry.state() == State.INIT || completed(entry.shards().values()))
```

That said, I pushed f62c701, which exploits our logic that keeps track of which snapshots are currently being finalized so that we skip the check for snapshots we're already finalizing, so now it's always cheap :)
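A hedged sketch of that optimization (the field and method names are assumptions for illustration, not the verified contents of commit f62c701): once a snapshot has been handed off for finalization, remember it in a set and skip the per-shard scan on later cluster-state updates, so the full scan runs at most once per snapshot.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch; not the actual diff of commit f62c701.
public class EndingSnapshotsDemo {
    enum ShardState { STARTED, SUCCESS, FAILED }

    // Snapshots for which finalization is already running (hypothetical field).
    private final Set<String> endingSnapshots = Collections.synchronizedSet(new HashSet<>());

    void maybeEndSnapshot(String snapshot, Collection<ShardState> shards) {
        if (endingSnapshots.contains(snapshot)) {
            return; // already finalizing: skip the expensive shard scan
        }
        boolean allShardsCompleted = shards.stream().allMatch(s -> s != ShardState.STARTED);
        // add() returns false if another caller won the race, so we
        // trigger finalization at most once per snapshot.
        if (allShardsCompleted && endingSnapshots.add(snapshot)) {
            endSnapshot(snapshot);
        }
    }

    private void endSnapshot(String snapshot) {
        System.out.println("finalizing " + snapshot);
    }
}
```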

> I'm wondering why this change here works

See above, we're doing the check already anyway.

> This makes it easier at the high level to track the transitions

I'd rather not do that for now, to be honest; it's a lot of added complexity when we'll have some BwC-related changes for concurrent snapshots anyway. (This also changes what the API returns unless we add some logic to keep the output the same, so we couldn't easily backport the change to 7.x, I guess.)

@original-brownbear (Contributor, Author)

@ywelsch could you take another look here if you have a sec? Should be a quick one :)
Having this in would make some other work easier (and apparently stabilize some ILM/SLM integ tests). Thanks!

@tlrx (Member) left a comment

LGTM

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@ywelsch ywelsch removed their request for review July 23, 2020 07:26
@andreidan andreidan removed the v7.10.0 label Oct 7, 2020
@original-brownbear (Contributor, Author)

Closing this; it has become irrelevant now that we have concurrent snapshots and all of this code has changed. The general idea is still a possible optimization, though.

Labels: :Distributed Coordination/Snapshot/Restore, >non-issue, Team:Distributed (Obsolete)