Manage retention of failed snapshots in SLM #47617

Merged
merged 9 commits into elastic:master on Oct 8, 2019

Conversation

gwbrown (Contributor) commented Oct 4, 2019

Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.

With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
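
To make the strategy concrete, here is a minimal, self-contained sketch of the deletion decision for a single failed snapshot. This is not the code in this PR; the class and method names (`FailedSnapshotRetentionSketch`, `SnapshotInfo`, `shouldDeleteFailedSnapshot`) and the use of `java.time` types are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// A minimal sketch of the rules described above. All names here are
// illustrative assumptions, not the actual SLM retention classes.
public class FailedSnapshotRetentionSketch {

    enum State { SUCCESS, FAILED }

    record SnapshotInfo(String name, State state, Instant startTime) {}

    // Decide whether a FAILED snapshot taken by a policy is eligible for deletion.
    static boolean shouldDeleteFailedSnapshot(SnapshotInfo failed,
                                              Duration expireAfter,   // null if the policy has no expire_after
                                              Instant now,
                                              List<SnapshotInfo> policySnapshots) {
        if (expireAfter != null) {
            // Rule 1: keep the failed snapshot until expire_after has elapsed, then delete it.
            return failed.startTime().plus(expireAfter).isBefore(now);
        }
        // Rule 2: with no expire_after, delete only once a more recent successful snapshot
        // from the same policy exists (until then the failure may help with troubleshooting).
        return policySnapshots.stream()
            .anyMatch(s -> s.state() == State.SUCCESS && s.startTime().isAfter(failed.startTime()));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-10-08T00:00:00Z");
        SnapshotInfo failed = new SnapshotInfo("snap-1", State.FAILED, now.minus(Duration.ofDays(10)));
        SnapshotInfo newerSuccess = new SnapshotInfo("snap-2", State.SUCCESS, now.minus(Duration.ofDays(1)));

        // expire_after=30d: the 10-day-old failure has not expired yet, so it is kept -> false
        System.out.println(shouldDeleteFailedSnapshot(failed, Duration.ofDays(30), now, List.of(failed)));
        // no expire_after, but a newer successful snapshot exists -> true
        System.out.println(shouldDeleteFailedSnapshot(failed, null, now, List.of(failed, newerSuccess)));
        // Note: FAILED snapshots are never counted toward min_count or max_count,
        // so only the two rules above apply to them.
    }
}
```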

Implements part of #46988

Labelled non-issue because this feature hasn't yet shipped.

gwbrown added the >non-issue, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.0.0, and v7.5.0 labels on Oct 4, 2019
gwbrown requested a review from dakrone on Oct 4, 2019 at 23:33
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (:Core/Features/ILM)

gwbrown (Contributor, Author) commented Oct 4, 2019

I tried to write an integration test for this, but had a heck of a time getting a failed snapshot that actually ended up in the repository. I'll look a bit more to see if there's a way I'm missing.

dakrone (Member) left a comment

Thanks for working on this Gordon, I left some comments. In terms of an integration test, I think you might be able to use MockRepository to "block" or cause errors during the snapshot the way that SLMSnapshotBlockingIntegTests does?

gwbrown (Contributor, Author) commented Oct 7, 2019

I've pushed most of the changes but would still like to get an integration test in - I believe I've found a way - so you can hold off on re-reviewing until I get that in.

gwbrown requested a review from dakrone on Oct 8, 2019 at 00:09
dakrone (Member) left a comment

LGTM, thanks for splitting some of the logic, it was easier to follow this time around.


logger.info("--> start snapshot");
ActionFuture<ExecuteSnapshotLifecycleAction.Response> snapshotFuture = client()
.execute(ExecuteSnapshotLifecycleAction.INSTANCE, new ExecuteSnapshotLifecycleAction.Request(policyId));
dakrone (Member) commented on the snippet above:

There's an `executePolicy` helper that returns the snapshot name as a `String` (for future tests)

gwbrown (Contributor, Author) commented Oct 8, 2019

org.gradle.internal.remote.internal.ConnectException: Could not connect to server [8c6184eb-af0f-4e53-8f8b-d35a4142e201 port:42333, addresses:[/0:0:0:0:0:0:0:1, /127.0.0.1]]. Tried addresses: [/0:0:0:0:0:0:0:1, /127.0.0.1]. 🤷‍♂

@elasticmachine run elasticsearch-ci/1

gwbrown merged commit e221f86 into elastic:master on Oct 8, 2019
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2019
gwbrown added a commit that referenced this pull request Oct 8, 2019