Manage retention of failed snapshots in SLM #47617

Merged
merged 9 commits into elastic:master on Oct 8, 2019

Conversation

gwbrown (Contributor) commented Oct 4, 2019

Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.

With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
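
To make the strategy concrete, here is a minimal, self-contained sketch of the deletion decision for a single failed snapshot. This is not the code in this PR; the class and method names (`FailedSnapshotRetentionSketch`, `SnapshotInfo`, `shouldDeleteFailedSnapshot`) and the use of `java.time` types are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// A minimal sketch of the rules described above. All names here are
// illustrative assumptions, not the actual SLM retention classes.
public class FailedSnapshotRetentionSketch {

    enum State { SUCCESS, FAILED }

    record SnapshotInfo(String name, State state, Instant startTime) {}

    // Decide whether a FAILED snapshot taken by a policy is eligible for deletion.
    static boolean shouldDeleteFailedSnapshot(SnapshotInfo failed,
                                              Duration expireAfter,   // null if the policy has no expire_after
                                              Instant now,
                                              List<SnapshotInfo> policySnapshots) {
        if (expireAfter != null) {
            // Rule 1: keep the failed snapshot until expire_after has elapsed, then delete it.
            return failed.startTime().plus(expireAfter).isBefore(now);
        }
        // Rule 2: with no expire_after, delete only once a more recent successful snapshot
        // from the same policy exists (until then the failure may help with troubleshooting).
        return policySnapshots.stream()
            .anyMatch(s -> s.state() == State.SUCCESS && s.startTime().isAfter(failed.startTime()));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-10-08T00:00:00Z");
        SnapshotInfo failed = new SnapshotInfo("snap-1", State.FAILED, now.minus(Duration.ofDays(10)));
        SnapshotInfo newerSuccess = new SnapshotInfo("snap-2", State.SUCCESS, now.minus(Duration.ofDays(1)));

        // expire_after=30d: the 10-day-old failure has not expired yet, so it is kept -> false
        System.out.println(shouldDeleteFailedSnapshot(failed, Duration.ofDays(30), now, List.of(failed)));
        // no expire_after, but a newer successful snapshot exists -> true
        System.out.println(shouldDeleteFailedSnapshot(failed, null, now, List.of(failed, newerSuccess)));
        // Note: FAILED snapshots are never counted toward min_count or max_count,
        // so only the two rules above apply to them.
    }
}
```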

Implements part of #46988

Labelled non-issue because this feature hasn't yet shipped.

gwbrown added the >non-issue, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.0.0, and v7.5.0 labels on Oct 4, 2019
gwbrown requested a review from dakrone on Oct 4, 2019 at 23:33
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (:Core/Features/ILM)

gwbrown (Contributor, Author) commented Oct 4, 2019

I tried to write an integration test for this, but had a heck of a time getting a failed snapshot that actually ended up in the repository. I'll look a bit more to see if there's a way I'm missing.

dakrone (Member) left a comment

Thanks for working on this Gordon, I left some comments. In terms of an integration test, I think you might be able to use MockRepository to "block" or cause errors during the snapshot the way that SLMSnapshotBlockingIntegTests does?

gwbrown (Contributor, Author) commented Oct 7, 2019

I've pushed most of the changes but would still like to get an integration test in - I believe I've found a way - so you can hold off on re-reviewing until I get that in.

gwbrown requested a review from dakrone on Oct 8, 2019 at 00:09
dakrone (Member) left a comment

LGTM, thanks for splitting some of the logic, it was easier to follow this time around.


logger.info("--> start snapshot");
ActionFuture<ExecuteSnapshotLifecycleAction.Response> snapshotFuture = client()
.execute(ExecuteSnapshotLifecycleAction.INSTANCE, new ExecuteSnapshotLifecycleAction.Request(policyId));
dakrone (Member) commented on the snippet above:

There's an `executePolicy` helper that returns the snapshot name as a `String` (for future tests)

gwbrown (Contributor, Author) commented Oct 8, 2019

org.gradle.internal.remote.internal.ConnectException: Could not connect to server [8c6184eb-af0f-4e53-8f8b-d35a4142e201 port:42333, addresses:[/0:0:0:0:0:0:0:1, /127.0.0.1]]. Tried addresses: [/0:0:0:0:0:0:0:1, /127.0.0.1]. 🤷‍♂

@elasticmachine run elasticsearch-ci/1

gwbrown merged commit e221f86 into elastic:master on Oct 8, 2019
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2019
gwbrown added a commit that referenced this pull request Oct 8, 2019