Description
SLM as a standalone snapshot-taking tool is taking shape as described in #38461. However, to fully utilize SLM, we should implement retention for the snapshots that SLM takes.
Policy definition would change to something like:
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
Snapshot retention would run on a schedule, given as a cron expression in the newly introduced slm.retention_schedule cluster setting. This would allow administrators to control when snapshots are deleted, so that deletion does not interfere with other cluster operations.
Potentially, SLM retention would need to cap the amount of time spent deleting snapshots (probably with another cluster setting) so long-running deletes don't cause issues with other cluster operations.
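The time cap described above could look roughly like the following. This is a minimal Python sketch, not the SLM implementation: `delete_within_budget`, `delete_snapshot`, and `max_seconds` are hypothetical names, and the cluster setting that would supply the budget is not named in this issue.

```python
import time
from typing import Callable, Iterable, List


def delete_within_budget(
    snapshot_names: Iterable[str],
    delete_snapshot: Callable[[str], None],  # hypothetical: performs one delete
    max_seconds: float,                      # hypothetical time budget
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Delete snapshots in order until the time budget is used up.

    Returns the names that were deleted. Anything left over would wait
    for the next scheduled retention run.
    """
    deadline = clock() + max_seconds
    deleted: List[str] = []
    for name in snapshot_names:
        # The budget is checked before each delete starts, so a single
        # slow delete can still run past the deadline.
        if clock() >= deadline:
            break
        delete_snapshot(name)
        deleted.append(name)
    return deleted
```

Because the check happens only between deletes, this bounds when new deletes may start rather than the total wall-clock time; interrupting an in-flight delete would need a different mechanism.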
Potential list of snapshot conditions:
- age-based retention (delete snapshots after N days)
- minimum number of snapshots to keep
- maximum number of snapshots to allow (delete oldest if there are too many)
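One possible way to combine these three conditions is sketched below in Python. It is an assumption about how the options interact, not a description of the final behavior: here `min_count` always protects the newest snapshots, `max_count` removes anything beyond that many, and `expire_after` removes the rest by age. The `Snapshot` class and `snapshots_to_delete` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for snapshot info
    name: str
    start_time: datetime


def snapshots_to_delete(
    snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta] = None,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[Snapshot]:
    """Return the snapshots eligible for deletion, oldest first."""
    # Rank 0 is the newest snapshot.
    newest_first = sorted(snapshots, key=lambda s: s.start_time, reverse=True)
    doomed: List[Snapshot] = []
    for rank, snap in enumerate(newest_first):
        if min_count is not None and rank < min_count:
            continue  # always keep the newest min_count snapshots
        if max_count is not None and rank >= max_count:
            doomed.append(snap)  # too many snapshots: drop the oldest ones
        elif expire_after is not None and now - snap.start_time > expire_after:
            doomed.append(snap)  # older than the age limit
    doomed.reverse()  # collected newest first, so reverse to get oldest first
    return doomed
```

With the example policy's values (`expire_after` of 14 days, `max_count` of 30, `min_count` of 4), a policy that has only four snapshots would delete nothing, however old they are.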
Some things to work out
- What should we do with FAILED/PARTIAL snapshots? Should they be treated as subject to retention? Separate retention?
  For the first release, PARTIAL snapshots are treated as failed and are not eligible for retention.
- Are there retry policies for deletion, or should we wait for the next invocation of the retention task?
- Does the order of old snapshot deletion matter?
  Oldest snapshots will be deleted first.
Task Checklist
- Add support for _meta in CreateSnapshotRequest (@gwbrown) Add custom metadata to snapshots #41281
- Send _meta associating each snapshot with the policy that created it (@gwbrown) Include SLM policy name in Snapshot metadata #43132
- Create the feature branch (slm-retention) (@dakrone) Add base framework for snapshot retention #43605
- Modify SnapshotLifecyclePolicy to support retention configuration (@dakrone) Add SnapshotRetentionConfiguration for retention configuration #43777
- Modify SnapshotRetentionTask to implement snapshot deletion (@dakrone) Implement SnapshotRetentionTask's snapshot filtering and deletion #44764
- Implement the rest of the SnapshotRetentionConfiguration predicates (@dakrone) Add min_count and max_count as SLM retention predicates #44926
- Add separate API reporting of retention statistics/information (@dakrone) Add SLM metrics gathering and endpoint #45362
- Add HLRC support for SLM stats endpoint (@dakrone) Expose SLM policy stats in get SLM policy API #45989
- Add per-policy retention metrics to the GetSnapshotLifecyclePolicy API (@dakrone) Expose SLM policy stats in get SLM policy API #45989
- Make retention wait for ongoing snapshots before attempting to delete (@dakrone) Retry SLM retention after currently running snapshot completes #45802
- Store snapshot retention actions in the SLM history index (@gwbrown) Record history of SLM retention actions #45513
- Add or document creating a Watch that allows reporting failed SLM retention info (@gwbrown)
- Time-bound time spent in retention snapshot deletion (@dakrone) Time-bound deletion of snapshots in retention delete function #45065
- Add version checks to ensure we are compatible with 7.4
- Ensure SLM retention obeys the ILM stop/start OperationMode (@dakrone) Skip SLM retention if ILM is STOPPING or STOPPED #45869
- Investigate retention of data in snapshots based on document/data age (put into snap meta?) instead of snapshot age (see: Implement retention of snapshots based on the document's timestamp date #45252)
- Update SLM stats not to use dynamic key names (@gwbrown) Change SLM stats format #46991
- Decide on treatment of FAILURE and PARTIAL snapshots Handle retention of failed and partial snapshots in SLM #46988 (@gwbrown) Manage retention of failed snapshots in SLM #47617
- Merge to master
- Add API to execute retention manually (@dakrone) Add API to execute SLM retention on-demand #47405
- Add cooldown period in between SLM operations Add a configurable cooldown period between SLM operations #47520 (@dakrone)
- Documentation (@dakrone) Add Snapshot Lifecycle Retention documentation #47545
- Decide on a default retention schedule (@dakrone) Set default SLM retention invocation time #47604
- Separate start/stop/status API from ILM (@dakrone) Separate SLM stop/start/status API from ILM #47710
- Testing
  - Tests with security (@dakrone) Add a test for SLM retention with security enabled #47608
  - Tests that we handle ongoing snapshots and/or deletion correctly (added in Retry SLM retention after currently running snapshot completes #45802)
  - Test BWC and cluster restarting (to avoid bugs like SLM metadata incorrects skips parsing operation mode #46499) (@dakrone)