Description
ILM has been included in Elasticsearch, which allows us to manage the lifecycle
of an index. However, this lifecycle management does not currently include
periodic snapshots of the index.
In order to provide a full replacement for other periodic cluster management
tools (such as Curator), we should add snapshot management to
Elasticsearch.
Ideally this would fall under the same sort of management that ILM provides. The
difference, however, is that snapshots are multi-index, whereas index lifecycle
policies are applied to a single index (and all actions are executed on a single
index).
We need a way of specifying periodic and/or scheduled snapshots of a given set
of indices using a specific repository, perhaps something like this (all of the
API is made up):
    PUT /_slm/policy/snapshot-every-day
    {
      // Run this every day at 2:30am
      "schedule": "0 30 2 * * ?",
      // What the snapshot should be named, supporting date-math
      "name": "<production-snap-{now/d}>",
      // Which snapshot repository to use for the snapshot
      "repository": "my-s3-repository",
      // "config" is a map of all the options that the regular snapshot API takes
      "config": {
        "indices": ["foo-*", "important"],
        "ignore_unavailable": true,
        "include_global_state": false
      }
    }
Elasticsearch will then manage taking snapshots of the given indices for the
repository on the schedule specified. The status of the snapshots would have to
be stored somewhere, likely in an index (`.tasks` perhaps?).
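To make the date-math naming concrete, here is a minimal sketch of how a name such as `<production-snap-{now/d}>` might be expanded when a snapshot is triggered. The function name `resolve_snapshot_name` is hypothetical, only the `{now/d}` expression is handled, and the `yyyy.MM.dd` output format is assumed to match the default used for date-math index names.

```python
import re
from datetime import datetime, timezone

# Assumption: only "{now/d}" (the current time rounded down to the day) is
# supported, rendered as yyyy.MM.dd in UTC.
_NOW_ROUNDED_TO_DAY = re.compile(r"\{now/d\}")


def resolve_snapshot_name(pattern: str, now: datetime) -> str:
    """Expand a date-math snapshot name; plain names are returned unchanged."""
    if not (pattern.startswith("<") and pattern.endswith(">")):
        return pattern
    day = now.astimezone(timezone.utc).strftime("%Y.%m.%d")
    # Strip the enclosing angle brackets, then substitute the date expression.
    return _NOW_ROUNDED_TO_DAY.sub(day, pattern[1:-1])
```

Under these assumptions, a policy triggered at 02:30 UTC on 1 March 2019 would produce a snapshot named `production-snap-2019.03.01`.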
Some other things that would be nice (but not required) to support:
- Snapshots every N minutes, where N only starts counting from the completion of
  the previous snapshot (for example, a snapshot every 30 minutes that takes 4
  minutes to complete would start a snapshot at 00:00, and then the next would
  be at 00:34, 30 minutes after the completion of the previous snapshot).
- Retention of snapshots. Specifying something like `"max_count": 10`, meaning to
  keep the last 10 snapshots, or `"max_age": "7d"`, meaning to keep a week's
  worth of snapshots. The old snapshot deletion would be managed by ES.
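The two optional behaviours above can be sketched as follows. This is an illustration only: `next_start` and `expired_snapshots` are hypothetical helpers, and treating `max_count` and `max_age` as independent limits (a snapshot is deleted if it violates either one that is set) is an assumption, since the proposal does not say how they would combine.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def next_start(previous_finished: datetime, interval: timedelta) -> datetime:
    """The interval starts counting when the previous snapshot completes."""
    return previous_finished + interval


def expired_snapshots(
    snapshots: List[Tuple[str, datetime]],
    now: datetime,
    max_count: Optional[int] = None,
    max_age: Optional[timedelta] = None,
) -> List[str]:
    """Return the names of snapshots that retention would delete.

    `snapshots` is a list of (name, time taken) pairs. A limit left as None
    is not applied.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    expired = []
    for position, (name, taken_at) in enumerate(newest_first):
        over_count = max_count is not None and position >= max_count
        too_old = max_age is not None and now - taken_at > max_age
        if over_count or too_old:
            expired.append(name)
    return expired
```

With a 30-minute interval and a snapshot that starts at 00:00 and finishes at 00:04, `next_start` yields 00:34, matching the example in the first bullet.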
Task Checklist
- Basic CRUD for snapshot lifecycle policies (@dakrone) Add SnapshotLifecycleService and related CRUD APIs #39795
- Correctly handle updates and deletes to snapshot lifecycle policies (@dakrone) Handle snapshot lifecycle policy updates and deletions #40062
- Issue snapshot request when job is triggered (@dakrone) Take a snapshot for the policy when the SLM policy is triggered #40383
- Persist debugging and error information about making snapshot requests (@gwbrown) Record most recent snapshot policy success/failure #40619
- Persist a history of successful/failed snapshots in an ES index (@gwbrown) Record SLM history into an index #41707
- Add validation for snapshot lifecycle policies (check repo exists and pass its validation, check snapshot name doesn't break S3, etc) (@dakrone) Validate snapshot lifecycle policies #40654
- Hook into the existing ILM stop/start so users can perform maintenance (@dakrone) Hook SLM into ILM's start and stop APIs #40871
- Change URI paths to be under `/_slm/policy` (currently `GET|PUT|DELETE /_ilm/snapshot/<policy-id>`) (@dakrone) Change SLM endpoint from /_ilm/* to /_slm/* #41320
- Add API to execute a snapshot for a policy now rather than waiting for the scheduled time (@dakrone) Add API to execute SLM policy on demand #41038
- Display "the next time this policy will execute is: ____" with the success/failure/info when retrieving policy (@dakrone) Add next_execution to SLM policy metadata #41221
- Ensure that SLM has a dedicated cluster privilege and that its actions are separate from ILM actions (@dakrone) Add `manage_slm` and `read_slm` cluster privileges #41607
- Documentation (@dakrone) Add initial documentation for SLM #41510
- Package level javadocs (@jbaiera) Add Snapshot Lifecycle Management Package Docs #43535
- Document what security privileges are necessary for using SLM (@dakrone) Add a note mentioning the privileges needed for SLM #43708
- High Level Rest Client Support (@dakrone) High Level Rest Client support for SLM #41767
- Testing
- Add integration test for SLM x-pack security role (@dakrone) Add security test testing SLM cluster privileges #42678
- Manual testing (everyone)
- Retention
- Add support for `_meta` in `CreateSnapshotRequest` (@gwbrown) Add custom metadata to snapshots #41281
- Send `_meta` associating each snapshot with the policy that created it (@gwbrown) Include SLM policy name in Snapshot metadata #43132
- Implement retention (See: Retention for Snapshot Lifecycle Management #43663)
- Add support for