SLM retention fails if there are multiple policies using different repositories and one repository does not exist #92849

Closed
@dakrone

Elasticsearch Version

7.x and 8.x

Installed Plugins

No response

Java Version

bundled

OS Version

All

Problem Description

SLM retention retrieves the snapshots for all repositories in a single request. If that request fails, for example because one of the repositories no longer exists, we consider the entire SLM retention task failed, even though we might have been able to retrieve snapshots from the other repositories individually. We should probably pre-check the cluster state for each repository's existence and filter out any missing repositories before retrieving snapshots for SLM retention execution.
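
The proposed pre-check could look roughly like the sketch below. It is only an illustration of the filtering idea using plain Java collections. The class name RetentionRepositoryFilter and the method existingRepositories are hypothetical and are not part of Elasticsearch, and the set of registered repositories is assumed to have been read from the cluster state beforehand.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Elasticsearch code: illustrates the proposed pre-check only.
public class RetentionRepositoryFilter {

    /**
     * Returns the repositories referenced by SLM policies that are also
     * registered (assumed to come from the cluster state), keeping policy order.
     */
    static Set<String> existingRepositories(Collection<String> policyRepositories,
                                            Set<String> registeredRepositories) {
        Set<String> existing = new LinkedHashSet<>();
        for (String repo : policyRepositories) {
            // A repository that is no longer registered is skipped
            // instead of failing the whole retention run.
            if (registeredRepositories.contains(repo)) {
                existing.add(repo);
            }
        }
        return existing;
    }

    public static void main(String[] args) {
        // Mirrors the reproduction: "missing" was deleted, "repo" is still registered.
        List<String> policyRepositories = List.of("repo", "missing");
        Set<String> registered = Set.of("repo");
        System.out.println(existingRepositories(policyRepositories, registered)); // prints [repo]
    }
}
```

With a filter like this, retention would still run for daily-snapshots against repo, while daily-snapshots2 would be skipped (and presumably logged) because its repository is gone.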

Steps to Reproduce

Run ES with ./gradlew run -Dtests.es.path.repo=/tmp, then:

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.slm":"TRACE"
  }
}

PUT /_snapshot/repo
{
  "type": "fs",
  "settings": {
    "location": "/tmp/foo"
  }
}

PUT /_snapshot/missing
{
  "type": "fs",
  "settings": {
    "location": "/tmp/foo2"
  }
}

PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "repo",
  "config": {
    "ignore_unavailable": false,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "1s"
  }
}

PUT /_slm/policy/daily-snapshots2
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "missing",
  "config": {
    "ignore_unavailable": false,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "1s"
  }
}

DELETE /_snapshot/missing

GET /_slm/policy

PUT /_slm/policy/daily-snapshots/_execute

POST /_slm/_execute_retention

Logs (if relevant)

The failure in the logs will look like:

[2023-01-11T13:47:27,594][INFO ][o.e.x.s.a.TransportExecuteSnapshotRetentionAction] [runTask-0] manually triggering SLM snapshot retention
[2023-01-11T13:47:27,595][INFO ][o.e.x.s.SnapshotRetentionTask] [runTask-0] starting SLM retention snapshot cleanup task
[2023-01-11T13:47:27,596][TRACE][o.e.x.s.SnapshotRetentionTask] [runTask-0] policies with retention enabled: [daily-snapshots, daily-snapshots2]
[2023-01-11T13:47:27,596][TRACE][o.e.x.s.SnapshotRetentionTask] [runTask-0] fetching snapshots from repositories: [repo, missing]
[2023-01-11T13:47:27,599][DEBUG][o.e.x.s.SnapshotRetentionTask] [runTask-0] unable to retrieve snapshots for [[repo, missing]] repositories org.elasticsearch.repositories.RepositoryMissingException: [missing] missing
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.repositories.get.TransportGetRepositoriesAction.getRepositories(TransportGetRepositoriesAction.java:105)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:116)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:67)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.support.master.TransportMasterNodeAction.executeMasterOperation(TransportMasterNodeAction.java:124)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:235)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:958)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

[2023-01-11T13:47:27,603][ERROR][o.e.x.s.SnapshotRetentionTask] [runTask-0] error during snapshot retention task org.elasticsearch.repositories.RepositoryMissingException: [missing] missing
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.repositories.get.TransportGetRepositoriesAction.getRepositories(TransportGetRepositoriesAction.java:105)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:116)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:67)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.support.master.TransportMasterNodeAction.executeMasterOperation(TransportMasterNodeAction.java:124)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:235)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:958)
	at org.elasticsearch.server@8.7.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
