Description
This has been brought up a few times in different forms, see #46161, #73349, #83039, #85054.
Any ILM policy with a `max_age` associated with the `rollover` action could trigger this scenario, but in order to talk about something concrete, I'll use metricbeat as an example (double emphasizing, though: this isn't unique to metricbeat, it's just the nature of the way `rollover` currently works with a `max_age`).
With a test 8.1.3 Elasticsearch cluster, I ran `metricbeat-8.1.2` for a few seconds and then stopped it, and then ran `metricbeat-8.1.3` for a bit longer. The default metricbeat policy has rollover with `"max_age" : "30d"` (30 days), but in order to illustrate this problem better, I've set that to `"1m"` (1 minute) instead, and lowered the ILM poll interval so that the policy is evaluated every few seconds:
```
PUT /_cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m"
          }
        }
      }
    }
  }
}
```
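(Not part of the reproduction per se, but if you want to watch ILM evaluating these conditions as it goes, the ILM explain API works nicely; the `filter_path` below is just to trim the response down to a few interesting fields:)

```
GET .ds-metricbeat-*/_ilm/explain?filter_path=indices.*.age,indices.*.action,indices.*.step
```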
After a few minutes, my cluster looks like this:
```
GET _cat/indices/.ds-metricbeat-*?s=index

yellow open .ds-metricbeat-8.1.2-2022.04.26-000001 GBqDAprYSl2NmFzi81n9Ug 1 1 1134 0 652.2kb 652.2kb
yellow open .ds-metricbeat-8.1.2-2022.04.26-000002 Ybd4SCiWT0-7W0v9zKPR4A 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000003 3_9OmFkOSKaEfF_J-_D9TA 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000004 4olQItwcTtCOWrBotcqoLw 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000005 N9_gYkcORWSVfacUwnDegw 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.3-2022.04.26-000001 kWW-N_bfRbO0vMR4z3F72g 1 1 862 0 639.2kb 639.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000002 qzu-L-zZQqqm-6GQAZMtgA 1 1 235 0 431.2kb 431.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000003 iW68NzFyTv-CCAg3Rsfj4A 1 1 265 0 494kb 494kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000004 NaIa-gUjShKpEHzNcAxA3w 1 1 234 0 451.7kb 451.7kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000005 lDhzxqtdR8miPnqvsf7HDQ 1 1 271 0 595.8kb 595.8kb
```
That is, for a little while, the first writer (metricbeat version 8.1.2) wrote documents, and then it stopped and was upgraded and replaced by the second writer (metricbeat version 8.1.3). Each of those writers uses a versioned datastream (`metricbeat-8.1.2` and `metricbeat-8.1.3`, respectively).
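(As a side note, the get data stream API shows the backing indices for each of these datastreams, which makes it easy to see which index is currently being written to; as far as I know, the last entry in the `indices` array is the current write index:)

```
GET _data_stream/metricbeat-8.1.2
```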
The problem is easy to see -- notice that we're getting a new empty (0 document) `.ds-metricbeat-8.1.2-[...]` index every minute, and that we'll keep accumulating them forever. ILM doesn't have any special logic around empty indices like this, i.e. empty indices are treated the same as non-empty indices as far as ILM is concerned.
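(A quick way to see the accumulation at a glance, using standard `_cat/indices` columns, is to ask for just the document count and creation date of each backing index:)

```
GET _cat/indices/.ds-metricbeat-8.1.2-*?v&h=index,docs.count,creation.date.string&s=index
```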
In this simple scenario, we know that the `metricbeat-8.1.2` datastream is done now, and can be retired. However, there's no particular point in time where Elasticsearch itself or some individual metricbeat process could know that. I'm using just one metricbeat writer, but I could be running one on each of N hosts. No one writer process in this scenario knows that it is special and should "turn off the lights when it's done".
To further complicate matters, maybe I have a weekly batch process which will run on Sunday evening and write some logs after a long quiet period (and its logs are still being monitored by metricbeat version 8.1.2) -- when it does so, we could end up with more data flowing into the current `metricbeat-8.1.2` write index. Let's call that the "sporadic writer" case. In that case, we'd end up with periods of no data flowing in and the accumulation of empty indices, followed by one or more non-empty indices, and then back to accumulating empty indices again.
ILM doesn't know whether there's a sporadic writer out there or not, and, ignorant of whether more documents will be coming one day, it dutifully executes the policy, rolling over the now-defunct `metricbeat-8.1.2` datastream every minute and leaving a trail of empty `.ds-metricbeat-8.1.2-[...]` indices in its wake.
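To spell out the mechanics a bit (this isn't a request from my reproduction, just a rough sketch of what ILM effectively does on our behalf each poll interval): the rollover it performs is essentially a conditional rollover, and because `max_age` is measured against the age of the write index rather than its contents, the condition keeps matching even when that index is empty:

```
POST metricbeat-8.1.2/_rollover
{
  "conditions": {
    "max_age": "1m",
    "max_size": "50gb"
  }
}
```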
An additional note: my illustration here is datastream-specific, but in broad strokes this issue could also exist in a pre-datastream indexing strategy built around aliases. It would be most excellent if we were able to solve both the datastream- and alias-based versions of this empty index problem (but, reserving a degree of freedom, I don't think the solution must necessarily be precisely the same in both cases).
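For concreteness, here's a sketch of what I mean by the alias-based version; the index name and alias are made up, but the classic rollover-with-a-write-alias setup would hit the same thing, since the same policy attached via `index.lifecycle.rollover_alias` would keep rolling the alias onto new, empty indices once the writer goes away:

```
# Hypothetical pre-datastream setup: a bootstrap index behind a write alias,
# managed by the same ILM policy.
PUT metricbeat-8.1.2-old-style-000001
{
  "settings": {
    "index.lifecycle.name": "metricbeat",
    "index.lifecycle.rollover_alias": "metricbeat-8.1.2-old-style"
  },
  "aliases": {
    "metricbeat-8.1.2-old-style": {
      "is_write_index": true
    }
  }
}
```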