Description
This has been brought up a few times in different forms, see #46161, #73349, #83039, #85054.
Any ILM policy with a `max_age` associated with the `rollover` action could trigger this scenario, but in order to talk about something concrete, I'll use metricbeat as an example (double emphasizing, though: this isn't unique to metricbeat, it's just the nature of the way `rollover` currently works with a `max_age`).
With a test 8.1.3 Elasticsearch cluster, I ran `metricbeat-8.1.2` for a few seconds and then stopped it, and then ran `metricbeat-8.1.3` for a bit longer. The default metricbeat policy has rollover with `"max_age" : "30d"` (30 days), but in order to illustrate this problem better, I've set that to `"1m"` (1 minute) instead, and lowered the ILM poll interval so that the policy is evaluated every few seconds:
```
PUT /_cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m"
          }
        }
      }
    }
  }
}
```
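(Not part of the reproduction per se, but if you want to watch ILM evaluating these conditions as it goes, the ILM explain API works nicely; the `filter_path` below is just to trim the response down to a few interesting fields:)

```
GET .ds-metricbeat-*/_ilm/explain?filter_path=indices.*.age,indices.*.action,indices.*.step
```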
After a few minutes, my cluster looks like this:
```
GET _cat/indices/.ds-metricbeat-*?s=index

yellow open .ds-metricbeat-8.1.2-2022.04.26-000001 GBqDAprYSl2NmFzi81n9Ug 1 1 1134 0 652.2kb 652.2kb
yellow open .ds-metricbeat-8.1.2-2022.04.26-000002 Ybd4SCiWT0-7W0v9zKPR4A 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000003 3_9OmFkOSKaEfF_J-_D9TA 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000004 4olQItwcTtCOWrBotcqoLw 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000005 N9_gYkcORWSVfacUwnDegw 1 1 0 0 225b 225b
yellow open .ds-metricbeat-8.1.3-2022.04.26-000001 kWW-N_bfRbO0vMR4z3F72g 1 1 862 0 639.2kb 639.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000002 qzu-L-zZQqqm-6GQAZMtgA 1 1 235 0 431.2kb 431.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000003 iW68NzFyTv-CCAg3Rsfj4A 1 1 265 0 494kb 494kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000004 NaIa-gUjShKpEHzNcAxA3w 1 1 234 0 451.7kb 451.7kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000005 lDhzxqtdR8miPnqvsf7HDQ 1 1 271 0 595.8kb 595.8kb
```
That is, for a little while, the first writer (metricbeat version 8.1.2) wrote documents, and then it stopped and was upgraded and replaced by the second writer (metricbeat version 8.1.3). Each of those writers uses a versioned datastream (`metricbeat-8.1.2` and `metricbeat-8.1.3`, respectively).
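(As a side note, the get data stream API shows the backing indices for each of these datastreams, which makes it easy to see which index is currently being written to; as far as I know, the last entry in the `indices` array is the current write index:)

```
GET _data_stream/metricbeat-8.1.2
```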
The problem is easy to see -- notice that we're getting a new empty (0 document) `.ds-metricbeat-8.1.2-[...]` index every minute, and that we'll keep accumulating them forever. ILM doesn't have any special logic around empty indices like this, i.e. empty indices are treated the same as non-empty indices as far as ILM is concerned.
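(A quick way to see the accumulation at a glance, using standard `_cat/indices` columns, is to ask for just the document count and creation date of each backing index:)

```
GET _cat/indices/.ds-metricbeat-8.1.2-*?v&h=index,docs.count,creation.date.string&s=index
```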
In this simple scenario, we know that the `metricbeat-8.1.2` datastream is done now, and can be retired. However, there's no particular point in time where Elasticsearch itself or some individual metricbeat process could know that. I'm using just one metricbeat writer, but I could be running one on each of N hosts. No one writer process in this scenario knows that it is special and should "turn off the lights when it's done".
To further complicate matters, maybe I have a weekly batch process which will run on Sunday evening and write some logs after a long quiet period (and its logs are still being monitored by metricbeat version 8.1.2) -- when it does so, we could end up with more data flowing into the current `metricbeat-8.1.2` write index. Let's call that the "sporadic writer" case. In that case, we'd end up with periods of no data flowing in and the accumulation of empty indices, followed by one or more non-empty indices, and then back to accumulating empty indices again.
ILM doesn't know whether there's a sporadic writer out there or not, and, ignorant of whether more documents will be coming one day, it dutifully executes the policy, rolling over the now-defunct `metricbeat-8.1.2` datastream every minute and leaving a trail of empty `.ds-metricbeat-8.1.2-[...]` indices in its wake.
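To spell out the mechanics a bit (this isn't a request from my reproduction, just a rough sketch of what ILM effectively does on our behalf each poll interval): the rollover it performs is essentially a conditional rollover, and because `max_age` is measured against the age of the write index rather than its contents, the condition keeps matching even when that index is empty:

```
POST metricbeat-8.1.2/_rollover
{
  "conditions": {
    "max_age": "1m",
    "max_size": "50gb"
  }
}
```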
An additional note: my illustration here is datastream-specific, but in broad strokes this issue could also exist in a pre-datastream indexing strategy built around aliases. It would be most excellent if we were able to solve both the datastream- and alias-based versions of this empty index problem (but, reserving a degree of freedom, I don't think the solution must necessarily be precisely the same in both cases).
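For concreteness, here's a sketch of what I mean by the alias-based version; the index name and alias are made up, but the classic rollover-with-a-write-alias setup would hit the same thing, since the same policy attached via `index.lifecycle.rollover_alias` would keep rolling the alias onto new, empty indices once the writer goes away:

```
# Hypothetical pre-datastream setup: a bootstrap index behind a write alias,
# managed by the same ILM policy.
PUT metricbeat-8.1.2-old-style-000001
{
  "settings": {
    "index.lifecycle.name": "metricbeat",
    "index.lifecycle.rollover_alias": "metricbeat-8.1.2-old-style"
  },
  "aliases": {
    "metricbeat-8.1.2-old-style": {
      "is_write_index": true
    }
  }
}
```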