Description
If the cluster shuts down while updating the root repository data blob then it will set BlobStoreRepository#uncleanStart on startup, which causes Elasticsearch to skip the caching of RepositoryData in favour of reading the blob afresh from the repository each time it's needed.
If on startup ILM finds indices waiting to move to the searchable snapshot phase then it will attempt to create snapshots of each such index. Each create-snapshot task holds a reference to the RepositoryData it captured when the task was submitted.
The trouble is that each RepositoryData instance could be tens of MBs in size, and while uncleanStart is set there is no sharing between these instances. In the case I saw, RepositoryData was ~58MiB and there were 17 create-snapshot tasks in the queue, so these tasks alone consumed almost 1GiB of heap. There were also 6 snapshot_meta threads all busy loading more copies of RepositoryData, with a total of 530MiB of local state.
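
As a rough check of the numbers above, the following sketch multiplies the reported instance size by the number of queued tasks. The 58MiB and 17 figures come from this report; the "shared" figure is an idealised assumption that every task would reference one cached instance, not a measurement.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Figures taken from the report above.
repository_data_bytes = 58 * MIB  # approximate size of one RepositoryData instance
queued_tasks = 17                 # create-snapshot tasks, each holding its own copy

# While uncleanStart is set, nothing is shared, so the cost scales with the queue.
unshared_bytes = queued_tasks * repository_data_bytes

# Assumption for comparison only: with caching, all tasks share a single instance.
shared_bytes = repository_data_bytes

print(f"unshared: {unshared_bytes / MIB:.0f} MiB ({unshared_bytes / GIB:.2f} GiB)")
print(f"shared:   {shared_bytes / MIB:.0f} MiB")
```

This excludes the further 530MiB held by the snapshot_meta threads.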
Relates #77466
Workaround
Clearing the uncleanStart flag should restore the caching (and hence sharing) of RepositoryData again:
- Disable ILM (needs to happen immediately after startup before it triggers any snapshots).
- Take a single snapshot manually to complete the pending write of the root metadata blob. The content of the snapshot doesn't matter, so you may as well restrict it to just a single small index.
- When that snapshot completes, it is safe to enable ILM again.
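
The steps above might map onto the REST API roughly as follows. This is a minimal sketch, not a tested procedure: the repository, snapshot and index names are hypothetical placeholders, the base URL assumes an unsecured local node, and a real cluster will most likely need authentication and TLS. Note also that stopping ILM is asynchronous, so it may be worth checking GET /_ilm/status before taking the snapshot.

```python
import json
import urllib.request


def workaround_requests(repository, snapshot, index):
    """Return the (method, path, body) triples for the workaround, in order."""
    return [
        # 1. Stop ILM so it does not trigger any further snapshots.
        ("POST", "/_ilm/stop", None),
        # 2. Take one small snapshot and wait for it to complete.
        (
            "PUT",
            f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true",
            {"indices": index, "include_global_state": False},
        ),
        # 3. Start ILM again once the snapshot has completed.
        ("POST", "/_ilm/start", None),
    ]


def send(base_url, method, path, body=None):
    """Send one request and return the decoded JSON response (no auth: an assumption)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical names; substitute the real repository and a small existing index.
plan = workaround_requests("my-repository", "unclean-start-fix", "my-small-index")
for method, path, body in plan:
    print(method, path, body)
```

The block only prints the planned requests. Passing each triple to send together with the node's base URL (for example http://localhost:9200) would execute them in order.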