
Snapshot creations have huge heap footprint after abrupt full-cluster restart #89952

Closed
@DaveCTurner

Description


If the cluster shuts down while the root repository data blob is being updated, then on startup Elasticsearch sets BlobStoreRepository#uncleanStart, which causes it to skip caching RepositoryData in favour of reading the blob afresh from the repository each time it is needed.

If on startup ILM finds indices waiting to move to the searchable snapshot phase then it will attempt to create a snapshot of each such index. Each create-snapshot task holds a reference to the RepositoryData it captured when the task was submitted.

The trouble is that each RepositoryData instance could be tens of MiB in size, and while uncleanStart is set there is no sharing between these instances. In the case I saw, RepositoryData was ~58MiB and there were 17 create-snapshot tasks in the queue, so these tasks alone consumed almost 1GiB of heap. There were also 6 snapshot_meta threads all busy loading more copies of RepositoryData, with a total of 530MiB of local state.

Relates #77466


Workaround

Clearing the uncleanStart flag should restore the caching (and hence sharing) of RepositoryData:

  1. Disable ILM (needs to happen immediately after startup before it triggers any snapshots).
  2. Take a single snapshot manually to complete the pending write of the root metadata blob. The content of the snapshot doesn't matter, so you may as well restrict it to just a single small index.
  3. When that snapshot completes, it is safe to enable ILM again.
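Assuming access to the cluster's REST API on localhost, the steps above might look like this from the command line. The repository name `my-repo` and index name `small-index` are placeholders, not names from this issue:

```shell
# 1. Disable ILM immediately after startup, before it triggers any snapshots.
curl -X POST "localhost:9200/_ilm/stop"

# Optionally confirm ILM has actually stopped before proceeding.
curl -X GET "localhost:9200/_ilm/status"

# 2. Take a single manual snapshot to complete the pending write of the root
#    metadata blob. The snapshot's content doesn't matter, so restrict it to
#    one small index ("my-repo" and "small-index" are placeholder names).
curl -X PUT "localhost:9200/_snapshot/my-repo/unclean-start-workaround?wait_for_completion=true" \
  -H "Content-Type: application/json" \
  -d '{"indices": "small-index"}'

# 3. Once that snapshot completes successfully, re-enable ILM.
curl -X POST "localhost:9200/_ilm/start"
```

`wait_for_completion=true` makes the snapshot request block until the snapshot finishes, so step 3 can safely follow it in a script.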
