
Snapshot creations have huge heap footprint after abrupt full-cluster restart #89952

Closed
@DaveCTurner

Description


If the cluster shuts down while the root repository data blob is being updated, then on startup Elasticsearch sets BlobStoreRepository#uncleanStart, which causes it to skip caching RepositoryData in favour of reading the blob afresh from the repository each time it is needed.

If on startup ILM finds indices waiting to move to the searchable snapshot phase then it will attempt to create a snapshot of each such index. Each create-snapshot task holds a reference to the RepositoryData it captured when the task was submitted.

The trouble is that each RepositoryData instance could be tens of MiB in size, and while uncleanStart is set there is no sharing between these instances. In the case I saw, RepositoryData was ~58MiB and there were 17 create-snapshot tasks in the queue, so these tasks alone consumed almost 1GiB of heap. There were also 6 snapshot_meta threads all busy loading more copies of RepositoryData, with a total of 530MiB of local state.

Relates #77466


Workaround

Clearing the uncleanStart flag should restore the caching (and hence sharing) of RepositoryData:

  1. Disable ILM (needs to happen immediately after startup before it triggers any snapshots).
  2. Take a single snapshot manually to complete the pending write of the root metadata blob. The content of the snapshot doesn't matter, so you may as well restrict it to just a single small index.
  3. When that snapshot completes, it is safe to enable ILM again.
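Assuming access to the cluster's REST API on localhost, the steps above might look like this from the command line. The repository name `my-repo` and index name `small-index` are placeholders, not names from this issue:

```shell
# 1. Disable ILM immediately after startup, before it triggers any snapshots.
curl -X POST "localhost:9200/_ilm/stop"

# Optionally confirm ILM has actually stopped before proceeding.
curl -X GET "localhost:9200/_ilm/status"

# 2. Take a single manual snapshot to complete the pending write of the root
#    metadata blob. The snapshot's content doesn't matter, so restrict it to
#    one small index ("my-repo" and "small-index" are placeholder names).
curl -X PUT "localhost:9200/_snapshot/my-repo/unclean-start-workaround?wait_for_completion=true" \
  -H "Content-Type: application/json" \
  -d '{"indices": "small-index"}'

# 3. Once that snapshot completes successfully, re-enable ILM.
curl -X POST "localhost:9200/_ilm/start"
```

`wait_for_completion=true` makes the snapshot request block until the snapshot finishes, so step 3 can safely follow it in a script.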
