Elastic Snapshots cause out of memory exceptions #60173

Closed
@ecovadis-devops

Elasticsearch version (bin/elasticsearch --version): ECK 1.1.2 ELK 7.7.1

Plugins installed: [repository-azure]

JVM version (java -version): Default by ECK docker image

OS version (uname -a if on a Unix-like system): Default by ECK docker image

Description of the problem including expected versus actual behavior:
We run 8 ELK 7.7.1 clusters on ECK 1.1.2 on Azure AKS. They vary in size, with heap sizes from 2 to 12 GB (heap is always 50% of the available memory) and data sizes from 20 GB to 700 GB. All of them are usually healthy, with more than roughly 50% of heap and disk space free.

We have automatic snapshots set up to run every ~6 hours on all of them. Intermittently, about once every few snapshots and on random instances, a snapshot causes heap usage to jump suddenly to 100%, which triggers an OOM exception and crashes at least one of the cluster nodes (all clusters have 2 nodes).
It happens on both big and small instances and is not really predictable.
The issue seems to have started after the upgrade to ELK 7.7.1; previously we ran ECK 1.0 with ELK 7.6.1 and did not notice it.
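
One way to capture the heap jump described above is to poll the node stats API while a snapshot runs. The following is only a minimal sketch, not what we actually run: it assumes the standard `GET /_nodes/stats/jvm` response shape, and the base URL, the 90% threshold, and the helper names are hypothetical (an ECK cluster would also need HTTPS and credentials).

```python
import json
import urllib.request


def heap_used_percent(nodes_stats):
    """Map node name -> JVM heap used percent from a _nodes/stats/jvm response."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in nodes_stats["nodes"].values()
    }


def nodes_over(nodes_stats, threshold=90):
    """Return the sorted names of nodes whose heap usage is at or above threshold."""
    usage = heap_used_percent(nodes_stats)
    return sorted(name for name, pct in usage.items() if pct >= threshold)


def fetch_nodes_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes. base_url is an assumption for illustration."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm") as resp:
        return json.load(resp)
```

Calling `nodes_over(fetch_nodes_stats())` every few seconds around the scheduled snapshot time should show which node, if any, approaches the heap limit.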

Steps to reproduce:

  1. Set up an ELK 7.7.1 cluster with ECK 1.1.2 on Azure AKS, with the repository-azure plugin installed
  2. Configure automatic snapshots, for instance every hour
  3. Watch for sudden spikes in heap memory usage that occur exactly when a snapshot starts
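
For step 2, hourly snapshots could be configured through a snapshot lifecycle management (SLM) policy. This is a sketch under assumptions: the policy id `hourly-snapshots` and the repository name `azure_backup` are hypothetical, and this is not necessarily how the snapshots in this report were scheduled. It only builds the request; sending it to the cluster is left out.

```python
import json


def hourly_slm_policy(repository, indices=("*",)):
    """Build an SLM policy body that snapshots the given indices every hour."""
    return {
        # Elasticsearch cron: second minute hour day-of-month month day-of-week
        "schedule": "0 0 * * * ?",
        # Date-math snapshot name, resolved by Elasticsearch at snapshot time
        "name": "<hourly-snap-{now/d}>",
        "repository": repository,
        "config": {"indices": list(indices)},
    }


def slm_request(policy_id, repository):
    """Return (method, path, body) for creating or updating the policy."""
    body = json.dumps(hourly_slm_policy(repository))
    return "PUT", f"/_slm/policy/{policy_id}", body


if __name__ == "__main__":
    # Hypothetical policy id and repository name
    method, path, body = slm_request("hourly-snapshots", "azure_backup")
    print(method, path)
    print(body)
```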

Provide logs (if relevant):

Metadata

Labels

:Distributed Coordination/Snapshot/Restore - Anything directly related to the `_snapshot/*` APIs
>bug
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
needs:triage - Requires assignment of a team area label
