Description
Elasticsearch version (bin/elasticsearch --version): ECK 1.1.2, ELK 7.7.1
Plugins installed: [repository-azure]
JVM version (java -version): Default in the ECK Docker image
OS version (uname -a if on a Unix-like system): Default in the ECK Docker image
Description of the problem including expected versus actual behavior:
We run 8 ELK 7.7.1 clusters on ECK 1.1.2 on Azure AKS. They vary in size: heap from 2 to 12 GB (heap is always 50% of the available memory) and data from 20 GB to 700 GB. All of them are usually healthy, with roughly 50% or more of heap and disk free.
We have automatic snapshots set up every ~6 hours on all of them. Intermittently, about once every few snapshots and on random clusters, a snapshot causes heap usage to suddenly jump to 100%, which triggers an OOM exception and crashes at least one of the cluster nodes (all clusters have 2 nodes).
It happens on both big and small clusters and is not predictable.
The issue seems to have started after the upgrade to ELK 7.7.1. Previously we ran ECK 1.0 with ELK 7.6.1 and did not notice it.
Steps to reproduce:
- Set up an ELK 7.7.1 cluster with ECK 1.1.2 on Azure AKS, with the repository-azure plugin installed
- Configure automatic snapshots, for instance every hour
- Watch for sudden spikes in heap memory usage that occur exactly when a snapshot starts
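One way to observe the last step is to sample heap usage together with snapshot activity. The following is only a minimal sketch, not what we actually ran: it assumes the cluster is reachable without authentication at the hypothetical `ES_URL`, and it reads the `GET _nodes/stats/jvm` and `GET _snapshot/_status` responses. The helper names and the 90% threshold are illustrative.

```python
import json
import time
import urllib.request

# Assumption: adjust host, TLS and credentials for your own ECK cluster.
ES_URL = "http://localhost:9200"


def heap_by_node(nodes_stats):
    """Map node name -> heap_used_percent from a GET _nodes/stats/jvm body."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in nodes_stats.get("nodes", {}).values()
    }


def snapshot_running(snapshot_status):
    """True if a GET _snapshot/_status body lists an in-progress snapshot."""
    return len(snapshot_status.get("snapshots", [])) > 0


def nodes_over(heap, threshold=90):
    """Names of nodes whose heap usage is at or above threshold, sorted."""
    return sorted(name for name, pct in heap.items() if pct >= threshold)


def get_json(path):
    """Fetch a path from the cluster and decode the JSON response."""
    with urllib.request.urlopen(ES_URL + path) as resp:
        return json.load(resp)


def watch(interval_seconds=5, iterations=720):
    """Print one line per sample: snapshot activity, hot nodes, heap per node."""
    for _ in range(iterations):
        heap = heap_by_node(get_json("/_nodes/stats/jvm"))
        running = snapshot_running(get_json("/_snapshot/_status"))
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "snapshot_running=%s" % running, "hot=%s" % nodes_over(heap), heap)
        time.sleep(interval_seconds)
```

Calling `watch()` shortly before a scheduled snapshot and leaving it running until the snapshot finishes should show whether heap climbs only while `snapshot_running` is true.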
Provide logs (if relevant):