Description
We are running an Elasticsearch cluster with 27 nodes and we create about 200 new indices per day.
We keep data for the last 5 days, so in total we have about 1000 indices. Every day we take a snapshot of yesterday's 200 indices to S3 (so, no incremental backups).
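For reference, our daily snapshot request looks roughly like the sketch below (Python only for illustration; the repository name, snapshot name and index pattern are placeholders, not our real values):

```python
import requests

# Rough sketch of the daily snapshot request we issue against the cluster.
# "s3_backup", the snapshot name and the index pattern are placeholders.
ES = "http://localhost:9200"

resp = requests.put(
    f"{ES}/_snapshot/s3_backup/snapshot-2015.01.20",
    params={"wait_for_completion": "false"},
    json={"indices": "logs-2015.01.20-*"},  # only yesterday's ~200 indices
)
print(resp.status_code, resp.json())
```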
After updating to 1.4.2 we've noticed that when we try to register a snapshot repository, the master node runs out of heap space and the cluster becomes unresponsive. I'm attaching a screenshot showing the heap space usage on the master node after making a PUT request to register a snapshot repository.
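The registration request that triggers this is essentially the following (again a sketch; bucket and region are placeholders for our real settings):

```python
import requests

# Rough sketch of the repository registration request that triggers the
# heap blow-up. Bucket and region values are placeholders.
ES = "http://localhost:9200"

resp = requests.put(
    f"{ES}/_snapshot/s3_backup",
    json={
        "type": "s3",
        "settings": {
            "bucket": "our-snapshot-bucket",
            "region": "us-east-1",
        },
    },
)
print(resp.status_code, resp.json())
```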
After inspecting a heap dump taken from the master node, we realised that it is trying to list the contents of our S3 repository. At the moment we keep all previous snapshots in that repository, which makes it impossible (in terms of time and resources) to list everything. I'm attaching a screenshot from Memory Analyser where you can see that 51% of the heap space (800 MB) is occupied by a Map holding 5,000,000 entries, with PlainBlobMetadata objects as values and S3 locations as keys.
We recently updated to 1.4.2 (from 1.1.2) and we don't think we saw similar behaviour (i.e. listing the S3 repository contents) in 1.1.2. Could this be an issue that needs further investigation on your side, or could the way we are currently using the snapshot service be causing the problem?
Any suggestions/feedback would be welcome.
Thank you