Description
Opened on Apr 23, 2020
At the moment, the Elasticsearch stats methods used by Metricbeat's elasticsearch module don't have any internal timeouts, which means that Elasticsearch keeps trying to complete a request until it gets responses from all nodes or the unresponsive nodes die. We have recently observed some cases (elastic/elasticsearch#50241, for example) where a data node in a small cluster was responding very slowly but didn't disconnect from the cluster. Meanwhile, Metricbeat was sending requests to Elasticsearch every 10 seconds with a 10-second response timeout (the default settings). Since the timed-out requests were still pending on the Elasticsearch side, we were basically adding 6 in-flight requests per minute. The in-flight stats requests eventually accumulated on the master node and caused it to crash with an OOM error. We are addressing this issue on the Elasticsearch side in elastic/elasticsearch#55550, but I was hoping we could improve Metricbeat's behavior as well by introducing an exponential backoff for the timeout value.