[Metricbeat] Exponential backoff for http timeout in elasticsearch module #17948

Open

Description

Currently, the Elasticsearch stats APIs used by Metricbeat's elasticsearch module have no internal timeouts, so Elasticsearch keeps working on a request until it gets responses from all nodes or the unresponsive nodes die. We have recently observed cases (elastic/elasticsearch#50241, for example) where a data node in a small cluster responded very slowly but did not disconnect from the cluster. Meanwhile, Metricbeat kept sending requests to Elasticsearch every 10 seconds with a 10-second response timeout (the default settings). In effect, we were adding 6 in-flight requests per minute. These in-flight stats requests eventually accumulated on the master node and caused it to crash with an OOM error. We are addressing this on the Elasticsearch side in elastic/elasticsearch#55550, but I was hoping we could also improve Metricbeat's behavior by introducing an exponential backoff for the timeout value.
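
To make the proposal concrete, here is a minimal Python sketch of one possible reading of "exponential backoff for the timeout value". It is not Metricbeat code: the `TimeoutBackoff` class, its doubling factor, and the 160-second cap are all hypothetical names and values chosen for illustration. Only the 10-second base comes from the defaults described above. The sketch assumes the next request is sent only once the previous one has timed out.

```python
from dataclasses import dataclass


@dataclass
class TimeoutBackoff:
    """Hypothetical helper: grows the timeout after each consecutive timeout."""

    base: float = 10.0      # seconds; the default timeout mentioned in this issue
    factor: float = 2.0     # assumed multiplier, not an existing setting
    maximum: float = 160.0  # assumed upper bound, not an existing setting
    failures: int = 0       # consecutive timeouts seen so far

    def current(self) -> float:
        """Timeout to use for the next request, capped at `maximum`."""
        return min(self.base * self.factor ** self.failures, self.maximum)

    def on_timeout(self) -> float:
        """Record a timed-out request and return the new, longer timeout."""
        self.failures += 1
        return self.current()

    def on_success(self) -> float:
        """Record a successful request and fall back to the base timeout."""
        self.failures = 0
        return self.current()


def requests_issued(window: float, backoff: TimeoutBackoff) -> int:
    """Count requests started within `window` seconds when every request
    times out and the next one starts as soon as the previous one gives up."""
    elapsed, count = 0.0, 0
    while elapsed < window:
        count += 1
        elapsed += backoff.current()
        backoff.on_timeout()
    return count


if __name__ == "__main__":
    # factor=1.0 models today's fixed 10-second timeout.
    print(requests_issued(60, TimeoutBackoff(factor=1.0)))
    print(requests_issued(60, TimeoutBackoff()))
```

Under these assumptions, a fixed timeout starts 6 requests in a 60-second window against a node that never answers in time, which matches the figure above, while the doubling schedule starts 3. Resetting on the first success keeps normal collection at the base interval once the cluster recovers.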
