[Elasticsearch] Optimize handling of large cluster payloads #33862
Description
Background
In the past, Metricbeat's Elasticsearch module caused performance issues because it consumed APIs that did not scale well with the size of the monitored cluster. Thanks to a lot of effort by the Elasticsearch team, these APIs now perform much better.
Despite these improvements, we still see issues in ESS, but the problem now appears to be that Metricbeat consumes too much CPU when parsing and processing the large responses that Elasticsearch returns. Generating these responses costs Elasticsearch fairly little, so the CPU usage of Elasticsearch itself is low on the master nodes where this happens. Metricbeat, however, takes up the CPU while processing the response, leaving little for the master node to use, which causes general instability.
A larger fix for this is outlined here: making Metricbeat adapt its resource usage to the available CPU so that it does not crowd out the other processes that are running.
We may want to also consider revisiting elastic/kibana#130575 and seeing if we can get the same data through other APIs which may have smaller responses to process.
Short term improvement
We have gotten feedback that the code in the Elasticsearch module could be optimized to reduce the CPU/Memory usage as well as speed up the processing of responses.
The main culprits seem to be excessive usage of `mapstr` and `schema`, as well as unmarshalling too much of the JSON response (more than we need to generate the event documents). We should also see if it's possible for us to reduce the amount of data we send to Elasticsearch, since that also takes time when the cluster becomes larger.
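To illustrate the idea of unmarshalling only what the event needs, here is a minimal sketch that decodes a response into a narrow Go struct. The payload shape, the type names (`indexStats`, `statsResponse`) and the helper `parseStats` are assumptions made for illustration and are not taken from the module's code. It relies on the documented behavior of `encoding/json`, which ignores object keys that have no matching struct field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexStats declares only the fields the event document needs.
// The shape is illustrative, not the module's real schema.
type indexStats struct {
	Primaries struct {
		Docs struct {
			Count int64 `json:"count"`
		} `json:"docs"`
		Store struct {
			SizeInBytes int64 `json:"size_in_bytes"`
		} `json:"store"`
	} `json:"primaries"`
}

// statsResponse is the top-level wrapper (hypothetical name).
type statsResponse struct {
	Indices map[string]indexStats `json:"indices"`
}

// parseStats decodes the raw body into the narrow struct. Keys without a
// matching field (for example "_shards" or "segments" below) are skipped
// by encoding/json instead of being turned into nested maps.
func parseStats(content []byte) (statsResponse, error) {
	var resp statsResponse
	err := json.Unmarshal(content, &resp)
	return resp, err
}

func main() {
	body := []byte(`{"_shards":{"total":2},"indices":{"logs-1":{"primaries":{"docs":{"count":42,"deleted":1},"store":{"size_in_bytes":1024},"segments":{"count":7}}}}}`)
	resp, err := parseStats(body)
	if err != nil {
		panic(err)
	}
	for name, idx := range resp.Indices {
		fmt.Println(name, idx.Primaries.Docs.Count, idx.Primaries.Store.SizeInBytes)
	}
}
```

Whether this saves a meaningful amount of CPU on real cluster payloads would still need to be confirmed with profiling.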
Development tips
Metricbeat has `cpuprofile` and `memprofile` as flags you can use to enable resource profiling.
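When iterating on a single mapping function, it can also be handy to profile it in isolation. The sketch below uses the standard `runtime/pprof` package; the helper `profileCPU` and the output file name are hypothetical and not part of Metricbeat.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// profileCPU runs work while a CPU profile is written to path.
// Hypothetical helper for local experiments only.
func profileCPU(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run last-in first-out, so the profile is
	// stopped and flushed before the file is closed.
	defer pprof.StopCPUProfile()

	work()
	return nil
}

func main() {
	sum := 0
	err := profileCPU("cpu.prof", func() {
		// Stand-in workload; replace with the code under investigation.
		for i := 0; i < 50_000_000; i++ {
			sum += i % 7
		}
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Println("wrote cpu.prof, checksum", sum)
}
```

The resulting file can be inspected with `go tool pprof cpu.prof`.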
AC
- Usage of `mapstr` is eliminated
- Usage of `schema` is replaced with a hard-coded Go struct that can be used for JSON parsing but only for the exact data we need
- Documents are trimmed to only send fields that are indexed
- A noticeable improvement in CPU usage is measured for large clusters
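
As a rough way to check progress against these criteria before profiling a real cluster, the following sketch compares heap allocations per decode for a generic map and for a narrow struct, using `testing.AllocsPerRun` from the standard library. The synthetic payload, the `narrow` type and all helper names are assumptions made for illustration. The generic `map[string]interface{}` decode is only a stand-in for map-based processing, and allocation counts are a proxy rather than a CPU measurement.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"testing"
)

// narrow keeps only the one field the event would need (hypothetical shape).
type narrow struct {
	Indices map[string]struct {
		UUID string `json:"uuid"`
	} `json:"indices"`
}

// decodeGeneric materializes every key and value of the payload.
func decodeGeneric(b []byte) error {
	var m map[string]interface{}
	return json.Unmarshal(b, &m)
}

// decodeNarrow keeps only the fields declared on narrow.
func decodeNarrow(b []byte) error {
	var n narrow
	return json.Unmarshal(b, &n)
}

// syntheticPayload builds a response with n indices, each carrying
// extra fields that the event never uses.
func syntheticPayload(n int) []byte {
	var sb strings.Builder
	sb.WriteString(`{"indices":{`)
	for i := 0; i < n; i++ {
		if i > 0 {
			sb.WriteByte(',')
		}
		fmt.Fprintf(&sb, `"idx-%d":{"uuid":"u%d","settings":{"a":1,"b":[1,2,3]},"mappings":{"x":{"type":"keyword"}}}`, i, i)
	}
	sb.WriteString(`}}`)
	return []byte(sb.String())
}

// allocsPerDecode reports the average number of heap allocations per call.
func allocsPerDecode(decode func([]byte) error, payload []byte) float64 {
	return testing.AllocsPerRun(50, func() {
		if err := decode(payload); err != nil {
			panic(err)
		}
	})
}

func main() {
	p := syntheticPayload(1000)
	fmt.Printf("generic: %.0f allocs/op\n", allocsPerDecode(decodeGeneric, p))
	fmt.Printf("narrow:  %.0f allocs/op\n", allocsPerDecode(decodeNarrow, p))
}
```

CPU profiles captured on a large cluster remain the measurement that matters for the last criterion.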