
[Elasticsearch] Optimize handling of large cluster payloads #33862

Open
@miltonhultgren

Description

Background

In the past, Metricbeat's Elasticsearch module caused performance problems because it consumed APIs that didn't scale well with the size of the monitored cluster. Thanks to a lot of effort from the Elasticsearch team, those APIs now perform much better.

Despite these improvements we still see issues in ESS, but now the problem appears to be that Metricbeat consumes too much CPU when parsing and processing the large responses that Elasticsearch returns. Generating these responses is fairly cheap for Elasticsearch, so if you look at the CPU usage of Elasticsearch itself it is low (on the master nodes where this happens). Yet we see performance issues because Metricbeat takes up the CPU processing the response, leaving little CPU for the master node itself, which causes general instability.

A larger fix for this is outlined here: make Metricbeat adapt its resource usage to the available CPU, so that it does not crowd out the other processes that are running.

We may also want to revisit elastic/kibana#130575 and see whether we can get the same data through other APIs whose responses are smaller and cheaper to process.

Short term improvement

We have received feedback that the code in the Elasticsearch module could be optimized to reduce CPU/memory usage as well as speed up the processing of responses.

The main culprits seem to be excessive use of mapstr and schema, as well as unmarshalling more of the JSON response than we need to generate the event documents. We should also see whether we can reduce the amount of data we send to Elasticsearch, since that also takes time as the cluster grows.

Development tips

Metricbeat has cpuprofile and memprofile flags you can use to enable resource profiling.
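A rough sketch of how that profiling session could look (the file names are arbitrary examples):

```shell
# Run Metricbeat with CPU and memory profiling enabled.
./metricbeat -e --cpuprofile cpu.pprof --memprofile mem.pprof

# After stopping Metricbeat, inspect the hottest functions:
go tool pprof -top cpu.pprof
go tool pprof -top mem.pprof
```

Comparing the top entries before and after a change should make it clear whether mapstr/schema processing still dominates.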

AC

  • Usage of mapstr is eliminated
  • Usage of schema is replaced with a hard-coded Go struct that is used for JSON parsing, covering only the exact data we need
  • Documents are trimmed to only send fields that are indexed
  • A noticeable improvement in CPU usage is measured for large clusters
