
[exporter/prometheusremotewrite] Excessive buffer allocations in prometheus remote write exporter with large batch sizes #34269

Closed
ben-childs-docusign opened this issue Jul 26, 2024 · 1 comment · Fixed by #34271
Labels: bug, exporter/prometheusremotewrite, needs triage

Comments


ben-childs-docusign commented Jul 26, 2024

Component(s)

exporter/prometheusremotewrite

What happened?

Description

We have observed high memory usage in our OpenTelemetry Collectors, which are set up to pull data from a Kafka cluster and push it to a Prometheus remote write endpoint. To handle the volume of data we have been increasing the batch processor's batch size beyond the default (8192) to 50,000 or more, because the Prometheus remote write exporter only parallelizes sends within a batch and never parallelizes across batches.

While a 50k batch size works most of the time, we have seen cases where this configuration falls behind, so we have also tried increasing it to 100k/200k or more. When we did this, we noticed a dramatic increase in the amount of memory used by our collectors.

With a pprof profile we discovered that a huge portion of memory allocations was occurring in a single function, prometheusremotewriteexporter.batchTimeSeries:
[pprof heap profile screenshot showing allocations concentrated in batchTimeSeries]

Reviewing this function, it allocates many large buffers whose capacity is set to the full size of the batch, even though the individual requests being batched are expected to be much smaller. We patched this locally to allocate smaller buffers and observed a huge reduction in memory usage: in our test environment usage dropped from ~80 GB to ~30-40 GB across 11 pods.

[memory usage before and after the local patch]
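To make the allocation pattern concrete, here is a simplified, self-contained Go sketch. The types and function names (timeSeries, batchOversized, batchRightSized, the starting capacity of 64) are my own stand-ins, not the exporter's actual batchTimeSeries code; the real function batches prompb.TimeSeries against max_batch_size_bytes, but the capacity behavior is the same shape:

```go
package main

import "fmt"

// timeSeries is a stand-in for prompb.TimeSeries; this is an illustration of
// the allocation pattern, not the exporter's actual code.
type timeSeries struct{ sizeBytes int }

// batchOversized mirrors the pattern we observed: every per-request buffer is
// allocated with capacity for the entire batch (len(all)), even though the
// byte limit cuts each request off long before it reaches that size.
func batchOversized(all []timeSeries, maxBatchBytes int) [][]timeSeries {
	var requests [][]timeSeries
	current := make([]timeSeries, 0, len(all)) // capacity = full batch
	currentBytes := 0
	for _, ts := range all {
		if currentBytes+ts.sizeBytes > maxBatchBytes && len(current) > 0 {
			requests = append(requests, current)
			current = make([]timeSeries, 0, len(all)) // another full-size buffer
			currentBytes = 0
		}
		current = append(current, ts)
		currentBytes += ts.sizeBytes
	}
	if len(current) > 0 {
		requests = append(requests, current)
	}
	return requests
}

// batchRightSized shows the shape of our local patch: start small and size the
// next buffer based on how large the previous request actually got.
func batchRightSized(all []timeSeries, maxBatchBytes int) [][]timeSeries {
	var requests [][]timeSeries
	nextCap := 64 // hypothetical starting capacity
	current := make([]timeSeries, 0, nextCap)
	currentBytes := 0
	for _, ts := range all {
		if currentBytes+ts.sizeBytes > maxBatchBytes && len(current) > 0 {
			requests = append(requests, current)
			nextCap = len(current) // size the next buffer like the last one
			current = make([]timeSeries, 0, nextCap)
			currentBytes = 0
		}
		current = append(current, ts)
		currentBytes += ts.sizeBytes
	}
	if len(current) > 0 {
		requests = append(requests, current)
	}
	return requests
}

func main() {
	series := make([]timeSeries, 200_000)
	for i := range series {
		series[i] = timeSeries{sizeBytes: 100}
	}
	// Both variants split the batch into the same requests; only the reserved
	// slice capacities differ, which is what shows up in the heap profile.
	fmt.Println(len(batchOversized(series, 10_000_000)), len(batchRightSized(series, 10_000_000)))
}
```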

I will prepare a PR with the proposed patch for further discussion.

Steps to Reproduce

Configure a metrics pipeline with the Prometheus remote write exporter and a large batch size (e.g. 100k data points or more). Send large volumes of data through the pipeline and observe the memory allocations in pprof, for example as sketched below.
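One way to observe the allocations is the collector's pprof extension. A minimal config sketch, assuming the extension is included in your collector build and using its default 1777 endpoint:

```yaml
extensions:
  pprof:
    endpoint: localhost:1777   # default endpoint for the pprof extension

service:
  extensions: [pprof]
  # pipelines as configured below; the heap profile can then be inspected with
  # `go tool pprof http://localhost:1777/debug/pprof/heap`
```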

Expected Result

Actual Result

Collector version

v0.101.0

Environment information

Environment

OS: Ubuntu 20.04
Compiler(if manually compiled): go 1.21.12

OpenTelemetry Collector configuration

      prometheusremotewrite:
        auth:
          authenticator: bearertokenauth
        endpoint: [SCRUBBED]
        max_batch_size_bytes: 10000000
        remote_write_queue:
          enabled: true
          num_consumers: 25
        resource_to_telemetry_conversion:
          enabled: true
        timeout: 30s
      batch:
        send_batch_size: 200000
        timeout: 5s

Log output

No response

Additional context

No response

ben-childs-docusign added the bug and needs triage labels on Jul 26, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

ben-childs-docusign changed the title Excessive buffer allocations in prometheus remote write exporter with large batch sizes → [exporter/prometheusremotewrite] Excessive buffer allocations in prometheus remote write exporter with large batch sizes on Jul 26, 2024