[exporter/prometheusremotewrite] Excessive buffer allocations in prometheus remote write exporter with large batch sizes
Component(s)
exporter/prometheusremotewrite
What happened?
Description
We have observed high memory usage in our OpenTelemetry Collectors, which are set up to pull data from a Kafka cluster and push it to a Prometheus remote write endpoint. To handle the volume of data, we have been increasing the batch processor's batch size from the default (8192) to 50,000 or more, because the Prometheus remote write exporter only parallelizes sends within a batch and never parallelizes across batches.
While a 50k batch size works most of the time, we have seen cases where this configuration falls behind, so we have also tried increasing it to 100k/200k or more. When we did, we noticed a dramatic increase in the amount of memory used by our collectors.
With a pprof profile we discovered that a large portion of the memory allocations was occurring in a single function, "prometheusremotewriteexporter.batchTimeSeries".
Reviewing this function, we found that it allocates many large buffers with capacity set to the full size of the batch, even though the individual requests being batched are expected to be much smaller. We patched it locally to allocate smaller buffers and observed a large reduction in memory usage: in our test environment it went from ~80 GB to ~30-40 GB across 11 pods.
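To illustrate the pattern, here is a simplified sketch (not the exporter's exact batchTimeSeries code; it only assumes the prompb types from github.com/prometheus/prometheus/prompb): when every per-request buffer is pre-sized to the full batch length, each of the many requests produced from a 100k-200k point batch reserves capacity for the whole batch, whereas starting from a small capacity and letting append grow the slice keeps allocations proportional to the size of each request.

```go
package sketch

import "github.com/prometheus/prometheus/prompb"

// splitIntoRequests splits a slice of time series into remote write requests
// no larger than maxBatchByteSize. It is a simplified sketch of the batching
// logic, not the exporter's batchTimeSeries implementation.
func splitIntoRequests(series []prompb.TimeSeries, maxBatchByteSize int) []*prompb.WriteRequest {
	var requests []*prompb.WriteRequest

	// Problematic pattern: pre-sizing every per-request buffer to the full
	// batch means each sub-request reserves capacity for the whole batch.
	// tsArray := make([]prompb.TimeSeries, 0, len(series))

	// Smaller-buffer pattern: start with a modest capacity (128 is an
	// arbitrary value chosen for illustration) and let append grow it only
	// as far as this particular sub-request actually needs.
	tsArray := make([]prompb.TimeSeries, 0, 128)

	batchBytes := 0
	for _, ts := range series {
		size := ts.Size() // protobuf-encoded size of this series
		if batchBytes+size >= maxBatchByteSize && len(tsArray) > 0 {
			requests = append(requests, &prompb.WriteRequest{Timeseries: tsArray})
			tsArray = make([]prompb.TimeSeries, 0, 128)
			batchBytes = 0
		}
		tsArray = append(tsArray, ts)
		batchBytes += size
	}
	if len(tsArray) > 0 {
		requests = append(requests, &prompb.WriteRequest{Timeseries: tsArray})
	}
	return requests
}
```

The actual patch targets the allocation sites inside batchTimeSeries itself; the sketch above only shows why capping the initial capacity changes the memory profile.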
I will prepare a PR with the proposed patch for further discussion.
Steps to Reproduce
Configure a metrics pipeline with the Prometheus remote write exporter and a large batch size (e.g. 100k data points or more), send large volumes of data through the pipeline, and observe the memory allocations in pprof (a minimal example configuration is sketched below).
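For example, a minimal configuration along these lines reproduces the setup (broker address and remote write endpoint are placeholders; the exact receiver settings are not important, only the large send_batch_size and the prometheusremotewrite exporter):

```yaml
extensions:
  pprof:                                # exposes /debug/pprof for heap profiles
    endpoint: localhost:1777

receivers:
  kafka:
    brokers: ["kafka:9092"]             # placeholder broker address

processors:
  batch:
    send_batch_size: 100000             # well above the 8192 default
    send_batch_max_size: 100000

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write   # placeholder endpoint

service:
  extensions: [pprof]
  pipelines:
    metrics:
      receivers: [kafka]
      processors: [batch]
      exporters: [prometheusremotewrite]
```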
Expected Result
Memory usage stays roughly proportional to the size of the individual remote write requests, even when the incoming batch is large.
Actual Result
Memory allocations scale with the configured batch size, because the per-request buffers are allocated with capacity equal to the full batch.
Collector version
v0.101.0
Environment information
Environment
OS: Ubuntu 20.04
Compiler (if manually compiled): go 1.21.12
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response