
[exporter/prometheus] Limit prometheusexporter metric cache size to protect against destabilizing memory use #34938

Open
swar8080 opened this issue Aug 30, 2024 · 3 comments

@swar8080 (Contributor)

Component(s)

exporter/prometheus

Is your feature request related to a problem? Please describe.

There's currently no limit on the number of metric series cached by this component, which increases the risk of destabilizing collectors through high memory use during a cardinality explosion or a gradual increase in cardinality.

The cache's metric_expiration/TTL option is useful, but it's hard to tune correctly. If the expiration is too short you get a lot of counter resets, which seems to hurt usability with PromQL; if it's too long you increase the risk of high memory use.

In our Prometheus set-up, we configure a sample_limit on collector scrape targets to reduce the impact of a cardinality explosion on Prometheus health and on cost. Ideally we could set the collector cache's max size slightly higher than the sample_limit (a rough sketch of both existing knobs is below).
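For reference, a minimal sketch of the two knobs mentioned above; the endpoint, limit, and expiration values are made up for illustration:

```yaml
# Collector config: today the only control on the exporter's cache is the expiration TTL.
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 10m   # series not updated within this window are dropped on the next scrape

# Prometheus config (separate file): sample_limit caps the scrape cost of a
# cardinality explosion, but the collector's cache still grows without bound.
scrape_configs:
  - job_name: otel-collector
    sample_limit: 10000
    static_configs:
      - targets: ["otel-collector:8889"]
```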

Describe the solution you'd like

An optional configuration option to limit the size of this component's metric cache. When capacity is exceeded, it could prioritize discarding the data points with the oldest timestamps. Temporarily exceeding the cache size is probably fine if it makes the implementation simpler or more performant, similar to how expired items aren't purged until the next scrape. A rough sketch of what the option could look like is below.
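To make the request concrete, something like the following, where max_cached_metrics is a purely hypothetical option name and its exact semantics are up for discussion:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 10m
    # Hypothetical new option (does not exist today): cap on cached metric series.
    # When exceeded, series with the oldest timestamps would be evicted first.
    # Ideally set slightly above the Prometheus scrape target's sample_limit.
    max_cached_metrics: 12000
```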

Describe alternatives you've considered

No response

Additional context

I can take a stab at implementing this. At first glance, golang-lru might support the "last updated" purging strategy, but I'd need to dig deeper into its thread safety and performance.

swar8080 added the enhancement and needs triage labels on Aug 30, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole (Contributor) commented Sep 3, 2024

I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.

If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g.
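As an illustration of that rule of thumb, assuming Prometheus (or the prometheus receiver) scrapes the collector's exporter endpoint every 60s:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # 2x the 60s scrape interval, so a single missed scrape doesn't expire
    # series and trigger counter resets on the following scrape.
    metric_expiration: 2m
```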

dashpole removed the needs triage label on Sep 3, 2024
dashpole self-assigned this on Sep 3, 2024
@swar8080 (Contributor, Author)

> I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
>
> If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g.

That suggestion helps a lot. We're using OTLP to push cumulative metrics to the collector. I didn't realize the OTel SDK sends a data point even if it hasn't been updated, so we don't have to worry about excessive counter resets. Combining that with the SDK cardinality limit should protect against label cardinality explosions.

The use case I'm not sure about is span metrics sent to prometheusexporter. We've had a few cardinality explosions caused by unique span names. The span metrics connector can either emit span counts as deltas as spans come in, or keep track of cumulative counts that are re-emitted on an interval, like an app SDK would. With delta temporality, infrequent series see a lot of counter resets if the prometheusexporter TTL is low; we're currently using a TTL of 24h. Cumulative span metrics solve the counter-reset problem and allow a low prometheusexporter TTL, but they shift the unbounded-memory problem to the span metrics connector's cache, since it also only supports a TTL. So adding a series limit to the span metrics connector could be another solution (relevant connector settings sketched below).
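For context on that trade-off, a hedged sketch of the connector settings involved; option names reflect my understanding of the spanmetrics connector and are worth double-checking against its README:

```yaml
connectors:
  spanmetrics:
    # Cumulative avoids the counter resets that deltas plus a short
    # prometheusexporter TTL cause for infrequent series...
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    # ...but the connector's own cache is then only bounded by this TTL,
    # so the unbounded-memory risk moves here instead.
    metrics_expiration: 24h
```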
