
[exporter/prometheus] Limit prometheusexporter metric cache size to protect against destabilizing memory use #34938

Open
swar8080 opened this issue Aug 30, 2024 · 3 comments

@swar8080 (Contributor)

Component(s)

exporter/prometheus

Is your feature request related to a problem? Please describe.

There's currently no limit on the number of metric series cached by this component, which increases the risk of destabilizing collectors through high memory use during a cardinality explosion or a gradual increase in cardinality.

The cache's metric_expiration/TTL option is useful, but it's hard to tune correctly. If the expiration is too short you get a lot of counter resets, which seems to hurt usability with PromQL; if it's too long you increase the risk of high memory use.

In our Prometheus set-up, we configure a sample_limit on collector scrape targets to reduce the impact of a cardinality explosion on Prometheus health and on cost. Ideally we could set the collector cache's max size slightly higher than the sample_limit (a rough sketch of both existing knobs is below).
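For reference, a minimal sketch of the two knobs mentioned above; the endpoint, limit, and expiration values are made up for illustration:

```yaml
# Collector config: today the only control on the exporter's cache is the expiration TTL.
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 10m   # series not updated within this window are dropped on the next scrape

# Prometheus config (separate file): sample_limit caps the scrape cost of a
# cardinality explosion, but the collector's cache still grows without bound.
scrape_configs:
  - job_name: otel-collector
    sample_limit: 10000
    static_configs:
      - targets: ["otel-collector:8889"]
```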

Describe the solution you'd like

An optional configuration option to limit the size of this component's metric cache. When capacity is exceeded, it could prioritize discarding the data points with the oldest timestamps. Temporarily exceeding the cache size is probably fine if it makes the implementation simpler or more performant, similar to how expired items aren't purged until the next scrape. A rough sketch of what the option could look like is below.
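To make the request concrete, something like the following, where max_cached_metrics is a purely hypothetical option name and its exact semantics are up for discussion:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 10m
    # Hypothetical new option (does not exist today): cap on cached metric series.
    # When exceeded, series with the oldest timestamps would be evicted first.
    # Ideally set slightly above the Prometheus scrape target's sample_limit.
    max_cached_metrics: 12000
```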

Describe alternatives you've considered

No response

Additional context

I can take a stab at implementing this. At first glance, golang-lru might support the "last updated" purging strategy, but I'd need to dig deeper into its thread safety and performance.

swar8080 added the enhancement and needs triage labels on Aug 30, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole (Contributor) commented Sep 3, 2024

I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.

If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g.
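As an illustration of that rule of thumb, assuming Prometheus (or the prometheus receiver) scrapes the collector's exporter endpoint every 60s:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # 2x the 60s scrape interval, so a single missed scrape doesn't expire
    # series and trigger counter resets on the following scrape.
    metric_expiration: 2m
```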

dashpole removed the needs triage label on Sep 3, 2024
dashpole self-assigned this on Sep 3, 2024
@swar8080 (Contributor, Author)

> I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
>
> If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g.

That suggestion helps a lot. We're using OTLP to push cumulative metrics to the collector. I didn't realize the OTel SDK sends a data point even if it hasn't been updated, so we don't have to worry about excessive counter resets. Combining that with the SDK cardinality limit should protect against label cardinality explosions.

The use case I'm not sure about is span metrics sent to prometheusexporter. We've had a few cardinality explosions caused by unique span names. The span metrics connector can either emit span counts as deltas as spans come in, or keep track of cumulative counts that are re-emitted on an interval, like an app SDK would. With delta temporality, infrequent series see a lot of counter resets if the prometheusexporter TTL is low; we're currently using a TTL of 24h. Cumulative span metrics solve the counter-reset problem and allow a low prometheusexporter TTL, but they shift the unbounded-memory problem to the span metrics connector's cache, since it also only supports a TTL. So adding a series limit to the span metrics connector could be another solution (relevant connector settings sketched below).
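For context on that trade-off, a hedged sketch of the connector settings involved; option names reflect my understanding of the spanmetrics connector and are worth double-checking against its README:

```yaml
connectors:
  spanmetrics:
    # Cumulative avoids the counter resets that deltas plus a short
    # prometheusexporter TTL cause for infrequent series...
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    # ...but the connector's own cache is then only bounded by this TTL,
    # so the unbounded-memory risk moves here instead.
    metrics_expiration: 24h
```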
