[exporter/prometheus] Limit prometheusexporter metric cache size to protect against destabilizing memory use #34938
I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so that a single failed scrape doesn't trigger counter resets. If you are using the prometheus receiver to scrape your targets, this shouldn't matter: staleness markers from a target disappearing in the receiver should cause the series to be removed from the cache. E.g.:
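A sketch of that rule of thumb, assuming a 60s scrape interval (`metric_expiration` is the exporter's existing TTL option; the endpoint value is illustrative):

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # 2x the 60s scrape interval, so a single missed scrape
    # doesn't expire series and trigger counter resets
    metric_expiration: 2m
```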
That suggestion helps a lot. We're using OTLP to push cumulative metrics to the collector. I didn't realize the OTel SDK sends a data point even if it hasn't been updated, so we don't have to worry about excessive counter resets. Combining that with the SDK cardinality limit should protect against label cardinality explosions. The use case I'm not sure about is span metrics sent to …
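For anyone else pushing from the SDK, a minimal sketch of enabling that cardinality limit, assuming the Go SDK's experimental `OTEL_GO_X_CARDINALITY_LIMIT` knob (check the sdk/metric experimental-feature docs for your SDK version before relying on it):

```go
package main

import "os"

func main() {
	// Assumption: experimental Go SDK feature. Caps each instrument
	// at 2000 streams; surplus data points are folded into an
	// overflow attribute set instead of creating new series.
	os.Setenv("OTEL_GO_X_CARDINALITY_LIMIT", "2000")
	// ... construct the MeterProvider after setting the variable ...
}
```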
Component(s)
exporter/prometheus
Is your feature request related to a problem? Please describe.
There's currently no limit on the number of metric series cached by this component, which increases the risk of destabilizing collectors through high memory use during a cardinality explosion or a sustained increase in cardinality.

The cache's `metric_expiration`/TTL option is useful, but it's hard to tune correctly: if the expiration is too short, you get a lot of counter resets, which hurts usability with PromQL; if it's too long, you increase the risk of high memory use.

In our Prometheus set-up, we configure a `sample_limit` on collector scrape targets to reduce the impact of a cardinality explosion on Prometheus health and on cost. Ideally, we could set the collector cache's max size slightly higher than the `sample_limit`.
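For context, a sketch of the Prometheus side of that set-up (values illustrative):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 60s
    # the scrape is treated as failed if the target exposes
    # more than this many samples
    sample_limit: 10000
    static_configs:
      - targets: ["collector:8889"]
```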
Describe the solution you'd like
An optional configuration to limit the size of this component's metric cache. When the capacity is exceeded, it could prioritize discarding the data points with the oldest timestamps. Temporarily exceeding the cache size is probably okay if that makes the implementation easier or more performant, similar to how expired items aren't purged until the next scrape.
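For illustration, the option might look something like this (`max_metric_points` is a hypothetical name, not an existing setting):

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5m
    # hypothetical option: cap the cache slightly above the prometheus
    # sample_limit so eviction kicks in before the scrape-side limit
    max_metric_points: 12000
```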
Describe alternatives you've considered
No response
Additional context
I can take a stab at implementing this. At first glance, golang-lru might allow for the "last updated" purging strategy, but I'd need to dig deeper into its thread safety and performance. A rough sketch of the idea below.
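A minimal sketch using hashicorp/golang-lru v2 (the key and value types here are placeholders; the real accumulator keys series by metric identity plus label set):

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

func main() {
	const maxSeries = 4 // stand-in for the proposed cache-size option

	// lru.New returns a thread-safe LRU cache; adding past capacity
	// silently evicts the least-recently-used entry.
	cache, err := lru.New[string, float64](maxSeries)
	if err != nil {
		panic(err)
	}

	// Add on an existing key also marks it most-recently-used, so
	// series that keep receiving data points survive while stale
	// series are evicted first -- approximating "last updated" purging.
	for i := 0; i < 6; i++ {
		evicted := cache.Add(fmt.Sprintf("series-%d", i), float64(i))
		fmt.Printf("added series-%d, evicted oldest: %v\n", i, evicted)
	}
	fmt.Println("cache size:", cache.Len()) // never exceeds maxSeries
}
```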