Metrics never being cleaned up generates Memory and CPU performance issues #44

mrzacarias · 2020-08-06T15:11:25Z

Unlike the normal Prometheus push gateway, PAG doesn't keep track of every pushed metric in memory; it keeps only the latest version of each merged metric. That reduces memory usage and makes it possible to handle heavy metric input loads, like the ones generated by browser-side apps.
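To make the "merge and keep the last value" idea concrete, here is a minimal sketch, not PAG's actual code. It assumes counter-style metrics that are merged by summing, and the names `MergedStore`, `series_key`, and `push` are made up for illustration. The point is that memory grows with the number of distinct series, not with the number of pushes.

```python
from typing import Dict, Tuple

# A series is identified by its metric name plus its sorted label pairs.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity for one time series (hypothetical helper)."""
    return name, tuple(sorted(labels.items()))


class MergedStore:
    """Keeps one merged value per series instead of every pushed sample."""

    def __init__(self) -> None:
        self._values: Dict[SeriesKey, float] = {}

    def push(self, name: str, labels: Dict[str, str], value: float) -> None:
        # Assumption: values are merged by summing, as for counters.
        key = series_key(name, labels)
        self._values[key] = self._values.get(key, 0.0) + value

    def get(self, name: str, labels: Dict[str, str]) -> float:
        return self._values[series_key(name, labels)]

    def __len__(self) -> int:
        return len(self._values)
```

Pushing the same series many times leaves a single entry, but every new label combination adds an entry that is never removed.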

Even with that "merge and keep the last value" optimization, the metrics are never cleaned up, so given enough time PAG will eventually exhaust its memory and CPU resources and blow up, as has happened a couple of times at my company. Since we have Cortex keeping track of metrics, PAG getting restarted every now and then is not a huge problem, but before it blows up we see an increase in the number of "bad requests", which makes us lose some good metrics until it restarts.

There's no need to keep the metrics on PAG forever, as they are constantly scraped and stored in Prometheus or Cortex long-term storage, so we should have a way to detect and remove old metrics from memory.
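One possible shape for such a cleanup, sketched under assumptions rather than taken from PAG: record when each series was last pushed and periodically drop the ones not seen within a TTL. The names `ExpiringStore`, `ttl_seconds`, and `sweep` are hypothetical, and the injectable `clock` is only there to make the behaviour easy to test.

```python
import time
from typing import Callable, Dict, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class ExpiringStore:
    """Merged metric store that forgets series not pushed within a TTL."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (merged value, time of the last push)
        self._entries: Dict[SeriesKey, Tuple[float, float]] = {}

    def push(self, name: str, labels: Dict[str, str], value: float) -> None:
        key = (name, tuple(sorted(labels.items())))
        current, _ = self._entries.get(key, (0.0, 0.0))
        # Assumption: values are merged by summing; the push refreshes the timestamp.
        self._entries[key] = (current + value, self._clock())

    def sweep(self) -> int:
        """Remove series last pushed more than ttl_seconds ago; return how many."""
        cutoff = self._clock() - self._ttl
        stale = [k for k, (_, seen) in self._entries.items() if seen < cutoff]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def __len__(self) -> int:
        return len(self._entries)
```

A real service would call `sweep()` from a background timer and guard the map with a lock. The TTL should be comfortably longer than the scrape interval, so that every value is scraped at least once before it is dropped. A series that reappears after being removed starts again from zero, which Prometheus treats as a counter reset.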

mrzacarias mentioned this issue Aug 6, 2020

Clean up old metrics using metrics timestamp #45

Open

xamaterasux mentioned this issue Jul 4, 2024

feature → setting ttl to clean up old metrics zapier/prom-aggregation-gateway#87

Open
