Description
I just incrementally raised prometheus-server's memory request to 20 GB for the pangeo-hubs
cluster. 18 GB apparently wasn't sufficient: memory usage peaked close to 19 GB before dropping to ~3-4 GB when Head GC completed
was logged ~5 minutes after startup.
This prometheus-server had a /data folder mounted from the attached PVC, holding 5.8 GB:

```
kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data
5.8G /data
```
The problem we have is that, as I understand it, the write-ahead log (WAL) is read from disk during startup to rebuild the in-memory state of all collected metrics, and that replay takes a lot of memory. The deeper problem is that we can't know this memory requirement in advance, because it grows over time as more metrics are collected.
Ideas
- We upper-bound the WAL size on disk instead of the collected metrics' age (see the retention sketch after this list)
- We work towards node sharing (New default machine types and profile list options - sharing nodes is great! #2121) so we get fewer metrics from nodes
- We try to limit the metrics collected by Prometheus to what we actually consume (see the relabeling sketch after this list)
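
A rough sketch of the first idea, assuming the prometheus-community/prometheus Helm chart layout (`server.retention` and `server.extraArgs`; key names may differ between chart versions). Prometheus's `--storage.tsdb.retention.size` flag bounds the total on-disk TSDB size; the WAL and m-mapped chunks count towards that limit, although only persistent blocks are deleted to enforce it:

```yaml
# Sketch: values for the prometheus-community/prometheus chart.
server:
  # Existing age-based retention.
  retention: "90d"
  extraArgs:
    # Size-based retention: the oldest persistent blocks are dropped once
    # the total on-disk size (blocks + WAL + m-mapped chunks all count
    # towards the limit) exceeds this.
    storage.tsdb.retention.size: 10GB
```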
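
And a rough sketch of the third idea: Prometheus's `metric_relabel_configs` can drop series at scrape time, before they reach the TSDB (and therefore the WAL). The job name and allowlist regex below are hypothetical and would need to be derived from what our Grafana dashboards actually query:

```yaml
# Sketch: scrape-time filtering in prometheus.yml (or the chart's
# serverFiles."prometheus.yml" equivalent).
scrape_configs:
  - job_name: kubernetes-nodes        # hypothetical job name
    metric_relabel_configs:
      # Keep only the node metrics we consume; drop everything else.
      - source_labels: [__name__]
        regex: node_(cpu|memory|filesystem|network)_.*   # hypothetical allowlist
        action: keep
```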
Example logs from a successful startup:

```
ts=2023-02-17T08:08:58.378Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25167 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25168 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25169 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:720 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=1.080338716s wal_replay_duration=1m28.600482965s wbl_replay_duration=184ns total_replay_duration=1m30.245179793s
ts=2023-02-17T08:09:06.733Z caller=main.go:1014 level=info fs_type=EXT4_SUPER_MAGIC
ts=2023-02-17T08:09:06.733Z caller=main.go:1017 level=info msg="TSDB started"
ts=2023-02-17T08:09:06.733Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/config/prometheus.yml
ts=2023-02-17T08:09:06.766Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.769Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.770Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml totalDuration=37.243136ms db_storage=3µs remote_storage=2.821µs web_handler=958ns query_engine=1.917µs scrape=30.766642ms scrape_sd=3.27419ms notify=2.525µs notify_sd=4.572µs rules=574.073µs tracing=8.964µs
ts=2023-02-17T08:09:06.770Z caller=main.go:978 level=info msg="Server is ready to receive web requests."
ts=2023-02-17T08:09:06.770Z caller=manager.go:953 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-02-17T08:11:56.487Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1674864000000 maxt=1674871200000 ulid=01GSF6Q33ZV1SA1YCH087EP9S2 duration=2m44.423969816s
ts=2023-02-17T08:12:15.236Z caller=head.go:1213 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=18.734578757s
```
Related
- Keep prometheus data for 1 year rather than 90 days #1779
- pangeo-hubs, prometheus: server is crashing due to memory limits #2215
- LEAP prometheus server is down/scheduler failing #2248 (200+ nodes -> many node exporters -> a lot of scraping -> many calico-typha -> little available CPU)
- Overview of grafana and prometheus related issues #2214