
Ideas to upper bound prometheus-server's memory consumption #2222

Closed

Description

I just slowly incremented prometheus-server's memory request to 20 GB for the pangeo-hubs cluster. 18 GB apparently wasn't sufficient: memory usage peaked at close to 19 GB before falling back to ~3-4 GB once "Head GC completed" was logged, ~5 minutes after startup.

This prometheus-server had a /data folder, mounted from the attached PVC, holding 5.8 GB of data:

kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data 
5.8G	/data

The problem we have is that the write-ahead log (WAL) is read from disk during startup to rebuild the in-memory state of all collected metrics, as I understand it, and that replay takes a lot of memory. Worse, we can't know this memory requirement in advance, because it grows over time as more metrics are collected.
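To see how large the WAL being replayed actually is, separate from the persisted blocks, the wal/ subdirectory can be inspected directly. A minimal sketch, reusing the deployment name from above:

kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data/wal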

Ideas

  1. We upper-bound the WAL size on disk instead of the age of collected metrics (see the retention sketch below)
  2. We work towards node sharing (New default machine types and profile list options - sharing nodes is great! #2121) so we get fewer metrics from nodes
  3. We try to limit the metrics collected by Prometheus to those we actually consume (see the relabeling sketch below)
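For idea 1, Prometheus has a size-based retention flag, --storage.tsdb.retention.size, alongside the default time-based --storage.tsdb.retention.time. Note that only persisted blocks are deleted to honor it (the WAL itself is only truncated at head compaction), so it bounds disk use rather than the WAL directly. A minimal sketch of wiring it through the prometheus-community/prometheus chart; the values.yaml key and the 4GB figure are assumptions for illustration, not a tested config:

server:
  extraArgs:
    # Cap the TSDB on disk; WAL and m-mapped chunks count towards the total,
    # but only persisted blocks are deleted to honor the limit.
    # 4GB is an illustrative value, not a recommendation.
    storage.tsdb.retention.size: "4GB"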
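For idea 3, the standard mechanism is metric_relabel_configs on each scrape job, which drops series after the scrape but before they are ingested into the head/WAL, so they never contribute to startup replay. A minimal sketch for prometheus.yml; the job name and dropped metric names are illustrative placeholders, not a vetted list of what we can safely drop:

scrape_configs:
  - job_name: kubernetes-nodes-cadvisor   # illustrative job name
    # ... existing kubernetes_sd_configs / relabel_configs kept as-is ...
    metric_relabel_configs:
      # Drop metric families we never query; dropped series are not ingested,
      # so they don't grow the WAL or the head block.
      - source_labels: [__name__]
        regex: "container_(network_tcp_usage_total|tasks_state)"
        action: drop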

Example logs from a successful startup

ts=2023-02-17T08:08:58.378Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25167 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25168 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25169 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:720 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=1.080338716s wal_replay_duration=1m28.600482965s wbl_replay_duration=184ns total_replay_duration=1m30.245179793s
ts=2023-02-17T08:09:06.733Z caller=main.go:1014 level=info fs_type=EXT4_SUPER_MAGIC
ts=2023-02-17T08:09:06.733Z caller=main.go:1017 level=info msg="TSDB started"
ts=2023-02-17T08:09:06.733Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/config/prometheus.yml
ts=2023-02-17T08:09:06.766Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.769Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.770Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml totalDuration=37.243136ms db_storage=3µs remote_storage=2.821µs web_handler=958ns query_engine=1.917µs scrape=30.766642ms scrape_sd=3.27419ms notify=2.525µs notify_sd=4.572µs rules=574.073µs tracing=8.964µs
ts=2023-02-17T08:09:06.770Z caller=main.go:978 level=info msg="Server is ready to receive web requests."
ts=2023-02-17T08:09:06.770Z caller=manager.go:953 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-02-17T08:11:56.487Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1674864000000 maxt=1674871200000 ulid=01GSF6Q33ZV1SA1YCH087EP9S2 duration=2m44.423969816s
ts=2023-02-17T08:12:15.236Z caller=head.go:1213 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=18.734578757s
