
Ideas to upper bound prometheus-server's memory consumption #2222

Closed

Description

I just slowly incremented prometheus-server's memory request to 20 GB for the pangeo-hubs cluster. 18 GB apparently wasn't sufficient: memory usage peaked at close to 19 GB before falling back to ~3-4 GB once "Head GC completed" was logged, ~5 minutes after startup.

This prometheus-server had a /data folder, mounted from the attached PVC, holding 5.8 GB of data:

kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data 
5.8G	/data

The problem we have is that the write-ahead log (WAL) is read from disk during startup to rebuild the in-memory state of all collected metrics, as I understand it, and that replay takes a lot of memory. Worse, we can't know this memory requirement in advance, because it grows over time as more metrics are collected.
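To see how large the WAL being replayed actually is, separate from the persisted blocks, the wal/ subdirectory can be inspected directly. A minimal sketch, reusing the deployment name from above:

kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data/wal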

Ideas

  1. We upper-bound the WAL size on disk instead of the age of collected metrics (see the retention sketch below)
  2. We work towards node sharing (New default machine types and profile list options - sharing nodes is great! #2121) so we get fewer metrics from nodes
  3. We try to limit the metrics collected by Prometheus to those we actually consume (see the relabeling sketch below)
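For idea 1, Prometheus has a size-based retention flag, --storage.tsdb.retention.size, alongside the default time-based --storage.tsdb.retention.time. Note that only persisted blocks are deleted to honor it (the WAL itself is only truncated at head compaction), so it bounds disk use rather than the WAL directly. A minimal sketch of wiring it through the prometheus-community/prometheus chart; the values.yaml key and the 4GB figure are assumptions for illustration, not a tested config:

server:
  extraArgs:
    # Cap the TSDB on disk; WAL and m-mapped chunks count towards the total,
    # but only persisted blocks are deleted to honor the limit.
    # 4GB is an illustrative value, not a recommendation.
    storage.tsdb.retention.size: "4GB"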
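For idea 3, the standard mechanism is metric_relabel_configs on each scrape job, which drops series after the scrape but before they are ingested into the head/WAL, so they never contribute to startup replay. A minimal sketch for prometheus.yml; the job name and dropped metric names are illustrative placeholders, not a vetted list of what we can safely drop:

scrape_configs:
  - job_name: kubernetes-nodes-cadvisor   # illustrative job name
    # ... existing kubernetes_sd_configs / relabel_configs kept as-is ...
    metric_relabel_configs:
      # Drop metric families we never query; dropped series are not ingested,
      # so they don't grow the WAL or the head block.
      - source_labels: [__name__]
        regex: "container_(network_tcp_usage_total|tasks_state)"
        action: drop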

Example logs from a successful startup

ts=2023-02-17T08:08:58.378Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25167 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25168 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=25169 maxSegment=25169
ts=2023-02-17T08:08:58.379Z caller=head.go:720 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=1.080338716s wal_replay_duration=1m28.600482965s wbl_replay_duration=184ns total_replay_duration=1m30.245179793s
ts=2023-02-17T08:09:06.733Z caller=main.go:1014 level=info fs_type=EXT4_SUPER_MAGIC
ts=2023-02-17T08:09:06.733Z caller=main.go:1017 level=info msg="TSDB started"
ts=2023-02-17T08:09:06.733Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/config/prometheus.yml
ts=2023-02-17T08:09:06.766Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.768Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.769Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-17T08:09:06.770Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml totalDuration=37.243136ms db_storage=3µs remote_storage=2.821µs web_handler=958ns query_engine=1.917µs scrape=30.766642ms scrape_sd=3.27419ms notify=2.525µs notify_sd=4.572µs rules=574.073µs tracing=8.964µs
ts=2023-02-17T08:09:06.770Z caller=main.go:978 level=info msg="Server is ready to receive web requests."
ts=2023-02-17T08:09:06.770Z caller=manager.go:953 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-02-17T08:11:56.487Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1674864000000 maxt=1674871200000 ulid=01GSF6Q33ZV1SA1YCH087EP9S2 duration=2m44.423969816s
ts=2023-02-17T08:12:15.236Z caller=head.go:1213 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=18.734578757s
