Parent: #5371
Problem
The Grafana alert "Backend-listen Memory usage is high" triple-counts `container_memory_working_set_bytes` because 3 Prometheus kubelet scrape services report identical cAdvisor metrics:

- `prod-kube-prometheus-stack-kubelet`
- `dg-prometheus-stack-kubelet` (leftover from DG self-hosted Helm)
- `prod-omi-kube-prometheus-s-kubelet`
The alert query does `sum by (pod)` on the numerator (memory usage), which sums across all 3 sources, but the denominator (limits from kube-state-metrics) has only 1 source. Result: ~3x inflated utilization.
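The inflation can be reproduced with a small sketch. The pod name and byte values below are hypothetical; only the three service names come from the issue.

```python
# Three kubelet scrape services report the same working-set bytes for the
# same pod, so a plain sum by (pod) adds the value three times, while the
# limit comes from a single kube-state-metrics source.
SERVICES = [
    "prod-kube-prometheus-stack-kubelet",
    "dg-prometheus-stack-kubelet",
    "prod-omi-kube-prometheus-s-kubelet",
]

working_set_bytes = 550 * 1024**2   # assumed: one pod using ~550 MiB
memory_limit_bytes = 2 * 1024**3    # assumed: 2 GiB memory limit

# Duplicate series: an identical value once per scrape service.
series = {(svc, "backend-listen-abc"): working_set_bytes for svc in SERVICES}

# Buggy query: sums every matching series regardless of service.
buggy_util = sum(series.values()) / memory_limit_bytes  # ~3x inflated

# Fixed query: filter to a single scrape service first.
fixed_util = sum(
    v for (svc, _), v in series.items()
    if svc == "prod-kube-prometheus-stack-kubelet"
) / memory_limit_bytes

print(f"buggy: {buggy_util:.0%}, fixed: {fixed_util:.0%}")
```

The buggy utilization is exactly 3x the fixed one, matching the ~82% vs. ~27% figures seen in the alert.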
Fix
1. Alert query fix (immediate)
Add `service="prod-kube-prometheus-stack-kubelet"` to the numerator selector.

Current:

```promql
sum(container_memory_working_set_bytes{namespace="prod-omi-backend", container!="", image!=""} * on(namespace,pod) group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{...}) by (pod)
```

Fixed:

```promql
sum(container_memory_working_set_bytes{namespace="prod-omi-backend", container!="", image!="", service="prod-kube-prometheus-stack-kubelet"} * on(namespace,pod) group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{...}) by (pod)
```
2. Deduplicate Prometheus scrapes (follow-up)
Investigate and remove the duplicate kubelet scrape configs:

- `dg-prometheus-stack-kubelet`: likely installed with the Deepgram self-hosted Helm chart
- `prod-omi-kube-prometheus-s-kubelet`: likely a second prometheus-stack install
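To confirm which pods are double-scraped before removing anything, the label sets of the cAdvisor series (e.g. from Prometheus's `/api/v1/series` endpoint) can be grouped by pod. A minimal sketch, using hypothetical sample data shaped like that endpoint's output:

```python
from collections import defaultdict

def find_duplicate_scrapes(series_labels):
    """Group cAdvisor series by (namespace, pod) and return the set of
    scrape services for any pod reported by more than one service."""
    services_by_pod = defaultdict(set)
    for labels in series_labels:
        services_by_pod[(labels["namespace"], labels["pod"])].add(labels["service"])
    return {pod: svcs for pod, svcs in services_by_pod.items() if len(svcs) > 1}

# Hypothetical label sets for container_memory_working_set_bytes series;
# the pod name is made up, the service names come from the issue.
sample = [
    {"namespace": "prod-omi-backend", "pod": "backend-listen-abc",
     "service": "prod-kube-prometheus-stack-kubelet"},
    {"namespace": "prod-omi-backend", "pod": "backend-listen-abc",
     "service": "dg-prometheus-stack-kubelet"},
    {"namespace": "prod-omi-backend", "pod": "backend-listen-abc",
     "service": "prod-omi-kube-prometheus-s-kubelet"},
]
print(find_duplicate_scrapes(sample))
```

Any pod appearing with more than one `service` label confirms duplicate scrape configs; after the cleanup, the returned dict should be empty.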
Note: This triple-counting likely affects ALL memory/CPU alerts across the cluster, not just backend-listen.
Verification
After the fix, the hottest pod should report ~27% utilization (not ~82%).
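The expected number is just the inflated reading with the duplication factor removed, which a quick arithmetic check confirms:

```python
# ~82% observed utilization divided by the 3 duplicate scrape sources
# should land near the expected ~27%.
inflated = 0.82
sources = 3
expected = inflated / sources
print(round(expected * 100))
```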
Driver: @mon-agent
CC: @thaingnguyen