Fix backend-listen memory alert query (3x metric triple-counting) #5372

@beastoin

Description

Parent: #5371

Problem

The Grafana alert "Backend-listen Memory usage is high" triple-counts container_memory_working_set_bytes because 3 Prometheus kubelet scrape services report identical cAdvisor metrics:

  1. prod-kube-prometheus-stack-kubelet
  2. dg-prometheus-stack-kubelet (leftover from DG self-hosted Helm)
  3. prod-omi-kube-prometheus-s-kubelet

The alert query applies sum by (pod) to the numerator (memory usage), so it adds up the identical series from all 3 scrape sources. The denominator (limits from kube-state-metrics) has only 1 source. Result: utilization is inflated by ~3x.

Fix

1. Alert query fix (immediate)

Add the label matcher service="prod-kube-prometheus-stack-kubelet" to the numerator's selector so that only one scrape source is summed:

Current:

sum(container_memory_working_set_bytes{namespace="prod-omi-backend", container!="", image!=""} * on(namespace,pod) group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{...}) by (pod)

Fixed:

sum(container_memory_working_set_bytes{namespace="prod-omi-backend", container!="", image!="", service="prod-kube-prometheus-stack-kubelet"} * on(namespace,pod) group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{...}) by (pod)
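The effect of the added matcher can be sketched with toy numbers. The pod name, working-set value, and limit below are made up for illustration; only the three service names come from this issue. The helper sum_by_pod is a hypothetical stand-in for sum(...) by (pod), not real PromQL evaluation.

```python
from collections import defaultdict

# The three kubelet scrape services named in this issue.
SERVICES = [
    "prod-kube-prometheus-stack-kubelet",
    "dg-prometheus-stack-kubelet",
    "prod-omi-kube-prometheus-s-kubelet",
]

# Made-up values for illustration only.
WORKING_SET_BYTES = 540 * 1024 * 1024      # same cAdvisor value seen by each source
LIMIT_BYTES = 2 * 1024 * 1024 * 1024       # single series from kube-state-metrics

# One identical sample per scrape service, differing only in the service label.
samples = [
    {"pod": "backend-listen-0", "service": s, "value": WORKING_SET_BYTES}
    for s in SERVICES
]


def sum_by_pod(series, service=None):
    """Mimic sum(...) by (pod); optionally keep only one service label value."""
    totals = defaultdict(int)
    for s in series:
        if service is None or s["service"] == service:
            totals[s["pod"]] += s["value"]
    return dict(totals)


inflated = sum_by_pod(samples)["backend-listen-0"] / LIMIT_BYTES
fixed = sum_by_pod(samples, service=SERVICES[0])["backend-listen-0"] / LIMIT_BYTES
print(f"unfiltered: {inflated:.1%}, filtered: {fixed:.1%}")
```

With identical values from each source, the unfiltered ratio is exactly three times the filtered one, which matches the ~3x inflation described above.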

2. Deduplicate Prometheus scrapes (follow-up)

Investigate and remove the duplicate kubelet scrape configs:

  • dg-prometheus-stack-kubelet — likely installed with Deepgram self-hosted Helm chart
  • prod-omi-kube-prometheus-s-kubelet — likely a second prometheus-stack install
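As a starting point for the investigation, duplicated series can be spotted by grouping label sets that are identical except for the service label. This is a minimal sketch: it assumes the label sets have already been fetched (for example as the list of label dictionaries that the Prometheus HTTP API returns from /api/v1/series), and the function name, the extra_ignored parameter, and the sample pods are hypothetical. In a real cluster other labels may also differ between sources and would need to be added to extra_ignored.

```python
from collections import defaultdict


def find_duplicate_sources(series, source_label="service", extra_ignored=()):
    """Return {identity: sorted source values} for series seen from more than one source.

    Two series share an identity when all labels match after dropping
    `source_label` and any labels listed in `extra_ignored`.
    """
    ignored = {source_label, *extra_ignored}
    groups = defaultdict(set)
    for labels in series:
        identity = tuple(sorted((k, v) for k, v in labels.items() if k not in ignored))
        groups[identity].add(labels.get(source_label, ""))
    return {ident: sorted(srcs) for ident, srcs in groups.items() if len(srcs) > 1}


# Illustrative label sets; pod names are made up.
example = [
    {"__name__": "container_memory_working_set_bytes", "pod": "backend-listen-0",
     "service": "prod-kube-prometheus-stack-kubelet"},
    {"__name__": "container_memory_working_set_bytes", "pod": "backend-listen-0",
     "service": "dg-prometheus-stack-kubelet"},
    {"__name__": "container_memory_working_set_bytes", "pod": "backend-listen-0",
     "service": "prod-omi-kube-prometheus-s-kubelet"},
    {"__name__": "container_memory_working_set_bytes", "pod": "other-pod-0",
     "service": "prod-kube-prometheus-stack-kubelet"},
]

for identity, sources in find_duplicate_sources(example).items():
    print(dict(identity)["pod"], "is reported by", len(sources), "services")
```

Any identity reported by more than one service is a candidate for the deduplication work in this step.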

Note: This triple-counting likely affects every memory/CPU alert in the cluster that sums kubelet cAdvisor metrics without a service filter, not just backend-listen.

Verification

After the fix, the hottest pod should report ~27% utilization instead of ~82%, i.e. roughly one third of the current value.
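The expected post-fix value follows from dividing the inflated reading by the number of duplicate sources. A small sketch of that check, assuming all three sources report equal values; the function names and the 2-point tolerance are hypothetical choices, not part of the alert:

```python
def expected_after_fix(observed_pct, duplicate_sources):
    """Utilization expected once only one of the duplicate sources is counted."""
    return observed_pct / duplicate_sources


def looks_fixed(reported_pct, observed_before_pct, duplicate_sources=3, tolerance_pct=2.0):
    """True if the new reading is within tolerance of the expected corrected value."""
    expected = expected_after_fix(observed_before_pct, duplicate_sources)
    return abs(reported_pct - expected) <= tolerance_pct


print(round(expected_after_fix(82, 3), 1))  # 27.3
print(looks_fixed(27, 82))                  # True
```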

Driver: @mon-agent
CC: @thaingnguyen

Labels: backend (Backend Task (python)), bug (Something isn't working)
