
Detect "cascading evictions" of DaemonSet pods #759

@mac-chaffee

When pods in a DaemonSet get repeatedly evicted (due to using too much RAM or ephemeral-storage), it seems there's currently no way to detect that situation.

  • Humans looking at kubectl get pods output won't detect it because the DaemonSet controller immediately deletes Evicted pods.
  • KubePodCrashLooping won't detect the issue because the Evicted pods are deleted and new ones are created.
  • KubeDaemonSetRolloutStuck/KubeDaemonSetNotScheduled/KubeDaemonSetMisScheduled won't detect the issue because all the pods are up-to-date and scheduled just fine; they just get killed a few seconds after creation.

I think we should add an alert for this kind of situation, but I'm unsure how to do it.

  • We could get the cluster-wide eviction rate with kubelet_evictions, but that metric doesn't include the pod name or anything else that identifies the affected workload.
  • kube_daemonset_status_number_unavailable has a chance of detecting the issue, but only if the evictions consistently line up with the 30-second interval at which it is scraped. Of the two cascading-eviction events I've witnessed, only one had evictions happening fast enough to show up on a graph consistently (and even that one didn't trigger any existing alerts). See the rough sketch after this list.
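
For concreteness, here is a rough sketch of what rules along these two lines could look like. The alert names, thresholds, `for` durations, and the job/instance selectors are assumptions that would need tuning per cluster, and neither expression can identify the offending DaemonSet, which is the core of the problem:

```yaml
groups:
  - name: cascading-evictions-sketch
    rules:
      # Sketch 1: sustained eviction rate reported by a kubelet.
      # kubelet_evictions is a counter labelled only by eviction_signal,
      # so this can say which node is evicting, but not which pod or
      # DaemonSet is affected. Threshold and windows are illustrative.
      - alert: KubeletEvictionRateHigh
        expr: sum by (instance) (rate(kubelet_evictions[15m])) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Kubelet {{ $labels.instance }} has been evicting pods continuously for 15 minutes.

      # Sketch 2: best-effort use of kube-state-metrics. This only fires
      # when the evicted pods happen to be unavailable at scrape time,
      # which is exactly the gap described above.
      - alert: KubeDaemonSetPodsRepeatedlyUnavailable
        expr: |
          avg_over_time(
            kube_daemonset_status_number_unavailable{job="kube-state-metrics"}[30m]
          ) > 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has had unavailable pods for 30 minutes.
```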

If we can't think of a better way to detect this, we may be forced to change some of the upstream metrics so they better detect evictions.
