
Detect "cascading evictions" of DaemonSet pods #759

@mac-chaffee

When pods in a DaemonSet get repeatedly evicted (due to using too much RAM or ephemeral-storage), it seems there's currently no way to detect that situation.

  • Humans looking at kubectl get pods output won't detect it because the DaemonSet controller immediately deletes Evicted pods.
  • KubePodCrashLooping won't detect the issue because the Evicted pods are deleted and new ones are created.
  • KubeDaemonSetRolloutStuck/KubeDaemonSetNotScheduled/KubeDaemonSetMisScheduled won't detect the issue because all the pods are up-to-date and scheduled just fine; they just get killed a few seconds after creation.

I think we should add an alert for this kind of situation, but I'm unsure how to do it.

  • We could get the cluster-wide eviction rate with kubelet_evictions, but that metric doesn't include the pod name or anything else that identifies the affected workload.
  • kube_daemonset_status_number_unavailable has a chance of detecting the issue, but only if the evictions consistently line up with the 30-second interval at which it is scraped. Of the two cascading-eviction events I've witnessed, only one had evictions happening fast enough to show up on a graph consistently (and even that one didn't trigger any existing alerts). See the rough sketch after this list.
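
For concreteness, here is a rough sketch of what rules along these two lines could look like. The alert names, thresholds, `for` durations, and the job/instance selectors are assumptions that would need tuning per cluster, and neither expression can identify the offending DaemonSet, which is the core of the problem:

```yaml
groups:
  - name: cascading-evictions-sketch
    rules:
      # Sketch 1: sustained eviction rate reported by a kubelet.
      # kubelet_evictions is a counter labelled only by eviction_signal,
      # so this can say which node is evicting, but not which pod or
      # DaemonSet is affected. Threshold and windows are illustrative.
      - alert: KubeletEvictionRateHigh
        expr: sum by (instance) (rate(kubelet_evictions[15m])) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Kubelet {{ $labels.instance }} has been evicting pods continuously for 15 minutes.

      # Sketch 2: best-effort use of kube-state-metrics. This only fires
      # when the evicted pods happen to be unavailable at scrape time,
      # which is exactly the gap described above.
      - alert: KubeDaemonSetPodsRepeatedlyUnavailable
        expr: |
          avg_over_time(
            kube_daemonset_status_number_unavailable{job="kube-state-metrics"}[30m]
          ) > 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has had unavailable pods for 30 minutes.
```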

If we can't think of a better way to detect this, we may be forced to change some of the upstream metrics so they better detect evictions.
