When pods in a DaemonSet get repeatedly evicted (due to using too much RAM or ephemeral-storage), it seems there's currently no way to detect that situation.
- Humans looking at `kubectl get pods` output won't detect it, because DaemonSets immediately delete Evicted pods.
- `KubePodCrashLooping` won't detect the issue, because the Evicted pods are deleted and new ones are created (see the sketch after this list).
- `KubeDaemonSetRolloutStuck` / `KubeDaemonSetNotScheduled` / `KubeDaemonSetMisScheduled` won't detect the issue, because all the pods are up to date and being scheduled just fine. They just get killed a few seconds after creation.
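For context, these alerts all key off states that an evicted-and-deleted pod never reaches. A rough approximation of a CrashLoopBackOff-style rule (the exact kubernetes-mixin expression varies by version, so treat this as an illustration, not the shipped rule):

```yaml
# Approximation of a CrashLoopBackOff alert, for illustration only.
# kube_pod_container_status_waiting_reason is exported by kube-state-metrics.
# An evicted DaemonSet pod is deleted before it ever reports CrashLoopBackOff,
# so the series never appears and this kind of alert can never fire.
- alert: KubePodCrashLooping
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
  for: 15m
  labels:
    severity: warning
```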
I think we should add an alert for this kind of situation, but I'm unsure how to do it.
- We could get the cluster-wide eviction rate with `kubelet_evictions`, but it doesn't include the pod name or anything (see the sketch after this list).
- `kube_daemonset_status_number_unavailable` has a chance of detecting the issue, but only if the evictions consistently happen around the 30-second interval when that is measured. In the two cascading-eviction events I've witnessed, only one of them had the evictions happening fast enough to show up on a graph consistently (and it still didn't trigger any existing alerts).
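As a starting point, a cluster-wide alert on the eviction rate might look roughly like the sketch below. It assumes `kubelet_evictions` is being scraped from every kubelet; the alert name, window, and threshold are placeholders, and the main weakness is exactly the one mentioned above: the metric carries no pod or namespace label, so it can only say that evictions are happening somewhere.

```yaml
# Hypothetical rule, not part of any existing mixin. Window and threshold
# are arbitrary; kubelet_evictions is a counter labelled by eviction_signal.
- alert: KubeletEvictionsTooHigh
  expr: |
    sum by (eviction_signal) (increase(kubelet_evictions[30m])) > 5
  labels:
    severity: warning
  annotations:
    description: >-
      Kubelets evicted more than 5 pods in the last 30 minutes due to
      {{ $labels.eviction_signal }} pressure. The metric has no pod or
      namespace label, so the affected workload has to be found manually.
```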
If we can't think of a better way to detect this, we may be forced to change some of the upstream metrics to better detect evictions.
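In the meantime, one partial workaround might be to alert on the unavailable count flapping rather than staying high. This only helps when kube-state-metrics happens to sample during the brief window in which a pod is missing, as noted above. A sketch, with a hypothetical alert name and arbitrary thresholds:

```yaml
# Hypothetical rule; catches a DaemonSet whose unavailable count keeps
# bouncing between 0 and >0, which is what repeated evictions look like
# when kube-state-metrics happens to sample during the gap.
- alert: KubeDaemonSetUnavailableFlapping
  expr: |
    changes(kube_daemonset_status_number_unavailable{job="kube-state-metrics"}[1h]) > 5
  labels:
    severity: warning
  annotations:
    description: >-
      DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has had its
      unavailable pod count change more than 5 times in the last hour, which
      may indicate pods being repeatedly evicted and recreated.
```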