Kubernetes creates thousands of failed kube-cert-agent pods under certain conditions #1507
What would happen if the kube-cert-agent deployment did not make any CPU or memory requests? Would it always get placed onto the requested node, even when that node has no available CPU or memory? Could that perhaps be a workaround?

What would be the downsides of no longer being in the Guaranteed QoS tier (see https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/)? One impact would be that in times of contention, the pod could be starved of CPU, since it made no request. Would having no request also make it more susceptible to eviction, since it would always be exceeding its request (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)?

How long do the Concierge pods cache (in memory) the data fetched from the kube cert agent? By caching in memory, they would be more resistant to the kube cert agent being temporarily starved of CPU during times of contention. However, if the Concierge pod were restarted (e.g. moved to a different node), it would lose its cache after the restart, so it would really need the kube cert agent to be able to respond to requests.

Would the kube cert agent controller need to handle this in a special way during upgrade, or are the CPU and memory requests editable on a Deployment?

I will investigate these questions in the comments below and try to draw some conclusions.
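As a point of reference for the QoS question above, here is a minimal sketch of the two extremes; the container names and resource values are hypothetical and are not the actual kube-cert-agent settings:

# Hypothetical values for illustration only (not the real kube-cert-agent numbers).
# Case 1: requests == limits for every container in the pod => the pod is
# placed in the "Guaranteed" QoS class.
spec:
  containers:
    - name: agent
      image: debian:stable-slim
      resources:
        requests:
          cpu: 10m
          memory: 32Mi
        limits:
          cpu: 10m
          memory: 32Mi
# Case 2: delete the whole resources block from every container => the pod is
# "BestEffort": the scheduler's resource-fit check trivially passes, but the
# pod is first in line for CPU starvation and node-pressure eviction.
#
# Either way, the resulting class can be checked on a running pod with:
#   kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'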
Confirmed on a kind cluster that using nodeName allows a pod with no resource requests to be started even when the node's CPU is already 100% requested:

# Deployed this onto a Kind cluster with a single node.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy: v1
  template:
    metadata:
      labels:
        test-deploy: v1
    spec:
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              # This is the exact amount of CPU remaining on my
              # cluster's node, causing it to be 100% requested
              # after this pod is scheduled.
              cpu: 7050m
# After the above Deployment finished, this second Deployment was then deployed successfully.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy-with-nodename
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy-with-nodename: v1
  template:
    metadata:
      labels:
        test-deploy-with-nodename: v1
    spec:
      # This is the name of the node on my cluster.
      nodeName: pinniped-control-plane
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          # Note that there are no resource requests/limits here.
          # The pod is started successfully despite the node's
          # CPU already being 100% requested before this
          # deployment starts.
Notes regarding upgrade:
Considering my question from above:
The Concierge controller uses a rate-limiting technique to avoid reloading the data for 15 minutes after a successful load. This prevents the controller from constantly contacting the kube cert agent pod unnecessarily and wasting resources. After 15 minutes have elapsed, it will eventually try to reload the data, because the data might have changed since it was last cached in memory. This data is not expected to change often (hardly ever, actually), so the 15 minute delay is acceptable.

If the Concierge pod running the controller has previously cached the data successfully, then it never removes it from the in-memory cache. It will try to overwrite/update the cached value after about 15 minutes, but if that attempt to get new data fails for any reason, it keeps the old data in its cache. This makes the process fairly resistant to temporary hiccups that prevent it from updating the data from the kube cert agent pod, as long as the data was successfully read at least once in the pod's lifetime.

Because the data is cached in memory permanently, the kube cert agent pod being temporarily starved of resources and unable to respond to requests should have minimal impact, unless that happens around the time that a Concierge pod is started or restarted. After startup, the Concierge pod will not have the data cached, and it will not be able to allow users to authenticate into the cluster until it can successfully fetch the data from the kube cert agent pod.
Ok, so to summarize the above comments...
I'm going to quickly note that in discussion @cfryanr pointed out that
Makes sense to me to reserve 0 CPU for the
Acceptance steps:
What happened?
The Pinniped Concierge controller, which automatically creates the kube-cert-agent Deployment on clusters where the control plane nodes are visible in the k8s API, will choose a control plane node and set nodeName: control-plane-node-name in the pod template spec of the Deployment. It does this because it wants that pod to run on a control plane node, and it wants to copy the same volume mounts that the control plane is using onto the new pod.

This generally works fine. However, when the selected control plane node does not have enough CPU or memory capacity to schedule the pod based on the pod's requested CPU and memory (which is tiny but non-zero), Kubernetes does something unexpected. The pod is created, it fails to be scheduled due to OutOfcpu, and then Kubernetes immediately tries again with no backoff. Within minutes, there are thousands of failed pods (at a rate of ~1000 pods created per ~7 minutes). Kubernetes does not clean up these pods either.

This strange behavior of Kubernetes can be seen outside of the context of Pinniped as well. On a Kind cluster, create a Deployment like the one below to see the exact same behavior:
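The exact manifest from the original report is not preserved in this copy of the issue, so the following is a minimal sketch of the same idea; the names and the 100m request are hypothetical. The key ingredients are a hard-coded nodeName plus a non-zero CPU request, applied to a node whose allocatable CPU is already 100% requested (for example, after first applying something like the test-deploy shown earlier in this thread):

# Minimal sketch (hypothetical names/values) reproducing the flood of failed
# pods on a single-node kind cluster whose CPU is already fully requested.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: repro-outofcpu
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      repro-outofcpu: v1
  template:
    metadata:
      labels:
        repro-outofcpu: v1
    spec:
      # Pin the pod to a specific node, bypassing the scheduler...
      nodeName: kind-control-plane
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              # ...while also requesting more CPU than the node has left.
              # The kubelet rejects each pod with OutOfcpu, and the
              # ReplicaSet controller immediately creates a replacement.
              cpu: 100m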
Immediately there will be many, many pods created, and they are not cleaned up automatically. (When the Deployment is deleted, all of its pods are also deleted.)
What did you expect to happen?
I would expect a more graceful failure mode for the kube-cert-agent Deployment where Kubernetes does not create thousands of dead pods when there is not enough capacity on the control plane node.
According to this related issue in the Kubernetes repo (see kubernetes/kubernetes#113907 (comment) and kubernetes/kubernetes#113907 (comment)):
Perhaps the Pinniped Concierge should not use the nodeName setting on the Deployment, but should instead use a different method to select the control plane node. If there is another method to select the node which does not trigger this strange behavior of Kubernetes, then that would be preferred. It would be better for the pod to get stuck in the Pending state when it cannot be scheduled due to a lack of resources than to create thousands of pods.

What else is there to know about this bug?
Any potential fix would need to consider upgrades. The kube-cert-agent Deployment may already exist on the cluster during an upgrade, so the Pinniped Concierge controller would need to handle that case gracefully by updating the existing Deployment, or by deleting and recreating it if the fix needs to change any of the Deployment's read-only fields.
The kube cert agent controller, which creates the Deployment, currently uses the following business logic:

1. Find the pods labeled "component": "kube-controller-manager" in the kube-system namespace, and pick the one with the newest CreationTimestamp.
2. Copy the NodeSelector, NodeName, and Tolerations (and some other stuff) from the pod chosen in the previous step, in an effort to cause the kube cert agent pod to get scheduled onto the same node as that pod. Also copy its Volumes and VolumeMounts so the new pod can read the same files.
3. In practice, the copied NodeSelector is empty and the NodeName is not, so we end up accidentally side-stepping the Kubernetes pod scheduler for the new pod.

See https://github.com/vmware-tanzu/pinniped/blob/v0.23.0/internal/controller/kubecertagent/kubecertagent.go#L558-L562. Is there some way that we could maybe use a NodeSelector or node affinity and still be sure that the new pod will be able to mount the same volumes?
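For illustration only, here is a minimal sketch (not Pinniped's actual implementation, and untested against its volume-copying logic) of how the pod template could target the same control plane node through the scheduler instead of via nodeName, using a nodeSelector on the node's hostname label plus a control plane toleration. With this approach, a pod that cannot fit on the node would simply stay Pending rather than being repeatedly created and rejected:

# Hypothetical alternative pod template spec fragment: let the scheduler place
# the pod instead of hard-coding spec.nodeName.
spec:
  # Target the specific node via the well-known hostname label. The value would
  # be copied from the chosen kube-controller-manager pod's spec.nodeName.
  nodeSelector:
    kubernetes.io/hostname: pinniped-control-plane
  # Control plane nodes are usually tainted, so the pod needs a matching
  # toleration (older clusters may use the node-role.kubernetes.io/master key).
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: kube-cert-agent
      image: example-registry/kube-cert-agent:latest  # placeholder image
      # The Volumes/VolumeMounts copied from the controller-manager pod would
      # go here, exactly as the controller does today.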