
Kubernetes creates thousands of failed kube-cert-agent pods under certain conditions #1507

Closed
cfryanr opened this issue Apr 18, 2023 · 8 comments
Assignees
Labels
bug Something isn't working state/accepted All done!

Comments

@cfryanr
Member

cfryanr commented Apr 18, 2023

What happened?

The Pinniped Concierge controller that automatically creates the kube-cert-agent Deployment (on clusters where the control plane nodes are visible in the k8s API) chooses a control plane node and sets nodeName: control-plane-node-name in the Deployment's pod template spec. It does this because it wants that pod to run on a control plane node, and it wants to copy onto the new pod the same volume mounts that the control plane is using.

This generally works fine. However, when the selected control plane node does not have enough CPU or memory capacity to accommodate the pod's requested CPU and memory (which are tiny but non-zero), Kubernetes does something unexpected. The pod is created, it is rejected with status OutOfcpu, and Kubernetes immediately creates a replacement with no backoff. Within minutes there are thousands of failed pods (at a rate of roughly 1000 pods per 7 minutes), and Kubernetes does not clean up these pods either.

This strange behavior of Kubernetes can be seen outside of the context of Pinniped as well. On a Kind cluster, create this Deployment to see the exact same behavior:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy: v1
  template:
    metadata:
      labels:
        test-deploy: v1
    spec:
      nodeName: kind-control-plane # this is the name of a control plane node on the cluster
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              cpu: 8000m # this is more than is available on the control plane node at the time of deployment

Immediately, many pods will be created, and they are not cleaned up automatically. (When the Deployment is deleted, all of its pods are also deleted.)

What did you expect to happen?

I would expect a more graceful failure mode for the kube-cert-agent Deployment where Kubernetes does not create thousands of dead pods when there is not enough capacity on the control plane node.

According to this related issue in the Kubernetes repo (see kubernetes/kubernetes#113907 (comment) and kubernetes/kubernetes#113907 (comment)):

If you want to target a specific node and wait until the node has resources, a nodeSelector or node affinity that selects only the desired node is the way to accomplish that.

Perhaps the Pinniped Concierge should not use the nodeName setting on the Deployment, but should use a different method to select the control plane node. If there is another method to select the node which does not trigger this strange behavior of Kubernetes, then that would be preferred. It would be better for the pod to get stuck in Pending state when it cannot be scheduled due to a lack of resources than to create thousands of pods.
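
For example, here is a sketch of the same test Deployment from above, with the nodeName field replaced by a nodeSelector on the node's well-known kubernetes.io/hostname label (the toleration is included in case the control plane node is tainted; the node name is the hypothetical Kind default). With this shape the pod goes through the scheduler, so when the node lacks capacity the single replica simply stays Pending instead of generating thousands of failed pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy-with-nodeselector
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy-with-nodeselector: v1
  template:
    metadata:
      labels:
        test-deploy-with-nodeselector: v1
    spec:
      # Constrain the pod to the control plane node via its hostname label,
      # instead of bypassing the scheduler with nodeName.
      nodeSelector:
        kubernetes.io/hostname: kind-control-plane
      # Needed if the control plane node carries the usual control plane taint.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              cpu: 8000m # more than the node has available, so the pod waits in Pending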

What else is there to know about this bug?

Any potential fix would need to consider upgrades. The kube-cert-agent Deployment may already exist on the cluster during an upgrade. The Pinniped Concierge controller would need to gracefully handle that case by updating the Deployment, or by deleting/recreating the Deployment if it needs to change any read-only fields of the Deployment as part of this solution.

The Kube cert agent controller, which creates the deployment, is currently using the following business logic:

  1. Out of all healthy controller manager pods (pods with the label "component": "kube-controller-manager" in the kube-system namespace), pick the one with the newest CreationTimestamp.
  2. On the new Deployment, copy the NodeSelector, NodeName, Tolerations (and some other fields) from the pod chosen in the previous step, in an effort to get the kube cert agent pod scheduled onto the same node as that pod. Also copy its Volumes and VolumeMounts so the new pod can read the same files.
  3. The problem seems to be that the NodeSelector is empty and the NodeName is not, so we end up accidentally side-stepping the Kubernetes pod scheduler for the new pod.

See https://github.com/vmware-tanzu/pinniped/blob/v0.23.0/internal/controller/kubecertagent/kubecertagent.go#L558-L562. Is there some way that we could maybe use a NodeSelector or node affinity and still be sure that the new pod will be able to mount the same volumes?

@cfryanr changed the title from "Kubernetes creates thousands of failed kuber-cert-agent pods under certain conditions" to "Kubernetes creates thousands of failed kube-cert-agent pods under certain conditions" on Apr 18, 2023
@cfryanr
Member Author

cfryanr commented Jun 23, 2023

What would happen if the kube-cert-agent deployment did not make any cpu or memory requests? Would it always get placed onto the requested node, even when that node has no available CPU or memory? Could that perhaps be a workaround?

What would be the downsides of not being in the guaranteed QoS tier anymore (see https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/)? One impact would be that in times of contention, the pod could be starved of CPU, since it made no request. Would having no request make it more susceptible to eviction, since it would always be exceeding its request (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)?
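
For reference, the QoS class is determined by the shape of the containers' resources blocks. A minimal annotated summary (these are fragments for illustration, not a complete manifest), following the pod-qos docs linked above:

# "Guaranteed": every container sets both CPU and memory limits, and the
# requests equal the limits (omitted requests default to the limits).
resources:
  limits:
    cpu: 100m
    memory: 32Mi

# "Burstable": at least one request or limit is set, but the pod does not meet
# the "Guaranteed" criteria (for example, an explicit request of zero plus a limit).
resources:
  requests:
    cpu: "0"
  limits:
    cpu: 100m
    memory: 32Mi

# "BestEffort": no container sets any requests or limits.
resources: {}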

How long do the Concierge pods cache (in memory) the data fetched from the kube cert agent? By caching in memory, they would be more resistant to the kube cert agent being temporarily starved of CPU during times of contention. However, if the Concierge pod were to be restarted (e.g. moved to a different node), then after the restart the pod would have lost its cache, so it would really need the kube cert agent to be able to respond to requests.

Would the kube cert agent controller need to handle this in a special way during upgrade, or are the CPU and memory requests editable on a Deployment?

I will investigate these questions in the comments below and try to draw some conclusions.

@cfryanr
Member Author

cfryanr commented Jun 23, 2023

Confirmed on a kind cluster that using nodeName to place the pod onto a node which has its CPU 100% requested does not trigger this bug when the pod makes no CPU request.

# Deployed this onto a Kind cluster with a single node.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy: v1
  template:
    metadata:
      labels:
        test-deploy: v1
    spec:
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              # This is the exact amount of CPU remaining on my
              # cluster's node, causing it to be 100% requested
              # after this pod is scheduled.
              cpu: 7050m

# After the above Deployment finished, this one was deployed successfully.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy-with-nodename
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy-with-nodename: v1
  template:
    metadata:
      labels:
        test-deploy-with-nodename: v1
    spec:
      # This is the name of the node on my cluster.
      nodeName: pinniped-control-plane
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          # Note that there are no resource requests/limits here.
          # The pod is started successfully despite the node's
          # CPU already being 100% requested before this
          # deployment starts.

@cfryanr
Member Author

cfryanr commented Jun 23, 2023

Notes regarding upgrade:

  • When using kubectl apply, Kubernetes does allow a Deployment's pod spec's resource requests and limits to be updated. So it seems that the Concierge controller would not need to do anything special to handle updating this field.
  • When a Deployment's pod spec is updated, it tries to start a new pod before it terminates the old pod. So if the Deployment requests CPU or memory for that new pod, there may not be enough capacity on the node to start the extra pod while the old pod is still running. This would cause the new pod to get stuck in a Pending state, and the Deployment would never progress. However, this should not be a problem if we remove both the CPU and memory requests from the Deployment's pod spec, because the new pod will not request any resources and therefore will not be blocked from starting (even while the old pod is still holding its resources).
  • Note that a pod with a CPU limit but no explicit CPU request will get an automatic CPU request equal to its limit (the same is true for memory requests/limits). This automatic request can cause the same bug as an explicit one, so having no request with an explicit limit does not help.
  • However, it appears that an explicit request of 0 combined with an explicit limit does not get any automatic request (at least when using kubectl apply). The request is left at zero, avoiding this bug. So we could keep the limits if we wanted to, as long as we are careful to explicitly set the requests to zero. We would need to check that the same happens when the Deployment is created by golang (update: I confirmed that this works from the golang controller code too; it is able to update the Deployment's pod spec's requests during an upgrade).
  • Having limits should never stop the pod from being scheduled, since the node allows limits to be overcommitted (totaling more than 100%).
  • A request of zero with an explicit limit assigns the pod to QoS class "Burstable", while having no limits and no requests assigns it to QoS class "BestEffort". "BestEffort" pods are evicted before "Burstable" pods, so "Burstable" is considered higher priority in that sense (see https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/). So it might be best to keep the limits on this pod (see the sketch after this list).
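
As a concrete sketch of the request-of-zero idea (modeled on the test Deployments above; the limit values are made up for illustration and are not the controller's actual limits), such a pod spec could look like this, and it would land in the "Burstable" QoS class:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy-zero-requests
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      test-deploy-zero-requests: v1
  template:
    metadata:
      labels:
        test-deploy-zero-requests: v1
    spec:
      nodeName: pinniped-control-plane # the name of the control plane node, as in the examples above
      containers:
        - name: sleeper
          image: debian:stable-slim
          imagePullPolicy: IfNotPresent
          command: [ sleep, infinity ]
          resources:
            requests:
              cpu: "0"     # explicit zero, so no automatic request is derived from the limit
              memory: "0"  # explicit zero here too
            limits:
              cpu: 30m     # illustrative limits; keeping them makes the pod "Burstable"
              memory: 32Mi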

@cfryanr
Member Author

cfryanr commented Jun 23, 2023

Considering my question from above:

How long do the Concierge pods cache (in memory) the data fetched from the kube cert agent? By caching in memory, they would be more resistant to the kube cert agent being temporarily starved of CPU during times of contention. However, if the Concierge pod were to be restarted (e.g. moved to a different node), then after the restart the pod would have lost its cache, so it would really need the kube cert agent to be able to respond to requests.

The Concierge controller in kubecertagent.go will load the data from the kube cert agent pod.

It uses a rate limiting technique to avoid reloading the data for 15 minutes after a successful load. This prevents the controller from constantly contacting the kube cert agent pod unnecessarily and wasting resources. After 15 minutes have elapsed, it will eventually try to reload the data, because the data might have changed since it was cached in-memory. This data is not expected to change often (hardly ever, actually), so the 15 minute delay is acceptable.

If the Concierge pod running the controller has previously successfully cached the data, then it will never remove it from the in-memory cache. It will try to overwrite/update it in the cache after about 15 mins, but if that attempt to get the new data fails for any reason, it will keep the old data in its cache. This makes the process fairly resistant to temporary hiccups which prevent it from updating the data from the kube cert agent pod, as long as it was previously successfully read once in the pod's lifetime.

Because the data is cached in-memory permanently, the kube cert agent pod being temporarily starved of resources and unable to respond to requests should have minimal impact, unless that happens around the time that a Concierge pod is started or restarted. After pod startup, the Concierge pod will not have the data cached, and it will not be able to allow users to authenticate into the cluster until it is able to successfully fetch the data from the kube cert agent pod.

@cfryanr
Member Author

cfryanr commented Jun 23, 2023

Ok, so to summarize the above comments...

  • One possible fix is to simply change these two lines of code to explicitly request "0": https://github.com/vmware-tanzu/pinniped/blob/v0.24.0/internal/controller/kubecertagent/kubecertagent.go#L550-L551.
  • This should prevent this bug. The pod will be placed onto the node and started even if there is no unrequested CPU or memory remaining on that node.
  • This should work for both upgrades and fresh installs.
  • By keeping the resource limits, the pod will be assigned QoS class "Burstable", which is more susceptible to eviction than it used to be, but less susceptible than if it had no limits.
  • If the pod gets temporarily starved for resources, it will not affect the user experience, unless that happens while a Concierge pod is just starting, in which case users will temporarily not be able to authenticate to the cluster.

@benjaminapetersen
Member

benjaminapetersen commented Jul 20, 2023

I'll quickly note that in discussion @cfryanr pointed out that Request is really Reserve, so request: 0 just means "reserve no cpu for this pod". This makes sense, since this pod runs a command only rarely, so it does not need constant cpu.

@pinniped-ci-bot added the enhancement (New feature or request), priority/backlog (Prioritized for an upcoming iteration), and bug (Something isn't working) labels, and removed the enhancement (New feature or request) label, on Jul 20, 2023
@joshuatcasey
Member

Makes sense to me to reserve 0 CPU for the kube-cert-agent.

@pinniped-ci-bot added the state/started (Someone is working on it currently), state/finished (Code finished but not yet delivered), and state/delivered (Ready for manual acceptance review) labels, and removed the state/started and state/finished labels, on Jul 27, 2023
@cfryanr
Member Author

cfryanr commented Aug 2, 2023

Acceptance steps:

  1. Create a kind cluster with 1 control-plane node and 1 worker node.

  2. kubectl describe node -l 'node-role.kubernetes.io/control-plane=' and note the amount of "Allocatable" CPU
    minus the amount of "Allocated" CPU "Requests". On my computer, the allocatable CPU is 8 (equal to 8000m)
    and the allocated is 950m. The difference is 7050m.

  3. Create a Deployment which requests all the remaining CPU on the control-plane node. E.g.:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deploy
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          test-deploy: v1
      template:
        metadata:
          labels:
            test-deploy: v1
        spec:
          nodeName: pinniped-control-plane # this is the name of a control plane node on the cluster
          containers:
            - name: sleeper
              image: debian:stable-slim
              imagePullPolicy: IfNotPresent
              command: [ sleep, infinity ]
              resources:
                requests:
                  cpu: 7050m # this is the exact amount available on the control plane node at the time of deployment
  4. kubectl apply the Deployment.

  5. Describe the control plane node again. Now the "Allocated" CPU "Requests" should be exactly equal to the
    "Allocatable" CPU. The node is completely full.

  6. Now install an old version of the Pinniped Concierge which suffered from this bug:
    kapp deploy --app pinniped-concierge --file https://get.pinniped.dev/v0.24.0/install-pinniped-concierge.yaml

  7. The app should install because there is plenty of room on the worker node. However, kubectl get pods -A should
    show hundreds of pinniped-concierge-kube-cert-agent-* pods within a few minutes. This reproduces the bug.

  8. Remove the old version of the Concierge: kapp delete --app pinniped-concierge --yes

  9. Confirm with kubectl get pods -A that all the kube cert agent pods, and the other Concierge pods, are gone.

  10. Now build and install a version of the Concierge from the HEAD of the main branch. One easy way to do that is
    to run ./hack/prepare-for-integration-tests.sh which will compile and deploy the Concierge (and other stuff).

  11. kubectl get pods -A again and see that the pinniped-concierge-kube-cert-agent-* pod is successfully running
    on the control plane node.

  12. Follow the directions from https://pinniped.dev/docs/tutorials/concierge-only-demo/ to create a user and a
    WebhookAuthenticator and try logging in. Note that the Concierge and the local-user-authenticator are already
    installed, so you can skip those steps from the guide.

  13. By being able to authenticate as the pinny user, you have proved that the kube-cert-agent pod is functioning
    as needed to enable the capabilities of the Concierge.

@pinniped-ci-bot added the state/accepted (All done!) label, and removed the priority/backlog (Prioritized for an upcoming iteration) and state/delivered (Ready for manual acceptance review) labels, on Aug 2, 2023