Pods using ephemeral volumes stuck terminating after kubelet restart #1027
Comments
@shanecunningham Thanks for bringing this up. I am afraid I was not able to see any logs from csi_attacher.go or nestedpendingoperations.go. Can you please check whether you were using the in-tree driver by any chance? Are those kubelet logs, and can you explain where you got those logs from?
@nirmalaagash Thank you for looking into this. My logs were all from kubelet. I'm not using in-tree; I'm seeing this problem across both migrated volumes and new volumes created with the CSI driver. I tested the newest release, v1.2.0, and fix #1019 seems to partially fix it. If I manually restart kubelet on a node with a pod using a volume, the pod still hangs stuck on Terminating, and I see the same errors. So this still affects volumes after a process that restarts all kubelets, such as a Kubernetes version upgrade. Can you replicate this, or do you have any suggestions?
@shanecunningham Please help me reproduce this issue. Can you explain the steps in detail (the manifest you used to create the ephemeral volume, the storage class, the log level on each container, and how you collected those kubelet logs)?
@nirmalaagash Since I posted this issue I've confirmed it's happening on all volumes, not just ephemeral ones, but I'll outline the steps for ephemeral volumes.
Log levels are the defaults as far as I know: v=2 for the driver containers and the default for kubelet. I collected the logs by SSHing to the worker node and using journalctl.
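Something along these lines reproduces what I'm describing (a sketch only; the pod name, image, storage class name, and size are placeholders, not my exact manifest):

```bash
# Sketch of a pod with a generic ephemeral volume; all names and sizes below are placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-test            # placeholder name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: ebs-sc   # assumed EBS CSI storage class name
            resources:
              requests:
                storage: 4Gi
EOF
```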
Thanks @shanecunningham, I'll look into it.
@shanecunningham Like you mentioned, I could see the issue of the pod stuck at Terminating with v1.2.0. Please use the release-1.2 branch, which has the recent aws-ebs-csi-driver v1.2.1 image tag. Below is a screenshot of the pod not getting stuck at Terminating using v1.2.1. Let me know if this works.
@nirmalaagash Yesterday I tried using the k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.2.1 image by editing the daemonset and was still seeing pods get stuck. Let me try redeploying using the Helm chart; it looks like it was just updated for 1.2.1, so maybe I missed a change somewhere else.
@shanecunningham You should edit both the deployment 'ebs-csi-controller' and the daemonset 'ebs-csi-node' in the kube-system namespace for the changes to take effect. Also, you can try it with the Helm chart.
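For example, something like the following should update both (assuming the default ebs-plugin container name used by the chart; adjust if yours differs):

```bash
# Assumes the container in both workloads is named "ebs-plugin" (the chart default).
kubectl -n kube-system set image deployment/ebs-csi-controller \
  ebs-plugin=k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.2.1
kubectl -n kube-system set image daemonset/ebs-csi-node \
  ebs-plugin=k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.2.1
```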
@nirmalaagash Ah, I did forget the controllers. So I updated both to v1.2.1 and I can still reproduce this. After a kubelet restart, the mounting errors are logged and the pod is never terminated until another kubelet restart.
These are the logs from ebs-plugin on this node.
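(Collected with something along these lines; the label selector is an assumption based on the chart defaults:)

```bash
# Sketch: pulling ebs-plugin logs for the node in question; the label and node name are placeholders.
NODE=<node-name>
kubectl -n kube-system logs -c ebs-plugin \
  $(kubectl -n kube-system get pod -l app=ebs-csi-node \
      --field-selector spec.nodeName=$NODE -o name)
```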
Versions I'm running.
These logs are being repeated from the controller's ebs-plugin container.
@shanecunningham From your logs, I can see that the problem is in NodeStageVolume, but I have a few things to clarify. I am not sure why there are two device names: /dev/nvme2n1 appears in both of the mount outputs, and in the second-to-last line there is another device name, /dev/nvme1n1. So I am not sure whether two devices are involved. I would be happy to discuss this further on the Slack channel under the same GitHub username.
@shanecunningham I am still having trouble reproducing the issue. I tried enabling the GenericEphemeralVolume=true feature gate so that ephemeral volumes work in a v1.20.x Kubernetes cluster and tried the deployment, but I was facing the error below.
Please provide the configuration used in the cluster, such as the feature gates that are enabled and any other cluster-specific configuration.
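(For context, on v1.20.x the gate is still alpha, so it has to be enabled explicitly on the relevant components; a sketch of how one might check that, assuming a kubeadm-style setup:)

```bash
# Sketch only: the gate is passed as --feature-gates=GenericEphemeralVolume=true (or via the
# featureGates field in component configs) to the control-plane components and kubelets.
# Quick checks on a kubeadm-style node:
grep -r "feature-gates" /etc/kubernetes/manifests/      # control-plane static pod manifests
grep -i "featureGates" /var/lib/kubelet/config.yaml     # kubelet config (kubeadm default path)
```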
This should be fixed by #1082, which just merged. The fix will be backported and released in v1.3.x.
/kind bug
What happened?
After a Kubernetes in-place upgrade, pods with ephemeral volumes were stuck in the Terminating state after a delete.
What you expected to happen?
Ephemeral volume pods to delete properly.
How to reproduce it (as minimally and precisely as possible)?
Deploy a pod using an ephemeral volume, restart kubelet on the host, then delete the pod.
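A sketch of those steps as commands (pod and node names are placeholders):

```bash
# Sketch of the reproduction; "ephemeral-test" and <node> are placeholders.
kubectl apply -f ephemeral-pod.yaml            # a pod using a generic ephemeral volume
kubectl get pod ephemeral-test -o wide         # note which node it landed on
ssh <node> 'sudo systemctl restart kubelet'    # restart kubelet on that node
kubectl delete pod ephemeral-test              # the delete now hangs
kubectl get pod ephemeral-test                 # STATUS stays Terminating
```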
Anything else we need to know?:
From what I can tell, the kubelet restart is what puts the pod/volume in a bad state, which results in the pod getting stuck terminating.
After a kubelet restart, the logs show a problem mounting: the volume is already in use by the pod.
This error is repeated. Then a delete is issued for the pod, and the VolumeAttachment is deleted.
Since the VolumeAttachment has been deleted, GETs start to fail.
The previous log line is repeated until kubelet is restarted again, which allows the volume/pod to be cleaned up. However, I found that if another ephemeral volume lands on this host, it runs into the same stuck-terminating issue as before and requires another kubelet restart.
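A sketch of commands to observe this state (the pod name is a placeholder):

```bash
# Sketch: inspecting the stuck state described above; "ephemeral-test" is a placeholder pod name.
kubectl get pod ephemeral-test                 # stays in Terminating
kubectl get volumeattachment                   # the attachment for the PV is gone after the delete
journalctl -u kubelet | grep -E "csi_attacher|nestedpendingoperations"   # the repeated errors
```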
Environment
Kubernetes version (use kubectl version): 1.20.8