Skip to content

Latest commit

 

History

History
182 lines (151 loc) · 7.08 KB

csi-debug.md

File metadata and controls

182 lines (151 loc) · 7.08 KB

CSI driver troubleshooting guide

Case#1: volume create/delete issue

This step is not applicable if you are using managed CSI driver on AKS.

  • find csi driver controller pod

There could be multiple controller pods (only one pod is the leader), if there are no helpful logs, try to get logs from the leader controller pod.

kubectl get po -o wide -n kube-system | grep csi-blob-controller
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE
csi-blob-controller-56bfddd689-dh5tk       4/4     Running   0          35s     10.240.0.19    k8s-agentpool-22533604-0
csi-blob-controller-56bfddd689-sl4ll       4/4     Running   0          35s     10.240.0.23    k8s-agentpool-22533604-1
  • get pod description and logs
kubectl describe pod csi-blob-controller-56bfddd689-dh5tk -n kube-system > csi-blob-controller-description.log
kubectl logs csi-blob-controller-56bfddd689-dh5tk -c blob -n kube-system > csi-blob-controller.log

Case#2: volume mount/unmount failed

  • locate csi driver pod and make sure which pod does the actual volume mount/unmount
kubectl get po -o wide -n kube-system | grep csi-blob-node
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE
csi-blob-node-cvgbs                        3/3     Running   0          7m4s    10.240.0.35    k8s-agentpool-22533604-1
csi-blob-node-dr4s4                        3/3     Running   0          7m4s    10.240.0.4     k8s-agentpool-22533604-0
  • get pod description and logs
kubectl describe pod csi-blob-node-cvgbs -n kube-system > csi-blob-node-description.log
kubectl logs csi-blob-node-cvgbs -c blob -n kube-system > csi-blob-node.log

note: to watch logs in realtime from multiple csi-blob-node DaemonSet pods simultaneously, run the command:

kubectl logs daemonset/csi-blob-node -c blob -n kube-system -f

get blobfuse-proxy logs on the node

journalctl -u blobfuse-proxy -l
  • check blobfuse mount inside driver
kubectl exec -it csi-blob-node-9vl9t -c blob -n kube-system -- mount | grep blobfuse
blobfuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-efce16db-bf15-4634-b82b-068385019d7c/globalmount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
blobfuse on /var/lib/kubelet/pods/e73d0984-a253-4203-9e8c-9237ae5c55d5/volumes/kubernetes.io~csi/pvc-efce16db-bf15-4634-b82b-068385019d7c/mount type fuse (rw,relatime,user_id=0,group_id=0,allow_other)
  • check nfs mount inside driver
kubectl exec -it csi-blob-node-9vl9t -n kube-system -c blob -- mount | grep nfs
accountname.file.core.windows.net:/accountname/pvcn-46c357b2-333b-4c42-8a7f-2133023d6c48 on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-46c357b2-333b-4c42-8a7f-2133023d6c48/globalmount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.244.0.6,local_lock=none,addr=20.150.29.168)
accountname.file.core.windows.net:/accountname/pvcn-46c357b2-333b-4c42-8a7f-2133023d6c48 on /var/lib/kubelet/pods/7994e352-a4ee-4750-8cb4-db4fcf48543e/volumes/kubernetes.io~csi/pvc-46c357b2-333b-4c42-8a7f-2133023d6c48/mount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.244.0.6,local_lock=none,addr=20.150.29.168)

Update driver version quickly by editing driver deployment directly

  • update controller deployment
kubectl edit deployment csi-blob-controller -n kube-system
  • update daemonset deployment
kubectl edit ds csi-blob-node -n kube-system

change below deployment config, e.g.

        image: mcr.microsoft.com/k8s/csi/blob-csi:v1.4.0
        imagePullPolicy: Always

get blobfuse driver version on the node

blobfuse2 -v
blobfuse2 version 2.1.1

check blobfuse mount on the agent node

mount | grep blobfuse | uniq
  • Troubleshooting blobfuse mount failure on the agent node

troubleshooting connection failure on agent node

  • blobfuse

To check if blobfuse mount would work on the agent node, run the following commands to verify that the storage account name, key, and container name are correct. If any of these are incorrect, the blobfuse mount will fail:

mkdir test
export AZURE_STORAGE_ACCOUNT=
export AZURE_STORAGE_ACCESS_KEY=
# only for sovereign cloud
# export AZURE_STORAGE_BLOB_ENDPOINT=accountname.blob.core.chinacloudapi.cn
blobfuse2 test --container-name=CONTAINER-NAME --tmp-path=/tmp/blobfuse -o allow_other --file-cache-timeout-in-seconds=120

You can find more detailed information about environment variables at https://github.com/Azure/azure-storage-fuse#environment-variables.

  • NFSv3
mkdir /tmp/test
mount -v -t nfs -o sec=sys,vers=3,nolock accountname.blob.core.windows.net:/accountname/container-name /tmp/test
Get client-side logs on AKS Linux node if there is mount error
# get ama-logs pod which is running on the AKS Linux node
kubectl get po -n kube-system -o wide | grep ama-logs
# get blobfuse logs
kubectl -n kube-system cp ama-logs-xxxx:/var/log/blobfuse.log /tmp/blobfuse.log
# get blobfuse2 logs
kubectl -n kube-system cp ama-logs-xxxx:/var/log/blobfuse2.log /tmp/blobfuse2.log
Get client-side logs on Linux node if there is mount error
kubectl debug node/node-name --image=nginx
# get blobfuse logs
kubectl cp node-debugger-node-name-xxxx:/host/var/log/blobfuse.log /tmp/blobfuse.log
# get blobfuse2 logs
kubectl cp node-debugger-node-name-xxxx:/host/var/log/blobfuse2.log /tmp/blobfuse2.log
#after log collected, delete the debug pod by:
kubectl delete po node-debugger-node-name-xxxx

Troubleshooting aznfs mount

Supported from v1.22.2 About aznfs mount helper: https://github.com/Azure/AZNFS-mount/

Check mount point information
kubectl debug node/node-name --image=nginx
findmnt -t nfs

The SOURCE of the mount point should have prefix with an ip address ranther than domain namon e.g, 10.161.100.100:/nfs02a796c105814dbebc4e/pvc-ca149059-6872-4d6f-a806-48402648110c.

Get client-side logs on Linux node
kubectl debug node/node-name --image=nginx

cat /opt/microsoft/aznfs/data/aznfs.log

If ip was migrated successfully, you should find logs like:

  1. IP for nfsxxxxx.blob.core.windows.net changed [1.2.3.4 -> 5.6.7.8].
  2. Updating mountmap entry [nfsxxxxx.blob.core.windows.net 10.161.100.100 1.2.3.4 -> nfsxxxxx.blob.core.windows.net 10.161.100.100 5.6.7.8]

Tips