kubelet.yaml set cgroupDriver: systemd instead of cgroupDriver: cgroupfs in AL2-GPU instances #3005

@DanielAmmar

What happened?
I launched an unmanaged node group with p3.2xlarge GPU instances (ami-0f23f1b20f58cc97f), but the kubelet failed to start:

systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-eksclt.al2.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2020-12-30 14:16:36 UTC; 4s ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 22376 ExecStart=/usr/bin/kubelet --node-ip=${NODE_IP} --node-labels=${NODE_LABELS},alpha.eksctl.io/instance-id=${INSTANCE_ID} --max-pods=${MAX_PODS} --register-node=true --register-with-taints=${NODE_TAINTS} --cloud-provider=aws --container-runtime=docker --network-plugin=cni --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --pod-infra-container-image=${AWS_EKS_ECR_ACCOUNT}.dkr.ecr.${AWS_DEFAULT_REGION}.${AWS_SERVICES_DOMAIN}/eks/pause:3.3-eksbuild.1 --kubeconfig=/etc/eksctl/kubeconfig.yaml --config=/etc/eksctl/kubelet.yaml (code=exited, status=255)
  Process: 22365 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
 Main PID: 22376 (code=exited, status=255)

error message:
failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"

cat /etc/eksctl/kubelet.yaml shows cgroupDriver: systemd, but I suspect it should be cgroupDriver: cgroupfs.

Docker's cgroup driver on the Amazon Linux 2 GPU AMI is set to "cgroupfs" (vs. "systemd" on the non-GPU AMIs).
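
A minimal sketch for confirming the mismatch on a node, assuming Docker is the container runtime and that cgroupDriver appears as a top-level key in the kubelet config file. The function names are hypothetical helpers, not part of eksctl; docker info --format '{{.CgroupDriver}}' is the only external call.

```shell
#!/usr/bin/env bash

# Print the cgroupDriver value from a kubelet config file.
# Assumes a plain top-level "cgroupDriver: <value>" line.
kubelet_cgroup_driver() {
  awk '$1 == "cgroupDriver:" { print $2; exit }' "$1"
}

# Print the cgroup driver Docker reports for itself.
docker_cgroup_driver() {
  docker info --format '{{.CgroupDriver}}'
}

# Compare two driver names; return non-zero on a mismatch.
compare_cgroup_drivers() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: kubelet=$1 docker=$2"
    return 1
  fi
}
```

On the node this could be run as compare_cgroup_drivers "$(kubelet_cgroup_driver /etc/eksctl/kubelet.yaml)" "$(docker_cgroup_driver)". Note that a --cgroup-driver flag on the kubelet command line would take precedence over the file, which this sketch does not check.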

How to reproduce it?
Launch a GPU node group via eksctl v0.35.0.

Anything else we need to know?
What OS are you using, are you using a downloaded binary or did you compile eksctl, what type of AWS credentials are you using (i.e. default/named profile, MFA) - please don't include actual credentials though!

Versions

$ eksctl version
0.35.0

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Additional info
I also tried an older GPU AMI ("ami-0969f51a73874a795"), and also leaving the AMI unset, with the same result.
When I manually changed /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
to include --cgroup-driver=cgroupfs and restarted the service, the node registered successfully with my cluster.
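
As a variation on the workaround above, the same effect might be achieved by changing the value in the kubelet config file rather than adding the flag to the drop-in. This is an untested sketch: set_kubelet_cgroup_driver is a hypothetical helper, and it assumes cgroupDriver is a top-level key in /etc/eksctl/kubelet.yaml and that nothing regenerates the file afterwards.

```shell
#!/usr/bin/env bash

# Rewrite the top-level cgroupDriver line of a kubelet config file.
# Usage: set_kubelet_cgroup_driver <config-file> <driver>
set_kubelet_cgroup_driver() {
  local file="$1" driver="$2"
  # Write to a temporary file first so a failed sed leaves the original intact.
  sed "s/^cgroupDriver:.*/cgroupDriver: ${driver}/" "$file" > "${file}.tmp" \
    && mv "${file}.tmp" "$file"
}
```

Run as root on the affected node, this would look like set_kubelet_cgroup_driver /etc/eksctl/kubelet.yaml cgroupfs followed by systemctl restart kubelet. Either way it is only a stopgap for a single node; new nodes in the group would still come up with the original config.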
