Description
What were you trying to accomplish?
I'm trying to set up an unmanaged GPU nodegroup with some additional taints, and have eksctl automatically install the nvidia device plugin daemonset correctly.
What happened?
Because I have additional taints on my nodegroup and the nvidia-device-plugin daemonset doesn't have tolerations for them, it's never scheduled on the GPU nodegroup.
How to reproduce it?
Here's the relevant nodegroup definition:
{
  "availabilityZones": [
    "us-west-2b"
  ],
  "desiredCapacity": 0,
  "iam": {
    "withAddonPolicies": {
      "autoScaler": true
    }
  },
  "instanceType": "p2.xlarge",
  "labels": {
    "hub.jupyter.org/node-purpose": "user",
    "k8s.dask.org/node-purpose": "scheduler",
    "node.kubernetes.io/instance-type": "p2.xlarge"
  },
  "maxSize": 500,
  "minSize": 0,
  "name": "nb-p2-xlarge",
  "ssh": {
    "publicKeyPath": "ssh-keys/uwhackweeks.key.pub"
  },
  "tags": {
    "k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose": "user",
    "k8s.io/cluster-autoscaler/node-template/label/k8s.dask.org/node-purpose": "scheduler",
    "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type": "p2.xlarge",
    "k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated": "user:NoSchedule",
    "k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org_dedicated": "user:NoSchedule"
  },
  "taints": {
    "hub.jupyter.org/dedicated": "user:NoSchedule",
    "hub.jupyter.org_dedicated": "user:NoSchedule"
  },
  "volumeSize": 80
},
The created nvidia-device-plugin daemonset only has the following tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
So it never gets scheduled on my GPU nodes.
eksctl should recognize additional taints on the nodegroup and automatically add matching tolerations to the nvidia-device-plugin daemonset it auto-installs.
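To make the request concrete, here is a rough Python sketch of the mapping I have in mind. It is not eksctl's implementation. The function name `taints_to_tolerations` is hypothetical, and the sketch assumes taints are written as "value:effect" strings as in my nodegroup definition above. It builds a merge-patch body that keeps the two existing tolerations and appends one per nodegroup taint.

```python
import json


def taints_to_tolerations(taints):
    """Turn a {"key": "value:effect"} taints mapping into toleration dicts.

    Hypothetical helper for illustration only; it assumes the taint string
    is either "value:effect" or just "effect".
    """
    tolerations = []
    for key, spec in sorted(taints.items()):
        # "user:NoSchedule" -> ("user", ":", "NoSchedule")
        # "NoSchedule"      -> ("", "", "NoSchedule")
        value, _, effect = spec.rpartition(":")
        toleration = {"key": key, "effect": effect}
        if value:
            toleration["operator"] = "Equal"
            toleration["value"] = value
        else:
            toleration["operator"] = "Exists"
        tolerations.append(toleration)
    return tolerations


# Tolerations currently present on the daemonset (copied from this report).
existing = [
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

# Taints from the nodegroup definition in this report.
nodegroup_taints = {
    "hub.jupyter.org/dedicated": "user:NoSchedule",
    "hub.jupyter.org_dedicated": "user:NoSchedule",
}

merged = existing + taints_to_tolerations(nodegroup_taints)

# Body for a merge patch against the daemonset's pod template.
patch = {"spec": {"template": {"spec": {"tolerations": merged}}}}
print(json.dumps(patch, indent=2))
```

As a manual workaround until something like this is built in, the printed JSON could probably be passed to `kubectl patch daemonset <name> -n <namespace> --type merge -p '<json>'`. The name and namespace are placeholders for whatever eksctl created in the cluster, and a later eksctl run may overwrite the change.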
Logs
Anything else we need to know?
eksctl is awesome and the automatic GPU driver installation is a killer idea.
Versions
$ eksctl info
eksctl version: 0.97.0
kubectl version: v1.23.5
OS: darwin