[Feature] nvidia device plugin not scheduled on GPU nodes if they have additional taints #5277

Closed
@yuvipanda

Description

What were you trying to accomplish?

I'm trying to set up an unmanaged GPU nodegroup with some additional taints, and have eksctl automatically install the nvidia device plugin daemonset so that it runs on those nodes.

What happened?

Because my nodegroup has additional taints and the nvidia-device-plugin daemonset has no tolerations for them, its pods are never scheduled on the GPU nodegroup.

How to reproduce it?

Here's the relevant nodegroup definition:

      {
         "availabilityZones": [
            "us-west-2b"
         ],
         "desiredCapacity": 0,
         "iam": {
            "withAddonPolicies": {
               "autoScaler": true
            }
         },
         "instanceType": "p2.xlarge",
         "labels": {
            "hub.jupyter.org/node-purpose": "user",
            "k8s.dask.org/node-purpose": "scheduler",
            "node.kubernetes.io/instance-type": "p2.xlarge"
         },
         "maxSize": 500,
         "minSize": 0,
         "name": "nb-p2-xlarge",
         "ssh": {
            "publicKeyPath": "ssh-keys/uwhackweeks.key.pub"
         },
         "tags": {
            "k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose": "user",
            "k8s.io/cluster-autoscaler/node-template/label/k8s.dask.org/node-purpose": "scheduler",
            "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type": "p2.xlarge",
            "k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated": "user:NoSchedule",
            "k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org_dedicated": "user:NoSchedule"
         },
         "taints": {
            "hub.jupyter.org/dedicated": "user:NoSchedule",
            "hub.jupyter.org_dedicated": "user:NoSchedule"
         },
         "volumeSize": 80
      },

The created nvidia-device-plugin daemonset only has the following tolerations:

      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

So it never gets scheduled on my GPU nodes.
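
To illustrate why the pods stay unscheduled, here is a minimal Python sketch of the Kubernetes toleration matching rule applied to the taints and tolerations above. It is an approximation for illustration, not scheduler or eksctl code; the function names `tolerates` and `untolerated` are hypothetical, and the taints are written out in key/value/effect form by hand.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty key on the toleration matches every taint key.
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # "Exists" ignores the value; the default operator "Equal" compares it.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Return the taints that no toleration in the list matches."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Taints from the nodegroup definition above ("value:Effect" split by hand).
node_taints = [
    {"key": "hub.jupyter.org/dedicated", "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "value": "user", "effect": "NoSchedule"},
]

# Tolerations found on the generated nvidia-device-plugin daemonset.
plugin_tolerations = [
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

for taint in untolerated(node_taints, plugin_tolerations):
    print("not tolerated:", taint["key"])
```

Both `hub.jupyter.org` taints are reported as not tolerated, and a single untolerated `NoSchedule` taint is enough to keep a pod off the node.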

eksctl should recognize additional taints on the nodegroup and automatically add matching tolerations to the nvidia-device-plugin daemonset it auto-installs.
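
As a rough sketch of the requested behavior, the Python below derives one toleration per nodegroup taint from the `"key": "value:Effect"` map used in the definition above. This is not how eksctl is implemented (eksctl is written in Go); the function name `tolerations_for` is hypothetical, and using `operator: Exists` for every derived toleration is an assumption chosen to mirror the style of the daemonset's existing tolerations.

```python
def tolerations_for(taints):
    """Build one toleration per taint from a {"key": "value:Effect"} map."""
    tolerations = []
    for key, spec in taints.items():
        # Split on the last ":" so the effect is always the final segment.
        # A spec without ":" is treated as an effect with no value (assumption).
        _value, _sep, effect = spec.rpartition(":")
        tolerations.append({"key": key, "operator": "Exists", "effect": effect})
    return tolerations


# Taints copied from the nodegroup definition in this issue.
nodegroup_taints = {
    "hub.jupyter.org/dedicated": "user:NoSchedule",
    "hub.jupyter.org_dedicated": "user:NoSchedule",
}

extra_tolerations = tolerations_for(nodegroup_taints)
for toleration in extra_tolerations:
    print(toleration)
```

Appending `extra_tolerations` to the two tolerations the daemonset already carries would let its pods land on nodes with these taints. `Exists` tolerates any value for the key; a stricter variant could emit `operator: Equal` with the parsed value instead.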

Logs

Anything else we need to know?

eksctl is awesome and the automatic GPU driver installation is a killer idea.

Versions

$ eksctl info
eksctl version: 0.97.0
kubectl version: v1.23.5
OS: darwin
