
Add tolerations for taints on NVIDIA specific node groups #5345

Merged
Skarlso merged 6 commits into main from nvidia_schedule on Jun 1, 2022

Skarlso (Contributor) commented May 30, 2022

Description

Closes #5277

I believe this will do the trick, but I have to manually test it first. Is this right, @cPu1? I'm a bit unfamiliar with this part of the code. :)

TODO:

  • Manual run

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@Skarlso Skarlso added the kind/feature New feature or request label May 30, 2022
@Skarlso Skarlso force-pushed the nvidia_schedule branch from 83d4212 to f0bd8eb Compare May 30, 2022 14:52
@Skarlso Skarlso marked this pull request as draft May 30, 2022 14:54
@Skarlso Skarlso force-pushed the nvidia_schedule branch from f0bd8eb to a92b4d1 Compare May 30, 2022 15:05
@Skarlso Skarlso force-pushed the nvidia_schedule branch from a92b4d1 to fe09d5a Compare May 30, 2022 15:15
@Skarlso Skarlso marked this pull request as ready for review May 30, 2022 19:12
Skarlso (Contributor, Author) commented May 30, 2022

With the following config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gb-test-cluster-1
  region: us-west-2
  version: '1.22'

nodeGroups:
  - name: ng-1
    minSize: 1
    maxSize: 2
    desiredCapacity: 1
    instanceType: p2.xlarge
    taints:
      feaster: "true:NoSchedule" 

This produced the following tolerations on the NVIDIA device plugin pod:

Tolerations:                 CriticalAddonsOnly op=Exists
                             feaster=true
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists

And the pod is successfully scheduled on the node:

kube-system   nvidia-device-plugin-daemonset-c4ln5   1/1     Running   0          11m
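For readers who want the idea in code form, here is a minimal sketch of turning a node group's taints into tolerations for the device plugin pod. It is not the code merged in this PR: `Toleration` is a trimmed-down stand-in for the Kubernetes `corev1.Toleration` type, and `tolerationsFromTaints` is a hypothetical helper name. It assumes taints arrive in the `key: "value:Effect"` shorthand from the config above, and it leaves the effect unset on the toleration (an unset effect tolerates every effect), which is an assumption based on the `feaster=true` line in the output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Toleration is a trimmed-down, hypothetical stand-in for the Kubernetes
// corev1.Toleration type, so this sketch needs no external modules.
type Toleration struct {
	Key      string
	Operator string
	Value    string
}

// tolerationsFromTaints is a hypothetical helper, not eksctl's API.
// It assumes taints use the shorthand form key -> "value:Effect" and
// returns one toleration per taint, sorted by key.
func tolerationsFromTaints(taints map[string]string) []Toleration {
	keys := make([]string, 0, len(taints))
	for k := range taints {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output

	out := make([]Toleration, 0, len(keys))
	for _, k := range keys {
		value := taints[k]
		// Drop the trailing ":Effect" part, if present.
		if i := strings.LastIndex(value, ":"); i >= 0 {
			value = value[:i]
		}
		t := Toleration{Key: k, Operator: "Equal", Value: value}
		if value == "" {
			// A taint without a value can only be matched with Exists.
			t.Operator = "Exists"
		}
		out = append(out, t)
	}
	return out
}

func main() {
	ngTaints := map[string]string{"feaster": "true:NoSchedule"}
	for _, t := range tolerationsFromTaints(ngTaints) {
		fmt.Printf("%s=%s (operator %s)\n", t.Key, t.Value, t.Operator)
	}
}
```

In a real implementation the resulting tolerations would be appended to the ones already present in the daemonset's pod spec, so that defaults such as `nvidia.com/gpu:NoSchedule` are kept.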

cPu1 (Contributor) commented May 31, 2022

> Is this right, @cPu1?

Yes, the approach does look right to me 🙂

@Skarlso Skarlso requested a review from cPu1 May 31, 2022 08:28
Himangini (Contributor) left a comment

> Extend unit tests

I can't see this in the files diff, where is this added? 🤔

Skarlso (Contributor, Author) commented Jun 1, 2022

Right, that didn't happen because there aren't any. :D

Himangini (Contributor) left a comment

👍🏻
At some point, we should add some tickets to improve testing around this and taints in general 💡

@Skarlso Skarlso merged commit dde684a into main Jun 1, 2022
@Skarlso Skarlso deleted the nvidia_schedule branch June 1, 2022 15:54

Development

Successfully merging this pull request may close these issues.

[Feature] nvidia device plugin not scheduled on GPU nodes if they have additional taints
