Add tolerations for taints on NVIDIA specific node groups #5345

Merged
Skarlso merged 6 commits into main from nvidia_schedule
Jun 1, 2022

Skarlso
Contributor

@Skarlso Skarlso commented May 30, 2022

Description

Closes #5277

I believe this will do the trick, but I have to manually test it first. Is this right, @cPu1? I'm a bit unfamiliar with this part of the code. :)

TODO:

  • Manual run

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟
@Skarlso Skarlso added the kind/feature New feature or request label May 30, 2022
@Skarlso Skarlso force-pushed the nvidia_schedule branch from 83d4212 to f0bd8eb Compare May 30, 2022 14:52
@Skarlso Skarlso marked this pull request as draft May 30, 2022 14:54
@Skarlso Skarlso force-pushed the nvidia_schedule branch from f0bd8eb to a92b4d1 Compare May 30, 2022 15:05
@Skarlso Skarlso force-pushed the nvidia_schedule branch from a92b4d1 to fe09d5a Compare May 30, 2022 15:15
@Skarlso Skarlso marked this pull request as ready for review May 30, 2022 19:12
@Skarlso
Contributor Author

Skarlso commented May 30, 2022

With the following config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gb-test-cluster-1
  region: us-west-2
  version: '1.22'

nodeGroups:
  - name: ng-1
    minSize: 1
    maxSize: 2
    desiredCapacity: 1
    instanceType: p2.xlarge
    taints:
      feaster: "true:NoSchedule" 

With this config, the nvidia device plugin pod ended up with the following tolerations:

Tolerations:                 CriticalAddonsOnly op=Exists
                             feaster=true
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists

And the pod is successfully scheduled on the node:

kube-system   nvidia-device-plugin-daemonset-c4ln5   1/1     Running   0          11m
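
For readers unfamiliar with this part of the code, here is a minimal Go sketch of the idea, not the code from this PR. It assumes nodegroup taints arrive as a `key: "value:effect"` map, as in the config above, and that each one becomes a toleration carrying only the key and value, which is what the `feaster=true` line in the output suggests. The `Toleration` type and the `tolerationsForTaints` function are hypothetical names invented for this illustration; real code would use the Kubernetes `corev1.Toleration` type.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Toleration is a hypothetical, simplified stand-in for corev1.Toleration,
// used so that this sketch has no external dependencies.
type Toleration struct {
	Key      string
	Operator string // "Equal" or "Exists"
	Value    string
}

// tolerationsForTaints turns nodegroup taints written as key -> "value:effect"
// into one toleration per taint. Assumption: the effect is dropped, so the
// toleration matches the taint key (and value, if any) for every effect.
func tolerationsForTaints(taints map[string]string) []Toleration {
	// Sort the keys so the result is deterministic.
	keys := make([]string, 0, len(taints))
	for key := range taints {
		keys = append(keys, key)
	}
	sort.Strings(keys)

	tolerations := make([]Toleration, 0, len(keys))
	for _, key := range keys {
		// Everything before the last ':' is the taint value. Assumption: a
		// string without ':' is treated as an effect with an empty value.
		value := ""
		if i := strings.LastIndex(taints[key], ":"); i >= 0 {
			value = taints[key][:i]
		}
		t := Toleration{Key: key, Operator: "Exists"}
		if value != "" {
			t = Toleration{Key: key, Operator: "Equal", Value: value}
		}
		tolerations = append(tolerations, t)
	}
	return tolerations
}

// String renders a toleration roughly the way `kubectl describe` lists it.
func (t Toleration) String() string {
	if t.Operator == "Exists" {
		return t.Key + " op=Exists"
	}
	return t.Key + "=" + t.Value
}

func main() {
	taints := map[string]string{"feaster": "true:NoSchedule"}
	for _, t := range tolerationsForTaints(taints) {
		fmt.Println(t)
	}
}
```

Run with the taint from the config above, the sketch prints `feaster=true`, the same shape as the extra toleration listed on the pod.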

@cPu1
Contributor

cPu1 commented May 31, 2022

> Is this right, @cPu1?

Yes, the approach does look right to me 🙂

@Skarlso Skarlso requested a review from cPu1 May 31, 2022 08:28
Contributor

@Himangini Himangini left a comment

> Extend unit tests

I can't see this in the files diff, where is this added? 🤔

@Skarlso
Contributor Author

Skarlso commented Jun 1, 2022

Right, that didn't happen because there aren't any. :D

Contributor

@Himangini Himangini left a comment

👍🏻
At some point, we should add some tickets to improve testing around this and taints in general 💡

@Skarlso Skarlso merged commit dde684a into main Jun 1, 2022
@Skarlso Skarlso deleted the nvidia_schedule branch June 1, 2022 15:54
Successfully merging this pull request may close these issues.

[Feature] nvidia device plugin not scheduled on GPU nodes if they have additional taints
3 participants