Announcement: NVIDIA 535 series drivers will be backported to EKS optimized Accelerated AMIs with older Kubernetes versions #1448

Closed
ptailor1193 opened this issue Oct 2, 2023 · 11 comments
Labels: enhancement, gpu

Comments

@ptailor1193

With Kubernetes version 1.28 or later, the EKS optimized Accelerated AMIs support NVIDIA 535 series or later drivers out of the box. We plan to backport these drivers to older Kubernetes versions, starting with 1.27 on October 10th, 2023.

@cartermckinnon
Member

cartermckinnon commented Oct 3, 2023

⚠️ Note that this is a breaking change!

As noted in the 1.28 launch notes, the 535 series drivers are not compatible with the older chipsets used in the p2 instance family. This change is necessary to support the latest-and-greatest hardware in the p5 instance family. Instances in the p3 and p4 families will not be impacted by this change.
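
For anyone auditing node groups ahead of the backport, the compatibility note above could be checked with something like the following. This is only a minimal sketch: the helper names and the sample list of instance types are hypothetical, and the set of incompatible families contains only p2 because that is the only family named in this thread.

```python
# Hypothetical helper for illustration; not part of the eks-ami tooling.
# Assumption: p2 is the only incompatible family, as stated in this thread.
INCOMPATIBLE_FAMILIES = {"p2"}


def instance_family(instance_type: str) -> str:
    """Return the family prefix of an EC2 instance type, e.g. 'p2' for 'p2.xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def incompatible_with_535(instance_types):
    """Return the instance types whose family cannot use the 535 series drivers."""
    return [t for t in instance_types if instance_family(t) in INCOMPATIBLE_FAMILIES]


# Sample input; replace with the instance types used by your own node groups.
node_group_types = ["p2.xlarge", "p3.2xlarge", "p4d.24xlarge", "p5.48xlarge"]
print(incompatible_with_535(node_group_types))
```

Any instance type the sketch returns would need to stay on an older AMI or move to a newer instance family before picking up the backported drivers.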

@tom-dixon-fiveai

Hiya!

Will you also be backporting the 5.10 Linux Kernel with this?

Can I ask how many EKS versions you're going to go back as well please?

@cartermckinnon
Member

> Will you also be backporting the 5.10 Linux Kernel with this?

Yep! The older NVIDIA drivers are the only thing keeping us on 5.4.

> Can I ask how many EKS versions you're going to go back as well please?

We intend to make this change in 1.25+.

@tom-dixon-fiveai

Awesome, thanks very much! :D

@tom-dixon-fiveai

Ah sorry, one more question: is there an ETA/schedule at all for the 1.25 version?

@sidewinder12s

Is the GPU AMI build process going to be exposed more in this repo with this change, or is that not changing?

@tom-dixon-fiveai

Hello again! :)

@cartermckinnon do you know when this might happen, or is there any update on this, please?

@willgleich
Contributor

willgleich commented Oct 13, 2023

@cartermckinnon I didn't see an eks-ami release on October 10th. Has the 1.27 backport been released?

Any timeline for 1.26?

@cartermckinnon changed the title from "Announcement: NVIDIA 535 series drivers will be backported to EKS optimized Accelerated AMIs with older Kubernetes versions starting with 1.27 on October 10th, 2023" to "Announcement: NVIDIA 535 series drivers will be backported to EKS optimized Accelerated AMIs with older Kubernetes versions" on Oct 14, 2023
@cartermckinnon
Member

> I didn't see an eks-ami release on October 10th

A recent kernel change (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9011e49d54dcc7653ebb8a1e05b5badb5ecfa9f9) makes our current combination of NVIDIA and EFA drivers incompatible. We expect to have a path forward shortly, but we have to pause our backports in the meantime.

> Is the GPU AMI build process planned to be exposed more in this repo with this change or is that not changing?

Yes, we plan to upstream the NVIDIA-related scripts.

@cartermckinnon
Member

The next AMI release will extend the 535-series NVIDIA driver and CUDA 12 to Kubernetes versions 1.25 and above.

@ptailor1193
Author

NVIDIA 535 series drivers have now been backported to the EKS optimized Accelerated AMIs for Kubernetes versions 1.25 and above.
