Nvidia DaemonSet seems to be broken on GKE cluster v1.23 #4064
Comments
I was debugging and created a new cluster. I ran into this new error:
GKE started enforcing containerd instead of Docker as of v1.23 (not v1.24, as we initially thought).
Change the default node pool node image from ubuntu (https://cloud.google.com/kubernetes-engine/docs/concepts/node-images) to ubuntu_containerd (https://cloud.google.com/kubernetes-engine/docs/concepts/using-containerd)
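For anyone following along, here is a rough sketch of what that image change could look like when creating a cluster from a script. It only builds the `gcloud container clusters create` argument list and does not run it. The cluster name and the helper function are hypothetical and not taken from the CodaLab setup; `--image-type` and `--cluster-version` are standard gcloud flags, but check the values against your gcloud version.

```python
import shlex


def cluster_create_command(cluster_name, image_type="UBUNTU_CONTAINERD",
                           cluster_version="1.23"):
    """Build (but do not run) a gcloud command that creates a GKE cluster
    whose default node pool uses a containerd-based node image.

    cluster_name is a placeholder; this helper is illustrative only.
    """
    return [
        "gcloud", "container", "clusters", "create", cluster_name,
        "--cluster-version", cluster_version,
        "--image-type", image_type,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(cluster_create_command("my-gpu-cluster")))
```

Printing the command instead of executing it keeps the sketch safe to run without GCP credentials.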
It turns out that starting from GKE v1.23, the default container runtime for the nodes is containerd, not Docker.
We need to create a separate Nvidia GPU setup for Pods vs. Docker containers.
@teetone Please look into this issue and update/close as per today's discussion.
Waiting for @epicfaace's change before investigating this further.
We will fix #3975 first and see if that resolves this issue. It likely will, because we probably won't be using NVIDIA DaemonSets anymore.
Can we check if this is still an issue and close it?
The DaemonSet seems to fail when I start a GKE cluster on GCP and specify version 1.23.
https://github.com/codalab/codalab-worksheets/tree/master/docs/gcp