Nvidia DaemonSet seems to be broken on GKE cluster v1.23 #4064

Open
teetone opened this issue Apr 12, 2022 · 8 comments

@teetone
Collaborator

teetone commented Apr 12, 2022

The NVIDIA DaemonSet seems to fail when I create a GKE cluster on GCP and specify version 1.23.

https://github.com/codalab/codalab-worksheets/tree/master/docs/gcp

@teetone teetone added the p1 Do it in the next two weeks. label Apr 12, 2022
@teetone teetone self-assigned this Apr 12, 2022
@teetone
Collaborator Author

teetone commented May 17, 2022

I was debugging and created a new cluster. I ran into this new error:

WARNING: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
ERROR: (gcloud.container.clusters.create) ResponseError: code=400, message=Creation of node pools using node images based on Docker container runtimes is not supported in GKE v1.23. This is to prepare for the removal of Dockershim in Kubernetes v1.24. We recommend that you migrate to image types based on Containerd (examples). For more information, contact Cloud Support.

GKE recently started enforcing containerd instead of Docker as of v1.23 (not v1.24, as we initially thought).
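
The rule in the error message can be sketched as a small pre-flight check. This is only an illustration: the function names are hypothetical, and treating any image type ending in `_CONTAINERD` as containerd-based is an assumption drawn from GKE's naming (for example `UBUNTU_CONTAINERD`), not an official API.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.23.5-gke.2400' or '1.23'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def uses_containerd(image_type: str) -> bool:
    """Assumption: containerd-based image types end in '_CONTAINERD'."""
    return image_type.upper().endswith("_CONTAINERD")


def image_type_allowed(cluster_version: str, image_type: str) -> bool:
    """Mirror the rule from the error: Docker-based images are rejected from v1.23 on."""
    return parse_major_minor(cluster_version) < (1, 23) or uses_containerd(image_type)


if __name__ == "__main__":
    for version, image in [("1.22.8-gke.200", "UBUNTU"),
                           ("v1.23.5-gke.2400", "UBUNTU"),
                           ("v1.23.5-gke.2400", "UBUNTU_CONTAINERD")]:
        print(version, image, image_type_allowed(version, image))
```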

@teetone teetone closed this as completed May 17, 2022
@epicfaace
Member

Change the default node pool node image from ubuntu (https://cloud.google.com/kubernetes-engine/docs/concepts/node-images) to ubuntu_containerd (https://cloud.google.com/kubernetes-engine/docs/concepts/using-containerd)
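
A rough sketch of what that change could look like when the cluster is created with gcloud. The helper name and the cluster name are placeholders, and the project's actual setup scripts may pass different or additional flags (zone, machine type, and so on); only `--cluster-version` and `--image-type` are shown here.

```python
import shlex
from typing import List


def build_create_command(cluster_name: str,
                         cluster_version: str = "1.23",
                         image_type: str = "UBUNTU_CONTAINERD") -> List[str]:
    """Build (but do not run) a gcloud command that requests a containerd node image."""
    return [
        "gcloud", "container", "clusters", "create", cluster_name,
        f"--cluster-version={cluster_version}",
        f"--image-type={image_type}",  # was UBUNTU (Docker-based) before
    ]


if __name__ == "__main__":
    # "my-cluster" is a placeholder name, not the project's real cluster.
    print(shlex.join(build_create_command("my-cluster")))
```

The command is only printed so it can be reviewed before it is run by hand or passed to `subprocess.run`.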

@epicfaace epicfaace reopened this May 17, 2022
@epicfaace
Member

  • Try creating a new cluster
  • See if Docker is still available on the nodes

@teetone
Collaborator Author

teetone commented May 23, 2022

It turns out that starting from GKE v1.23, the default container runtime for the nodes is containerd, not Docker.

Output of kubectl get nodes -o wide:

NAME                                         STATUS   ROLES    AGE     VERSION            INTERNAL-IP   EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
gke-cluster-usw-default-pool-87e27804-hlp3   Ready    <none>   3h11m   v1.23.5-gke.2400   10.138.0.50   34.83.111.51   Ubuntu 20.04.4 LTS   5.4.0-1067-gke   containerd://1.5.2
gke-cluster-uswest-gpu-pool-f3a4ca3a-nbwd    Ready    <none>   56m     v1.23.5-gke.2400   10.138.0.70   34.82.19.102   Ubuntu 20.04.4 LTS   5.4.0-1067-gke   containerd://1.5.2

We need to create a separate Nvidia GPU setup for Pods vs. Docker containers.
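
The runtime check above can also be scripted. The sketch below parses the text output of kubectl get nodes -o wide and reports each node's runtime; it assumes CONTAINER-RUNTIME is the last column and contains no spaces, which holds for the output shown here but is not guaranteed in general. The function names are made up for illustration.

```python
from typing import Dict, List

# Sample copied from the output above.
SAMPLE = """\
NAME                                         STATUS   ROLES    AGE     VERSION            INTERNAL-IP   EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
gke-cluster-usw-default-pool-87e27804-hlp3   Ready    <none>   3h11m   v1.23.5-gke.2400   10.138.0.50   34.83.111.51   Ubuntu 20.04.4 LTS   5.4.0-1067-gke   containerd://1.5.2
gke-cluster-uswest-gpu-pool-f3a4ca3a-nbwd    Ready    <none>   56m     v1.23.5-gke.2400   10.138.0.70   34.82.19.102   Ubuntu 20.04.4 LTS   5.4.0-1067-gke   containerd://1.5.2
"""


def container_runtimes(wide_output: str) -> Dict[str, str]:
    """Map node name -> container runtime from `kubectl get nodes -o wide` text."""
    lines = [line for line in wide_output.splitlines() if line.strip()]
    runtimes = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        runtimes[fields[0]] = fields[-1]  # first column is NAME, last is the runtime
    return runtimes


def docker_nodes(wide_output: str) -> List[str]:
    """Return the nodes whose runtime string starts with 'docker://'."""
    return [name for name, runtime in container_runtimes(wide_output).items()
            if runtime.startswith("docker://")]


if __name__ == "__main__":
    for name, runtime in container_runtimes(SAMPLE).items():
        print(f"{name}: {runtime}")
    print("Docker nodes:", docker_nodes(SAMPLE))
```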

@pranavjain
Contributor

@teetone Please look into this issue and update/close as per today's discussion.

@pranavjain
Contributor

Waiting for @epicfaace's change before investigating this further.

@epicfaace
Member

We will fix #3975 first and then check whether this issue still occurs. That fix is likely to resolve it, because we probably won't be using NVIDIA DaemonSets anymore.

@teetone teetone removed their assignment Dec 25, 2022
@percyliang
Collaborator

Can we check whether this is still an issue and, if not, close it?

Projects
Status: Blocked