Latest NVIDIA Container Runtime Support not working anymore with K3S #8248

Closed
jpabbuehl opened this issue Aug 26, 2023 · 11 comments


jpabbuehl commented Aug 26, 2023

Environmental Info:

K3s Version: v1.27.4+k3s1

Node(s) CPU architecture, OS, and Version:

  • Server - Linux pop-os 6.4.6-76060406-generic #202307241739169092810522.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux
  • Agent - Linux gpu1 5.15.0-79-generic Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux with NVIDIA RTX 4090

Cluster Configuration:

1 Server, 1 agent

Describe the bug:

The NVIDIA device plugin pod is in CrashLoopBackOff and is unable to detect the GPU.
The documentation for enabling GPU workloads at https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support no longer works when using the latest NVIDIA drivers (535) and NVIDIA Container Toolkit (1.13.5).

Steps To Reproduce:

  1. Install the latest NVIDIA drivers on the agent (with the RTX 4090 GPU) per the NVIDIA CUDA documentation: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#common-installation-instructions-for-ubuntu

Note: I tried the installation both with and without base, because I wasn't sure how to proceed regarding CDI support in K3s.

nvidia-smi
Sat Aug 26 07:42:01 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:04:00.0 Off |                  Off |
|  0%   39C    P8              37W / 450W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  2. Install the NVIDIA Container Toolkit per the NVIDIA documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#step-1-install-nvidia-container-toolkit
nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.13.5
commit: 6b8589dcb4dead72ab64f14a5912886e6165c079
  3. K3s picks up the NVIDIA container runtime automatically.

Note: I have restarted k3s-agent just in case
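On a default systemd-based agent install, that restart is roughly:

sudo systemctl restart k3s-agent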

sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/dc43f496a0a9ac19d3b2444d390db38e0cfb38e672721f838b075422b8734994/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  4. Testing containerd with the nvidia runtime directly on the agent (over SSH) succeeds:
sudo ctr image pull docker.io/nvidia/cuda:12.1.1-base-ubuntu22.04
sudo ctr run --rm -t     --runc-binary=/usr/bin/nvidia-container-runtime     --env NVIDIA_VISIBLE_DEVICES=all     docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04     cuda-11.6.2-base-ubuntu20.04 nvidia-smi

Sat Aug 26 08:12:47 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:04:00.0 Off |                  Off |
|  0%   39C    P8              36W / 450W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  5. Back on the server (control plane), install the NVIDIA device plugin (v0.14) via Helm per the instructions at https://github.com/NVIDIA/k8s-device-plugin

Note: there are additional containerd configuration instructions there which I did not follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.1
  6. Add a label and a taint to restrict the DaemonSet to the agent with the GPU installed:
kubectl label nodes gpu1 gpu=installed
kubectl taint nodes pop-os gpu:NoSchedule
  7. The pod keeps crashing; logs show that no device is detected:
kubectl get pods -n nvidia-device-plugin -o wide
NAME                              READY   STATUS             RESTARTS        AGE   IP          NODE   NOMINATED NODE   READINESS GATES
nvdp-nvidia-device-plugin-zfkj7   0/1     CrashLoopBackOff   18 (3m7s ago)   70m   10.42.1.5   gpu1   <none>           <none> 
kubectl logs -n nvidia-device-plugin nvdp-nvidia-device-plugin-zfkj7
I0826 08:14:48.251257       1 main.go:154] Starting FS watcher.
I0826 08:14:48.251291       1 main.go:161] Starting OS watcher.
I0826 08:14:48.251529       1 main.go:176] Starting Plugins.
I0826 08:14:48.251544       1 main.go:234] Loading configuration.
I0826 08:14:48.251682       1 main.go:242] Updating config with default resource matching patterns.
I0826 08:14:48.251934       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0826 08:14:48.251947       1 main.go:256] Retreiving plugins.
W0826 08:14:48.252203       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0826 08:14:48.252244       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0826 08:14:48.252266       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0826 08:14:48.252276       1 factory.go:115] Incompatible platform detected
E0826 08:14:48.252281       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0826 08:14:48.252287       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0826 08:14:48.252292       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0826 08:14:48.252297       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0826 08:14:48.258856       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

Expected behavior:

Expected kubectl describe node gpu1 to detect the GPU and show the corresponding capacity and annotations.
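For example, the GPU should appear under the node's capacity, e.g.:

kubectl get node gpu1 -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'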

Actual behavior:

The node gpu1 does not show any GPU-related resources. I did not run the nbody-gpu-benchmark test pod, given that its resource limits request an nvidia.com/gpu that is never advertised.

Additional context / logs:

The K3s documentation for the NVIDIA runtime (https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support) describes a working solution using driver 515.

I used this approach successfully until now (with k3s v1.24, NFD v0.13, and gpu-feature-discovery), but I recently upgraded my GPU and installed the newer driver version 535 for compatibility. I also reinstalled k3s v1.27.4+k3s1 in the process.

Ideas for resolution

  1. It could be a regression caused by the latest NVIDIA driver 535, but I haven't tested that yet, given how long it would take to downgrade and verify.

  2. There are additional containerd runtime configuration instructions in the NVIDIA device plugin documentation which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
    Should I define them in config.toml.tmpl?

  3. There is now CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#step-2-generate-a-cdi-specification), but there are no instructions for containerd, let alone for k3s.
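For what it's worth, the toolkit can generate a CDI specification with nvidia-ctk (paths per the NVIDIA docs; I have not tried wiring this into k3s):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list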

I'm not sure whether this is on the K3s side or the NVIDIA side; looking forward to hearing your feedback.
Thank you in advance

Jean-Paul

@jpabbuehl jpabbuehl changed the title NVIDIA Container Runtime Support not working with K3S Latest NVIDIA Container Runtime Support not working anymore with K3S Aug 26, 2023

jmagoon commented Aug 30, 2023

I had this same issue, and I was able to fix it by applying the changes from https://github.com/NVIDIA/k8s-device-plugin#configure-containerd in a config.toml.tmpl based on the format here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go. That also involved removing the default nvidia runtime detection from the template (which could probably be brought back to fit with the correct config). Here's the diff:

root@magoon:/var/lib/rancher/k3s/agent/etc/containerd# diff config.toml.default config.toml.nvidia
27a28
>   default_runtime_name = "nvidia"
74c75,78
< [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
---
> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
>   privileged_without_host_devices = false
>   runtime_engine = ""
>   runtime_root = ""
75a80,81
> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
>   BinaryName = "/usr/bin/nvidia-container-runtime"
>   SystemdCgroup = {{ .SystemdCgroup }}
76a83,84
> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
>   runtime_type = "io.containerd.runc.v2"
112,117d119
< {{range $k, $v := .ExtraRuntimes}}
< [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}"]
<   runtime_type = "{{$v.RuntimeType}}"
< [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}".options]
<   BinaryName = "{{$v.BinaryName}}"
< {{end}}
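Rendered, the nvidia-specific parts of the resulting template look roughly like this (a sketch reconstructed from the diff above; section placement may differ slightly in your generated config):

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = {{ .SystemdCgroup }}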

I restarted k3s and I also had to delete the nvidia-device-plugin-daemonset pod:
kubectl delete pod nvidia-device-plugin-daemonset-b6lqm -n kube-system

After that it stopped showing:

I0830 23:04:19.417692       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

And logged:

I0830 23:07:20.795657       1 main.go:256] Retreiving plugins.
I0830 23:07:20.796097       1 factory.go:107] Detected NVML platform: found NVML library
I0830 23:07:20.796128       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0830 23:07:21.658884       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0830 23:07:21.659483       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0830 23:07:21.661373       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

One thing to be aware of, which I'm still checking on, is that after a reboot all of my kube-system pods started to fail with CrashLoopBackOff. I found that other people had an issue linked to the SystemdCgroup line in #5454. I confirmed that removing the nvidia config from the config.toml.tmpl file stops the CrashLoopBackOff condition, but I'm still not entirely sure why.

edit: Note, after adding the SystemdCgroup line to the nvidia runtime options section, my containers stopped crashing:

> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
>   BinaryName = "/usr/bin/nvidia-container-runtime"
>   SystemdCgroup = {{ .SystemdCgroup }}

brandond (Contributor) commented Sep 6, 2023

It sounds like the main difference here is just that we need to set SystemdCgroup in the nvidia runtime options?

Do you know which release of the nvidia container runtime started requiring this?

@matusnovak

Relevant issue: NVIDIA/k8s-device-plugin#406


kannan-scalers-ai commented Sep 28, 2023

After trying out all the suggestions from here and other issues, I got it working by following this blog https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116

@matusnovak

> After trying out all the suggestions from here and other issues, I got it working by following this blog https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116

That link gives me HTTP 404

However, I have solved the could not load NVML library: libnvidia-ml.so.1 issue by adding runtimeClassName: nvidia to the device plugin manifest from https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml (or by modifying the Helm template; either works).

The reason is that k3s detects the NVIDIA container runtime, but it does not make it the default one. The Helm chart and the nvidia-device-plugin.yml manifest expect the default runtime to be nvidia, which it is not. Explicitly adding runtimeClassName, or using a k3s config template that makes nvidia the default runtime, will solve the issue.
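For illustration, the relevant change in the DaemonSet pod template looks roughly like this (a sketch; the container name and image stand in for whatever is in your copy of the manifest; the only addition is the runtimeClassName line):

spec:
  template:
    spec:
      runtimeClassName: nvidia   # run the plugin with the nvidia containerd runtime
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1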


xinmans commented Oct 15, 2023

> > After trying out all the suggestions from here and other issues, I got it working by following this blog https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116
>
> That link gives me HTTP 404
>
> However, I have solved the could not load NVML library: libnvidia-ml.so.1 issue by adding runtimeClassName: nvidia to the device plugin manifest from https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml (or by modifying the Helm template; either works).
>
> The reason is that k3s detects the NVIDIA container runtime, but it does not make it the default one. The Helm chart and the nvidia-device-plugin.yml manifest expect the default runtime to be nvidia, which it is not. Explicitly adding runtimeClassName, or using a k3s config template that makes nvidia the default runtime, will solve the issue.

That did not work:
Error creating: pods "nvidia-device-plugin-daemonset-" is forbidden: pod rejected: RuntimeClass "nvidia" not found


matusnovak commented Oct 16, 2023

> That did not work:
> Error creating: pods "nvidia-device-plugin-daemonset-" is forbidden: pod rejected: RuntimeClass "nvidia" not found

@xinmans Try applying this manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

And re-create the nvidia plugin.
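For example, if the plugin was deployed from the static nvidia-device-plugin.yml manifest (namespace and DaemonSet name as implied by the error above):

kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset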

Relevant: NVIDIA/k8s-device-plugin#406 (comment)

@henryford

> After trying out all the suggestions from here and other issues, I got it working by following this blog https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116

There's a dot at the end of the URL for some reason; that needs to be removed. In any case, the mentioned article uses the GPU Operator, which in turn uses the Operator Framework and automates this whole process. It worked immediately for me, YMMV.

https://github.com/NVIDIA/gpu-operator

Using helm:

$: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update

$: helm install --wait nvidiagpu \
     -n gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
     nvidia/gpu-operator

NAME: nvidiagpu
LAST DEPLOYED: Tue Aug  8 00:54:41 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

@kannan-scalers-ai

@henryford my bad, I updated the medium article link. Good to see that you got it working.

@cboettig

I cannot get K3s to recognize my GPU. I have followed the official docs, and my config.toml lists the nvidia entries:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true

nvidia-smi works as expected, even when testing with Docker directly (which uses the NVIDIA Container Toolkit, though I think my K3s is using the default containerd mode instead of Docker mode):

docker run --rm -ti --gpus all nvidia/cuda:12.3.1-runtime-ubuntu22.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Wed Jan 10 19:19:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080        Off | 00000000:0A:00.0 Off |                  N/A |
| 27%   34C    P8               3W / 225W |     15MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

But checking for GPU availability on my node I get:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME     GPUs
thelio   <none>

and any pod initialized with a GPU either remains in Init:CrashLoopBackOff (e.g. the pods created by the gpu-operator Helm chart shown above) or stays Pending, waiting for a GPU resource. What have I missed?

Notes/additional questions:

  • The docs mention that e.g. cuda-drivers-fabricmanager-515 and nvidia-headless-515-server must be installed in addition. As you can see in the nvidia-smi output, I'm on 545, but there is no 545 version of these packages available in the repos. Is that most likely my problem? How can it be resolved?
  • Some sources suggest that the nvidia runtime should also be the default runtime in containerd's config.toml. Is that really accurate? It looks to me like the official K3s docs make this opt-in via runtimeClassName: nvidia, which makes sense (no need to use it on pods that don't need a GPU), and I'm using a test spec along those lines (see the sketch after this list), so I don't think that's the issue?
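For reference, the kind of test pod I mean (a minimal sketch; the pod name and image tag are just what I'm testing with):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smi-test
spec:
  runtimeClassName: nvidia          # opt in to the nvidia containerd runtime
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1           # requires the device plugin to advertise GPUs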

@caroline-suse-rancher (Contributor)

I'm going to convert this to a discussion, as it seems like a K8s/NVIDIA-related issue rather than a k3s bug.

@k3s-io k3s-io locked and limited conversation to collaborators Jan 12, 2024
@caroline-suse-rancher caroline-suse-rancher converted this issue into discussion #9231 Jan 12, 2024

