155 changes: 155 additions & 0 deletions docs/add-ons/gpu_operators.md
@@ -0,0 +1,155 @@
---
title: GPU Operators
---

## Deploy NVIDIA operator

The [NVIDIA operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) allows administrators of Kubernetes clusters to manage GPUs just like CPUs. It includes everything needed for pods to be able to use GPUs.

### Host OS requirements

To expose the GPU to the pod correctly, the NVIDIA kernel drivers and the `libnvidia-ml` library must be correctly installed in the host OS. The NVIDIA Operator can automatically install drivers and libraries on some operating systems; check the NVIDIA documentation for information on [supported operating system releases](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms). Installation of the NVIDIA components on your host OS is out of the scope of this document; reference the NVIDIA documentation for instructions.

The following three commands should return correct output if the kernel driver is correctly installed:

1. `lsmod | grep nvidia`

Returns a list of NVIDIA kernel modules, for example:

```
nvidia_uvm 2129920 0
nvidia_drm 131072 0
nvidia_modeset 1572864 1 nvidia_drm
video 77824 1 nvidia_modeset
nvidia 9965568 2 nvidia_uvm,nvidia_modeset
ecc 45056 1 nvidia
```

2. `cat /proc/driver/nvidia/version`

Returns the NVRM and GCC versions of the driver. For example:

```
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 555.42.06 Release Build (abuild@host) Thu Jul 11 12:00:00 UTC 2024
GCC version: gcc version 7.5.0 (SUSE Linux)
```

3. `find /usr/ -iname libnvidia-ml.so`

Returns the path to the `libnvidia-ml.so` library. For example:

```
/usr/lib64/libnvidia-ml.so
```

This library is used by Kubernetes components to interact with the kernel driver.
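
As an additional sanity check, you can also query the GPU directly with the `nvidia-smi` utility, assuming it was installed alongside the kernel driver:

```bash
# Lists the detected GPUs, the driver version, and the supported CUDA version
nvidia-smi
```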


### Operator installation

Once the OS is ready and RKE2 is running, install the GPU Operator with the following YAML manifest:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  version: v25.3.4
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
    devicePlugin:
      env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
```
:::warning
The NVIDIA operator restarts containerd with a hangup call, which in turn restarts RKE2.
:::

:::info
The environment variables `ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED`, `ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS`, and `DEVICE_LIST_STRATEGY` are required to properly isolate GPU resources, as explained in this NVIDIA [document](https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?tab=t.0).
:::
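
A minimal way to deploy this manifest, assuming a default RKE2 server installation and that the manifest above is saved as `gpu-operator.yaml` (a hypothetical filename), is to place it in the auto-deploy manifests directory or apply it with `kubectl`:

```bash
# Option 1: let RKE2 deploy it automatically (default server manifest path)
sudo cp gpu-operator.yaml /var/lib/rancher/rke2/server/manifests/

# Option 2: apply it directly against the cluster
kubectl apply -f gpu-operator.yaml

# Watch the operator components come up in the target namespace
kubectl get pods -n gpu-operator --watch
```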

After approximately one minute, you can perform the following checks to verify that everything worked as expected:

1. Assuming the drivers and `libnvidia-ml.so` library were previously installed, check if the operator detects them correctly:
```bash
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | grep "nvidia.com/gpu.deploy.driver"
```
You should see the value `pre-installed`. If you see `true`, the drivers were not correctly installed. If the [prerequisites](#host-os-requirements) were met, you may have forgotten to reboot the node after installing all packages.

You can also check other driver labels with:
```bash
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | grep "nvidia.com"
```
You should see labels describing the driver and GPU (e.g. `nvidia.com/gpu.machine` or `nvidia.com/cuda.driver.major`).

2. Check if the GPU was added (by the `nvidia-device-plugin-daemonset`) as an allocatable resource on the node:
```bash
kubectl get node $NODENAME -o jsonpath='{.status.allocatable}'
```
You should see `"nvidia.com/gpu":` followed by the number of gpus in the node

3. Check that the container runtime binary was installed by the operator (in particular by the `nvidia-container-toolkit-daemonset`):
```bash
ls /usr/local/nvidia/toolkit/nvidia-container-runtime
```

4. Verify that the containerd config was updated to include the NVIDIA container runtime:
```bash
grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml
```

5. Run a pod to verify that the GPU resource can be successfully scheduled and that the pod can detect it:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
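
For example, assuming the manifest above is saved as `nbody-gpu-benchmark.yaml` (a hypothetical filename), you can run the benchmark and inspect its output:

```bash
kubectl apply -f nbody-gpu-benchmark.yaml

# Once the pod completes, the benchmark results are visible in its logs
kubectl logs -n default nbody-gpu-benchmark
```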

:::info Version Gate
Available as of October 2024 releases: v1.28.15+rke2r1, v1.29.10+rke2r1, v1.30.6+rke2r1, v1.31.2+rke2r1.
:::

RKE2 will now use `PATH` to find alternative container runtimes, in addition to checking the default paths used by the container runtime packages. In order to use this feature, you must modify the RKE2 service's PATH environment variable to add the directories containing the container runtime binaries.

It is recommended that you modify one of these two environment files:

- `/etc/default/rke2-server` # or rke2-agent
- `/etc/sysconfig/rke2-server` # or rke2-agent

This example adds the current `PATH` to `/etc/default/rke2-server`:

```bash
echo PATH=$PATH >> /etc/default/rke2-server
```
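
After updating the environment file, restart the service so the new `PATH` takes effect; a sketch for a systemd-managed server node (use `rke2-agent` on agent nodes):

```bash
sudo systemctl restart rke2-server.service
```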

:::warning
`PATH` changes should be done with care to avoid placing untrusted binaries in the path of services that run as root.
:::


4 changes: 2 additions & 2 deletions docs/helm.md → docs/add-ons/helm.md
@@ -1,10 +1,10 @@
---
title: Helm Integration
title: Helm
---

Helm is the package management tool of choice for Kubernetes. Helm charts provide templating syntax for Kubernetes YAML manifest documents. With Helm we can create configurable deployments instead of just using static files. For more information about creating your own catalog of deployments, check out the docs at [https://helm.sh/docs/intro/quickstart/](https://helm.sh/docs/intro/quickstart/).

RKE2 does not require any special configuration to use with Helm command-line tools. Just be sure you have properly set up your kubeconfig as per the section about [cluster access](./cluster_access.md). RKE2 does include some extra functionality to make deploying both traditional Kubernetes resource manifests and Helm Charts even easier with the [rancher/helm-release CRD.](#using-the-helm-crd)
RKE2 does not require any special configuration to use with Helm command-line tools. Just be sure you have properly set up your kubeconfig as per the section about [cluster access](../cluster_access.md). RKE2 does include some extra functionality to make deploying both traditional Kubernetes resource manifests and Helm Charts even easier with the [rancher/helm-release CRD.](#using-the-helm-crd)

## Automatically Deploying Manifests and Helm Charts

10 changes: 5 additions & 5 deletions docs/import-images.md → docs/add-ons/import-images.md
@@ -6,7 +6,7 @@ Container images are cached locally on each node by the containerd image store.

## On-demand image pulling

Kubernetes, by default, automatically pulls images when a Pod requires them if the image is not already present on the node. This behavior can be changed by using the [image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) field of the Pod. When using the default `IfNotPresent` policy, containerd will pull the image from either upstream (default) or your [private registry](install/private_registry.md) and store it in its image store. Users do not need to apply any additional configuration for on-demand image pulling to work.
Kubernetes, by default, automatically pulls images when a Pod requires them if the image is not already present on the node. This behavior can be changed by using the [image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) field of the Pod. When using the default `IfNotPresent` policy, containerd will pull the image from either upstream (default) or your [private registry](../install/private_registry.md) and store it in its image store. Users do not need to apply any additional configuration for on-demand image pulling to work.


## Pre-import images
@@ -22,7 +22,7 @@ RKE2 includes two mechanisms to pre-import images into the containerd image stor
<Tabs groupId="import-images" queryString>
<TabItem value="Online image importing" default>

Users can trigger a pull of images into the containerd image store by placing a text file containing the image names, one per line, in the `/var/lib/rancher/k3s/agent/images` directory. The text file can be placed before RKE2 is started, or created/modified while RKE2 is running. RKE2 will sequentially pull the images via the CRI API, optionally using the [registries.yaml](install/private_registry.md) configuration.
Users can trigger a pull of images into the containerd image store by placing a text file containing the image names, one per line, in the `/var/lib/rancher/k3s/agent/images` directory. The text file can be placed before RKE2 is started, or created/modified while RKE2 is running. RKE2 will sequentially pull the images via the CRI API, optionally using the [registries.yaml](../install/private_registry.md) configuration.

For example:

@@ -58,7 +58,7 @@ After a few seconds, the images included in the image tarball will be available

Use `ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images list` to query the containerd image store.

This is the method used in Airgap. Please follow the [Airgap install documentation](install/airgap.md) for detailed information.
This is the method used in Airgap. Please follow the [Airgap install documentation](../install/airgap.md) for detailed information.

</TabItem>
</Tabs>
@@ -67,6 +67,6 @@ This is the method used in Airgap. Please follow the [Airgap install documentati

RKE2 supports two alternatives for image registries:

* [Private Registry Configuration](install/private_registry.md) covers use of `registries.yaml` to configure container image registry authentication and mirroring.
* [Private Registry Configuration](../install/private_registry.md) covers use of `registries.yaml` to configure container image registry authentication and mirroring.

* [Embedded Registry Mirror](install/registry_mirror.md) shows how to enable the embedded distributed image registry mirror, for peer-to-peer sharing of images between nodes.
* [Embedded Registry Mirror](../install/registry_mirror.md) shows how to enable the embedded distributed image registry mirror, for peer-to-peer sharing of images between nodes.
144 changes: 1 addition & 143 deletions docs/advanced.md
Expand Up @@ -27,7 +27,7 @@ It is also possible to rotate an individual service by passing the `--service` f

Any file found in `/var/lib/rancher/rke2/server/manifests` will automatically be deployed to Kubernetes in a manner similar to `kubectl apply`.

For information about deploying Helm charts using the manifests directory, refer to the section about [Helm.](helm.md)
For information about deploying Helm charts using the manifests directory, refer to the section about [Helm.](add-ons/helm.md)

## Configuring containerd

@@ -288,146 +288,4 @@ kube-apiserver-extra-env:
kube-scheduler-extra-env: "TZ=America/Los_Angeles"
```

## Deploy NVIDIA operator

The [NVIDIA operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) allows administrators of Kubernetes clusters to manage GPUs just like CPUs. It includes everything needed for pods to be able to operate GPUs.

### Host OS requirements

To expose the GPU to the pod correctly, the NVIDIA kernel drivers and the `libnvidia-ml` library must be correctly installed in the host OS. The NVIDIA Operator can automatically install drivers and libraries on some operating systems; check the NVIDIA documentation for information on [supported operating system releases](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms). Installation of the NVIDIA components on your host OS is out of the scope of this document; reference the NVIDIA documentation for instructions.

The following three commands should return a correct output if the kernel driver was correctly installed:

1 - `lsmod | grep nvidia`

Returns a list of nvidia kernel modules, for example:

```
nvidia_uvm 2129920 0
nvidia_drm 131072 0
nvidia_modeset 1572864 1 nvidia_drm
video 77824 1 nvidia_modeset
nvidia 9965568 2 nvidia_uvm,nvidia_modeset
ecc 45056 1 nvidia
```

2 - `cat /proc/driver/nvidia/version`

returns the NVRM and GCC version of the driver. For example:

```
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 555.42.06 Release Build (abuild@host) Thu Jul 11 12:00:00 UTC 2024
GCC version: gcc version 7.5.0 (SUSE Linux)
```

3 - `find /usr/ -iname libnvidia-ml.so`

returns a path to the `libnvidia-ml.so` library. For example:

```
/usr/lib64/libnvidia-ml.so
```

This library is used by Kubernetes components to interact with the kernel driver.


### Operator installation ###

Once the OS is ready and RKE2 is running, install the GPU Operator with the following yaml manifest:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: gpu-operator
namespace: kube-system
spec:
repo: https://helm.ngc.nvidia.com/nvidia
chart: gpu-operator
targetNamespace: gpu-operator
createNamespace: true
valuesContent: |-
toolkit:
env:
- name: CONTAINERD_SOCKET
value: /run/k3s/containerd/containerd.sock
```
:::warning
The NVIDIA operator restarts containerd with a hangup call which restarts RKE2
:::

After one minute approximately, you can make the following checks to verify that everything worked as expected:

1 - Assuming the drivers and `libnvidia-ml.so` library were previously installed, check if the operator detects them correctly:
```
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | grep "nvidia.com/gpu.deploy.driver"
```
You should see the value `pre-installed`. If you see `true`, the drivers were not correctly installed. If the [pre-requirements](#host-os-requirements) were correct, it is possible that you forgot to reboot the node after installing all packages.

You can also check other driver labels with:
```
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | jq | grep "nvidia.com"
```
You should see labels specifying driver and GPU (e.g. nvidia.com/gpu.machine or nvidia.com/cuda.driver.major)

2 - Check if the gpu was added (by nvidia-device-plugin-daemonset) as an allocatable resource in the node:
```
kubectl get node $NODENAME -o jsonpath='{.status.allocatable}' | jq
```
You should see `"nvidia.com/gpu":` followed by the number of gpus in the node

3 - Check that the container runtime binary was installed by the operator (in particular by the `nvidia-container-toolkit-daemonset`):
```
ls /usr/local/nvidia/toolkit/nvidia-container-runtime
```

4 - Verify if containerd config was updated to include the nvidia container runtime:
```
grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml
```

5 - Run a pod to verify that the GPU resource can successfully be scheduled on a pod and the pod can detect it
```yaml
apiVersion: v1
kind: Pod
metadata:
name: nbody-gpu-benchmark
namespace: default
spec:
restartPolicy: OnFailure
runtimeClassName: nvidia
containers:
- name: cuda-container
image: nvcr.io/nvidia/k8s/cuda-sample:nbody
args: ["nbody", "-gpu", "-benchmark"]
resources:
limits:
nvidia.com/gpu: 1
env:
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
```

:::info Version Gate
Available as of October 2024 releases: v1.28.15+rke2r1, v1.29.10+rke2r1, v1.30.6+rke2r1, v1.31.2+rke2r1.
:::

RKE2 will now use `PATH` to find alternative container runtimes, in addition to checking the default paths used by the container runtime packages. In order to use this feature, you must modify the RKE2 service's PATH environment variable to add the directories containing the container runtime binaries.

It's recommended that you modify one of this two environment files:

- /etc/default/rke2-server # or rke2-agent
- /etc/sysconfig/rke2-server # or rke2-agent

This example will add the `PATH` in `/etc/default/rke2-server`:

```bash
echo PATH=$PATH >> /etc/default/rke2-server
```

:::warning
`PATH` changes should be done with care to avoid placing untrusted binaries in the path of services that run as root.
:::


2 changes: 1 addition & 1 deletion docs/install/packaged_components.md
@@ -44,7 +44,7 @@ For any file under `/var/lib/rancher/rke2/server/manifests`, you can create a `.

## Helm AddOns

For information about managing Helm charts via auto-deploying manifests, refer to the section about [Helm.](../helm.md)
For information about managing Helm charts via auto-deploying manifests, refer to the section about [Helm.](../add-ons/helm.md)


