Description
Opened on Sep 18, 2024
Describe the bug
When following this guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool
the nvidia-device-plugin fails to detect the GPU on the Ubuntu Linux OS. When I kubectl exec into the device plugin pod and manually run the plugin binary nvidia-device-plugin, I get the following error: NVML: Unknown Error.
Additionally, the GPU-enabled workload meant to test the GPU nodes does not work on either the UbuntuLinux or the AzureLinux OS SKU.
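For context, the "GPU-enabled workload" above refers to the sample workload from the linked guide. A minimal stand-in that exercises the same scheduling path might look like the pod below (the pod name and CUDA image tag are illustrative choices, not taken from the guide):

# Minimal sketch of a GPU test pod: it requests one GPU and just runs nvidia-smi.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # illustrative name
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"                      # matches the taint on the GPU node pool
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed public CUDA base image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1           # only schedulable if the device plugin is advertising GPUs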
To Reproduce
Steps to reproduce the behavior:
- Create a GPU node pool (node_vm_size: Standard_NC6s_v3, os_sku: UbuntuLinux)
- Create the gpu-resources namespace
- Create and apply the nvidia-device-plugin DaemonSet (a CLI sketch for these three steps follows the manifest below):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
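For reference, a rough sketch of how the three steps above can be run with the Azure CLI and kubectl; the resource group, cluster, node pool and file names are placeholders, and the --os-sku value for Ubuntu is an assumption rather than something copied from the guide:

az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --os-sku Ubuntu \
  --node-taints sku=gpu:NoSchedule       # matches the toleration in the DaemonSet

kubectl create namespace gpu-resources
kubectl apply -f nvidia-device-plugin-ds.yaml   # the manifest above, saved locally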
- Check that the GPUs are schedulable (a grep shortcut for this is shown after the output below):
kubectl get nodes
kubectl describe node <node name>
Name:               <node name>
Roles:              agent
Labels:             accelerator=nvidia
[...]
Capacity:
[...]
  nvidia.com/gpu:   1
[...]
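A quick way to confirm the same capacity from the shell (just a filter over the describe output, nothing AKS-specific):

kubectl describe node <node name> | grep -i "nvidia.com/gpu"   # shows Capacity and Allocatable entries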
- Find the nvidia-device-plugin pod with
kubectl get pods -n gpu-resources
- Exec into the pod with
kubectl exec -it <pod-name> -n gpu-resources -- /bin/bash
- Run:
nvidia-smi
It throws an error instead of printing the device details (see the additional in-pod checks sketched below).
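For reference, some checks that can be run inside the device plugin pod to narrow this down (illustrative commands, not from the guide):

nvidia-smi                          # fails here with "NVML: Unknown Error" on the UbuntuLinux nodes
ls -l /dev/nvidia*                  # GPU device nodes that the NVIDIA container runtime should inject
ldconfig -p | grep libnvidia-ml     # whether the NVML library is visible inside the container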
Expected behavior
The nvidia-device-plugin should work on the UbuntuLinux OS SKU. I have confirmed it works on the AzureLinux OS SKU, but we require it to function on Ubuntu, and the documentation suggests that it should.
Environment (please complete the following information):
- CLI Version 2.56.0
- Kubernetes version 1.29.4
Additional context
I0913 14:57:53.424196 23 main.go:199] Starting FS watcher.
I0913 14:57:53.424264 23 main.go:206] Starting OS watcher.
I0913 14:57:53.424503 23 main.go:221] Starting Plugins.
I0913 14:57:53.424525 23 main.go:278] Loading configuration.
I0913 14:57:53.425286 23 main.go:303] Updating config with default resource matching patterns.
I0913 14:57:53.425456 23 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0913 14:57:53.425487 23 main.go:317] Retrieving plugins.
E0913 14:57:53.433018 23 factory.go:68] Failed to initialize NVML: Unknown Error.
E0913 14:57:53.433038 23 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0913 14:57:53.433047 23 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0913 14:57:53.433054 23 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0913 14:57:53.433061 23 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0913 14:57:53.433070 23 factory.go:76] nvml init failed: Unknown Error
I0913 14:57:53.433081 23 main.go:346] No devices found. Waiting indefinitely.
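Since the plugin log points at the driver/runtime prerequisites, one way to inspect the node directly is a node debug pod (a sketch; the ubuntu image and the chroot step are just one way to do it, not something the AKS guide prescribes):

kubectl debug node/<node name> -it --image=ubuntu
# inside the debug container the node's root filesystem is mounted at /host
chroot /host
nvidia-smi                                       # should list the node's GPU if the driver is healthy
grep -i nvidia /etc/containerd/config.toml       # whether containerd has an nvidia runtime configured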