A kernel module that hooks the NVIDIA ioctl API so that nvidia-smi shows processes inside containers that do not use the host PID namespace.

tedli/nvidia-pidns

nvidia-pidns

A patch for the NVIDIA kernel module that lets nvidia-smi, run inside a container that is not in host PID mode, show the container's own processes.

Background

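With the closed-source NVIDIA driver, nvidia-smi run inside a container that is not in host PID mode cannot list the container's processes that use the GPU. The cause is that when nvidia-smi calls ioctl, the interface returns host PIDs. Inside the container, nvidia-smi then reads /proc/<pid>/cmdline to get the process name, and that read fails because the host PID does not exist in the container's PID namespace.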
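The failure described above comes down to whether a given PID can be resolved under /proc from the current PID namespace. The sketch below illustrates only that check. The helper names `pid_visible` and `describe_pid` are hypothetical and are not part of nvidia-smi or this module.

```shell
# Hypothetical helper: succeeds if /proc/<pid>/cmdline is readable
# from the current PID namespace.
pid_visible() {
  [ -r "/proc/$1/cmdline" ]
}

# Hypothetical wrapper: report how a lookup by PID would fare.
describe_pid() {
  if pid_visible "$1"; then
    echo "visible"
  else
    echo "not visible"
  fi
}

# The current shell's PID always resolves in its own namespace.
describe_pid "$$"
```

Passing a host PID to `describe_pid` from inside a container that is not in host PID mode would typically take the "not visible" branch, which is the situation the patch addresses.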

Run modinfo nvidia to check whether the driver is closed source: if the license field is NVIDIA, it is the closed-source driver.
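The license check can be scripted as a minimal sketch like the one below. `classify_license` is a hypothetical helper. The only rule taken from this README is that a license of NVIDIA means closed source, and any other value is simply reported as not matching.

```shell
# Hypothetical helper: classify a modinfo license string.
classify_license() {
  case "$1" in
    NVIDIA)     echo "closed-source" ;;
    unknown|"") echo "nvidia module not found" ;;
    *)          echo "not closed-source by this check" ;;
  esac
}

# modinfo -F prints a single field; fall back to "unknown" when the
# nvidia module (or modinfo itself) is not available.
license=$(modinfo -F license nvidia 2>/dev/null || echo unknown)
classify_license "$license"
```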

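NVIDIA also provides an open-source driver, which fixes this problem. However, it only shows the processes; it cannot show how much GPU memory each process uses.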

Prior work

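Several earlier repositories were written by experts who, before NVIDIA open-sourced its driver, reverse-engineered it and "guessed" a patch that fixes this problem. Respect! :bow: However, those repositories are no longer maintained, and a Linux kernel commit removed the d_instantiate_anon interface, so the existing code no longer compiles on kernel 6.8 and later.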
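To tell whether a host falls on the affected side of that boundary, a version check along these lines may help. It is a sketch: `kernel_at_least` is a hypothetical helper, and the 6.8 threshold is taken from the statement above, not verified against the kernel history.

```shell
# Hypothetical helper: succeed if a kernel release string is at least
# MAJOR.MINOR. $1 = release (e.g. from `uname -r`), $2 = major, $3 = minor
kernel_at_least() {
  major=${1%%.*}            # text before the first dot
  rest=${1#*.}              # text after the first dot
  minor=${rest%%[!0-9]*}    # leading digits of the remainder
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Assumption: 6.8 is the first release where the older patches stop building.
if kernel_at_least "$(uname -r)" 6 8; then
  echo "kernel is 6.8 or later: code relying on d_instantiate_anon is expected not to build"
else
  echo "kernel is older than 6.8"
fi
```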

Build

make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
make -C /lib/modules/$(uname -r)/build M=$(pwd) clean

Installation example

# Not verified after sanitization; for reference only
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-pidns-insmod
  labels:
    app: nvidia-pidns-insmod
spec:
  selector:
    matchLabels:
      app: nvidia-pidns-insmod
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        app: nvidia-pidns-insmod
    spec:
      containers:
        - name: insmod
          image: ubuntu:jammy-20240911.1
          command:
            - bash
          args:
            - -c
            - >
              if ! nsenter -t 1 -a lsmod | grep nvidia_pidns > /dev/null; then
              cp -Lrf /opt/nvidia-pidns/nvidia-pidns-$(uname -r).tar.xz /opt/host-tmp;
              nsenter -t 1 -a bash -c "tar -Jxf /tmp/nvidia-pidns-$(uname -r).tar.xz -C /;
              insmod /lib/modules/$(uname -r)/kernel/drivers/nvidia/nvidia-pidns.ko";
              fi;
              sleep infinity
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 200m
              memory: 256Mi
          securityContext:
            privileged: true
          volumeMounts:
            - name: nvidia-pidns
              mountPath: /opt/nvidia-pidns
              readOnly: true
            - name: tmp
              mountPath: /opt/host-tmp
      volumes:
        - name: nvidia-pidns
          configMap:
            name: nvidia-pidns-insmod
            defaultMode: 420
        - name: tmp
          hostPath:
            path: /tmp
            type: Directory
      hostNetwork: true
      hostPID: true
      hostIPC: true
      dnsPolicy: ClusterFirstWithHostNet
      tolerations:
        - operator: Exists
      automountServiceAccountToken: false
      priorityClassName: system-cluster-critical
---
apiVersion: v1
binaryData:
  nvidia-pidns-5.15.0-25-generic.tar.xz: 
  nvidia-pidns-5.15.0-107-generic.tar.xz: 
  nvidia-pidns-5.15.0-122-generic.tar.xz: 
kind: ConfigMap
metadata:
  name: nvidia-pidns-insmod
  labels:
    app: nvidia-pidns-insmod
    component: modules
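The DaemonSet above extracts nvidia-pidns-<kernel release>.tar.xz with -C / and then loads /lib/modules/<kernel release>/kernel/drivers/nvidia/nvidia-pidns.ko, so each archive in the ConfigMap presumably carries the module at that path. One way such an archive might be produced is sketched below. `package_module` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper: pack a built module into the layout the DaemonSet's
# `tar -Jxf ... -C /` and `insmod` steps expect.
# $1 = path to nvidia-pidns.ko, $2 = kernel release, $3 = output directory
package_module() {
  ko=$1; release=$2; outdir=$3
  member="lib/modules/$release/kernel/drivers/nvidia/nvidia-pidns.ko"
  stage=$(mktemp -d) || return 1
  mkdir -p "$stage/${member%/*}" && cp "$ko" "$stage/$member" \
    || { rm -rf "$stage"; return 1; }
  # Archive only the file member, without directory entries, so that
  # extraction under / does not touch existing directories such as /lib.
  tar -C "$stage" -Jcf "$outdir/nvidia-pidns-$release.tar.xz" "$member"
  status=$?
  rm -rf "$stage"
  return $status
}
```

For example, `package_module ./nvidia-pidns.ko "$(uname -r)" .` would write the archive for the running kernel into the current directory.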
