vgpu not restricting memory in the container #3384

Open
kunal642 opened this issue Apr 3, 2024 · 45 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kunal642

kunal642 commented Apr 3, 2024

What happened:

When running the vgpu example provided in the docs with a vgpu memory limit set, the container does not respect this limit: nvidia-smi inside the container still shows the full 32 GB of the V100.

What you expected to happen:

The GPU memory visible inside the container should be limited to the vgpu-memory configuration.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
Nvidia-smi version: 545.23.08
MIG M: NA

Environment:

  • Volcano Version: 1.8.x
  • Kubernetes version (use kubectl version): v1.28.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kunal642 kunal642 added the kind/bug label Apr 3, 2024
@kunal642 kunal642 changed the title from "Using volcano vgpu not restricting memory in the container" to "vgpu not restricting memory in the container" Apr 3, 2024
@lowang-bh
Member

/assign @archlitchi

@kunal642
Author

Hey @archlitchi, Can you suggest something for this?

@archlitchi
Contributor

archlitchi commented Apr 10, 2024

Hey @archlitchi, Can you suggest something for this?

Could you provide the following information:

  1. The vgpu task YAML you submitted
  2. The "env" output inside the container

@kunal642
Author

kunal642 commented Apr 10, 2024

 cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 200
  nodeSelector: ...
  tolerations: ...
EOF

The nodeSelector and tolerations are private, so I can't show them here. Let me know if these properties can also affect the behavior of vgpu.

@archlitchi
Contributor

 cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 3000
  nodeSelector: ...
  tolerations: ...
EOF

NodeSelector and tolerations are private, therefore can't show them here. Let me know if these properties can also affect the behavior of vgpu

Could you provide the 'env' result inside container?

@kunal642
Author

I won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.

@archlitchi
Contributor

Okay, please list the env entries that contain the keyword 'CUDA' or 'NVIDIA'.
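
For example, a minimal sketch assuming the pod name gpu-pod12 from the YAML above and kubectl exec access:

kubectl exec gpu-pod12 -c ubuntu-container -- env | grep -E 'CUDA|NVIDIA'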

@kunal642
Author

kunal642 commented Apr 10, 2024

I did not include the output of NVIDIA_REQUIRE_CUDA because it's too long to type. Please bear with me.

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

@archlitchi
Contributor

Did not print output of NVIDIA_REQUIRE_CUDA because its too long to type. Please bear with me

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

Hmm, is this container the one in the YAML file? You allocated 3G in your YAML, but here it only gets 200M. Besides, this is probably a CUDA image, not a plain ubuntu:18.04.

@kunal642
Author

kunal642 commented Apr 10, 2024

Sorry, I ran a different YAML; everything else is the same except the memory is 200m. I have updated the earlier comment as well.

@archlitchi
Contributor

sorry, i ran a different yaml, everything else is same except memory is 200m, updated the earlier comment as well

Please check whether the following files exist inside the container, AND that the size of each file is NOT 0 (a quick check is sketched after the list):

  1. /usr/local/vgpu/libvgpu.so
  2. /etc/ld.so.preload
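
A minimal sketch of that check, assuming the example pod name gpu-pod12 (ls exits non-zero if a path is missing):

kubectl exec gpu-pod12 -- ls -l /usr/local/vgpu/libvgpu.so /etc/ld.so.preload
kubectl exec gpu-pod12 -- cat /etc/ld.so.preload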

@kunal642
Author

/usr/local/vgpu/libvgpu.so -> exists with non-zero size
/etc/ld.so.preload -> does not exist

@archlitchi
Contributor

/usr/local/vgpu/libvgpu.so -> Exists with non 0 size /etc/ld.so.preload > does not exist

Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.

@kunal642
Author

okay, let me try this!!!

@kunal642
Author

kunal642 commented Apr 16, 2024

Hey @archlitchi, the mentioned error occurs on that same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?

@kunal642
Author

Hey @archlitchi, any other suggestions to fix this?

@EswarS

EswarS commented Apr 29, 2024

Hi @archlitchi, I am also facing the same issue with the Volcano vGPU feature. Could you guide me on enabling this feature? Thanks in advance.

@archlitchi
Contributor

archlitchi commented Apr 30, 2024

Hi @archlitchi , i am also facing same issue with volcano vGPU feature. Could you guide me enable this feature. Thanks in advance.

@kunal642

OK, I'm looking into it now. Sorry I didn't see your replies over the last two weeks.

@archlitchi
Contributor

@EswarS @kunal642 please use the image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead; the volcano image will no longer provide hard device isolation among containers due to community policies.
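
A minimal sketch of swapping the image, assuming the plugin was deployed from the upstream volcano-vgpu-device-plugin.yml manifest (adjust the file name and image field to your setup):

sed -i 's|image: .*volcano-vgpu-device-plugin.*|image: projecthami/volcano-vgpu-device-plugin:v1.9.0|' volcano-vgpu-device-plugin.yml
kubectl apply -f volcano-vgpu-device-plugin.yml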

@kunal642
Author

kunal642 commented May 2, 2024

@archlitchi is the usage the same for the vgpu-memory and vgpu-number configurations?

@EswarS

EswarS commented May 3, 2024

@EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with the Volcano 1.8.2 release package?

I deployed the device plugin and am facing the following errors:

Initializing …..
Fail to open shrreg ***.cache (errorno:11)
Fail to init shrreg ****.cache (errorno:9)
Fail to write shrreg ***.cache (errorno:9)
Fail to reseek shrreg ***.cache (errorno:9)
Fail to lock shrreg ***.cache (errorno:9)

@archlitchi
Contributor

@archlitchi is the usage same for vgpu-memory and vgpu-number configurations?

Yes. Can you run your task now?

@archlitchi
Contributor

@EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with volcano 1.8.2 release package.

I deployed the device plugin Facing following error Initializing ….. Fail to open shrreg ***.cache (errorno:11) Fail to init shrreg ****.cache (errorno:9) Fail to write shrreg ***.cache (errorno:9) Fail to reseek shrreg ***.cache (errorno:9) Fail to lock shrreg ***.cache (errorno:9)

The vgpu-device-plugin mounts the hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu". Please check whether the corresponding hostPath exists.

@EswarS

EswarS commented May 8, 2024

volumeMounts:
  - mountPath: /var/lib/kubelet/device-plugins
    name: device-plugin
  - mountPath: /usr/local/vgpu
    name: lib
  - mountPath: /tmp
    name: hosttmp

These are the volumes configured in the device-plugin DaemonSet.
Do I need to make any changes?

@archlitchi
Contributor

@EswarS No, I mean: after you submit a vgpu task to Volcano, please check the following (a quick check is sketched after the list):

  1. Does the corresponding folder "/tmp/vgpu/containers/{containerUID}_{ctrName}" exist on the corresponding GPU node?
  2. Does the folder "/tmp/vgpu" exist inside the vgpu-task container?
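
A minimal sketch of both checks (the pod name is a placeholder):

# 1. On the GPU node:
ls -ld /tmp/vgpu/containers/
# 2. Inside the vgpu task container:
kubectl exec <vgpu-pod> -- ls -ld /tmp/vgpu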

@AshinWu

AshinWu commented Jun 11, 2024

I have the same problem in version 1.8.1.
The graphics card is an RTX 3090 24GB and the container is set to the limit volcano.sh/vgpu-memory: '10240'. When executing nvidia-smi in the container, the card is shown with 10240MiB of memory, but in reality a process can use more than that, for example 20480MiB.
What I expect is that a process cannot use memory beyond the limit, which is 10240MiB.
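
For reference, a quick way to see what the container is advertised (a sketch; nvidia-smi only reports the advertised total, so over-allocation like the one described above only shows up when a process actually allocates past the limit):

kubectl exec <gpu-pod> -- nvidia-smi --query-gpu=memory.total,memory.used --format=csv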

@EswarS

EswarS commented Jun 12, 2024

@archlitchi,
We submitted the pod with kubectl and a pod volume mount path of emptyDir: {}, and it works. Our observation in this case is that the pod owner is root.

Do we really need to add the emptyDir: {} volume path?

When the same pod is submitted by another non-root user (a namespace user), it is not able to access the folder.
/tmp/vgpu has 777 root:root permissions at the node level.

Here I have a use case where different namespace users share the same GPU and /tmp/vgpu needs write permission. I cannot set the group to the namespace group.

Could you suggest how to handle this problem?

@archlitchi
Contributor

archlitchi commented Jun 13, 2024

@archlitchi , we submitted pod with kubectl and pod volume mount path : emptydir {}, it works. Our observation in this case is “pod owner is root.”

do we really need to add emptydir{} volume path ?

when same pod submitted by another non root user ( namespace user) , it is not able to access the folder. /tmp/vgpu has 777 root:root permission at node level.

here I have a usecase ,where different namespace users share same gpu and /tmp/vgpu needs write permission. I cannot set group as namespace group.

Could you suggest how to handle this problem.

Okay, I got it. We will try to mount it into the '/usr/local/vgpu' folder inside the container in the next version.

@EswarS

EswarS commented Jun 13, 2024

@archlitchi,
I have one more question: why can't we allocate more vgpus than physical GPUs in a single container?

@EswarS

EswarS commented Jun 13, 2024

@archlitchi , we submitted pod with kubectl and pod volume mount path : emptydir {}, it works. Our observation in this case is “pod owner is root.”
do we really need to add emptydir{} volume path ?
when same pod submitted by another non root user ( namespace user) , it is not able to access the folder. /tmp/vgpu has 777 root:root permission at node level.
here I have a usecase ,where different namespace users share same gpu and /tmp/vgpu needs write permission. I cannot set group as namespace group.
Could you suggest how to handle this problem.

okay, i got it, we will try to mount it into '/usr/local/vgpu' folder inside container in next version

Could you suggest the changes? We may try to make the build ourselves.
Are there any workarounds for the above problem?

The node-level /tmp/vgpu folder has 777 permissions.
But the pod is still not able to access it when we try to access it with an org user.
Is there any configuration we can make at the pod volume level?

@archlitchi
Contributor

Yes, you could modify the repo here (https://github.com/Project-HAMi/volcano-vgpu-device-plugin), in the file https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/pkg/plugin/vgpu/plugin.go:
line #351: change the env to 'fmt.Sprintf("/usr/local/vgpu/%v.cache", uuid.NewUUID())'
line #364: change the mountPoint to /usr/local/vgpu
Then build the image and it should work fine. Please let me know if it works.
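
As a rough sketch of the rebuild step (the registry and tag are placeholders, and the Dockerfile at the repo root is an assumption):

git clone https://github.com/Project-HAMi/volcano-vgpu-device-plugin
cd volcano-vgpu-device-plugin
# edit pkg/plugin/vgpu/plugin.go as described above
docker build -t <your-registry>/volcano-vgpu-device-plugin:custom .
docker push <your-registry>/volcano-vgpu-device-plugin:custom
# then point the device-plugin DaemonSet at the new image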

@archlitchi
Contributor

@archlitchi , I have one more question, why can’t we allocate vgpus more than physical gpus in a single container.

Suppose you have a 4-GPU node: there are only 4 GPU devices in your /dev/ folder, and we can't mount a non-existent GPU from /dev/ into the container.
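
For illustration, on a 4-GPU node the device files that can be mounted into a container look roughly like this (a sketch; exact entries vary with the driver setup):

ls /dev/nvidia*
# /dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3  /dev/nvidiactl  /dev/nvidia-uvm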

@AshinWu

AshinWu commented Jun 14, 2024

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

@archlitchi
Contributor

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

Yes, try Volcano v1.9 with this repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin

@EswarS

EswarS commented Jun 14, 2024

yes, you could modify the repo here (https://github.com/Project-HAMi/volcano-vgpu-device-plugin), in file https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/pkg/plugin/vgpu/plugin.go,
line #351 change the env to 'fmt.Sprintf("/usr/local/vgpu/%v.cache", uuid.NewUUID())'
line #364 change the mountPoint to /use/local/vgpu
and build the image ,it should be working fine, plz let me know if it works

It is not working. Is there any possibility to interact with you directly to solve this problem?

@AshinWu

AshinWu commented Jun 14, 2024

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

yes, try volcano v1.9 with this repo https://github.com/Project-HAMi/volcano-vgpu-device-plugin

Thanks, I will go and try.

@archlitchi
Contributor

@EswarS do you have a WeChat account? You can add my WeChat ID "xuanzong4493".

@EswarS

EswarS commented Jun 14, 2024

WeChat is not accessible for me. Is there any other app that would work: WhatsApp, Telegram, Facebook, LinkedIn, Zoom, etc.?

@archlitchi
Contributor

WeChat is not accessible, is there any other app helps WhatsApp, telegram, Fb, linkdin,zoom ..etc

I just registered a LinkedIn account, linkedin.com/in/mengxuan-li-5862a9314, try this.

@EswarS

EswarS commented Jun 19, 2024

@archlitchi, is it possible to mount on a network file system, including the vgpu cache file and vgpu lock?
Node-level permissions are controlled and writing is not allowed (SELinux enabled).

@EswarS

EswarS commented Jun 19, 2024

WeChat is not accessible, is there any other app helps WhatsApp, telegram, Fb, linkdin,zoom ..etc

i just registered a linkdin account, linkedin.com/in/mengxuan-li-5862a9314, try this

Could you please let me know your available time so that we can schedule a meeting?

@kunal642
Author

@archlitchi is there a way to check how many cores are allocated in the container? If we configure 50% of the cores, then we want to make sure that only 50% is allocated.
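
One rough way to inspect this is to look at the limit environment variables the device plugin injects, next to the memory limit shown earlier in this thread (a sketch; the exact variable name for the core/SM limit depends on the plugin version):

kubectl exec <gpu-pod> -- env | grep CUDA_DEVICE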

@Monokaix
Member

@archlitchi is there a way to check how many cores are allocated in the container? if we configure 50% cores, then we want to make sure that only 50% is allocated

GPU core limiting is supported in the latest version; is a percentage also supported? @archlitchi

@Monokaix
Member

Any progress here? @kunal642 Is your problem solved?
