vgpu not restricting memory in the container #3384
Comments
/assign @archlitchi
Hey @archlitchi, can you suggest something for this?
Could you provide the following information:
NodeSelector and tolerations are private, so I can't share them here. Let me know if these properties can also affect the behavior of vgpu.
Could you provide the 'env' output from inside the container?
I won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.
Okay, please list the env entries that contain the keyword 'CUDA' or 'NVIDIA'.
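For example, a filter along these lines run inside the container should list just those variables:

```bash
# run inside the workload container; prints only CUDA/NVIDIA-related environment variables
env | grep -E 'CUDA|NVIDIA'
```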
I did not include the output of NVIDIA_REQUIRE_CUDA because it is too long to type. Please bear with me.
Hmm... is this container the one in the yaml file? You allocated 3G in your yaml, but here it only gets 200M. Besides, this is probably a CUDA image, not a typical ubuntu:18.04.
Sorry, I ran a different yaml; everything else is the same except the memory is 200m. I updated the earlier comment as well.
Please check whether the following file exists inside the container AND its size is NOT 0:
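For reference, a check like this inside the container shows whether the file is present and non-empty (the path is the one named above):

```bash
# -s is true only if the file exists and has size > 0
test -s /usr/local/vgpu/libvgpu.so && echo "exists, non-zero size" || echo "missing or empty"
ls -l /usr/local/vgpu/libvgpu.so
```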
/usr/local/vgpu/libvgpu.so -> exists with non-zero size
Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.
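For reference, the relevant part of volcano-vgpu-device-plugin.yml would change roughly like this (the container name here is illustrative; only the image tag matters):

```yaml
# in volcano-vgpu-device-plugin.yml: point the device-plugin container at the dev image
containers:
  - name: volcano-device-plugin   # illustrative; keep whatever name the manifest already uses
    image: volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219
```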
Okay, let me try this!
Hey @archlitchi, the mentioned error occurs on that same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?
Hey @archlitchi, any other suggestions to fix this?
Hi @archlitchi, I am also facing the same issue with the Volcano vGPU feature. Could you guide me in enabling this feature? Thanks in advance.
OK, I'm looking into it now. Sorry I didn't see your replies these last two weeks.
@archlitchi is the usage the same for the vgpu-memory and vgpu-number configurations?
Is this device plugin compatible with the Volcano 1.8.2 release package? I deployed the device plugin.
Yes, can you run your task now?
The vgpu-device-plugin mounts the hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu". Please check whether the corresponding hostPath exists.
VolumeMounts like the above are the volumes configured in the device-plugin daemon.
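A quick way to verify this on the GPU node (the container UID and name differ per pod):

```bash
# list the per-container directories the device plugin is expected to create on the host
ls -l /tmp/vgpu/containers/
# a matching entry should look like /tmp/vgpu/containers/<containerUID>_<ctrName>
```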
@EswarS No, I mean: after you submit a vgpu task to Volcano, please check
I have the same problem with version 1.8.1.
@archlitchi, do we really need to add the emptyDir{} volume path? When the same pod is submitted by another non-root user (namespace user), it is not able to access the folder. I have a use case where different namespace users share the same GPU and /tmp/vgpu needs write permission. I cannot set the group to the namespace group. Could you suggest how to handle this problem?
Is libvgpu.so loaded successfully?
Check the logs of your pod, which should confirm that.
If it is already loaded, the problem may be with the mounts; check your pod's
securityContext fsGroup.
Also check the volume mounts on your pod, and check the permissions of the
/tmp/vgpu and /tmp/vgpulock folders on the node.
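A rough sketch of those checks (pod name, namespace, and the log text to look for are placeholders):

```bash
# 1. did libvgpu.so get loaded? look for vgpu-related messages in the workload pod logs
kubectl logs <pod-name> -n <namespace> | grep -i vgpu

# 2. inspect the pod's volume mounts and securityContext (fsGroup) as actually deployed
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -iE -A5 'fsGroup|volumeMounts'

# 3. on the node: check ownership and permissions of the directories the hook writes to
ls -ld /tmp/vgpu /tmp/vgpulock
```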
On Tue, 11 Jun 2024 at 9:38 AM, Ashin Woo wrote:
I have the same problem. The graphics card is an RTX 3090 24GB and the container is set to limits volcano.sh/vgpu-memory: '10240'. When executing nvidia-smi in the container, although the card displays a memory of 10240MiB, in reality the process can use more memory, such as 20480MiB. What I expect is that the process cannot use memory beyond the limit, which is 10240MiB.
Okay, I got it. We will try to mount it into the '/usr/local/vgpu' folder inside the container in the next version.
@archlitchi, could you suggest the changes? We may try to make the build ourselves. The node-level /tmp/vgpu folder has 777 permissions.
Yes, you could modify the repo here (https://github.com/Project-HAMi/volcano-vgpu-device-plugin), in the file https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/pkg/plugin/vgpu/plugin.go.
Suppose you have a 4-GPU node: there are only 4 GPUs in your /dev/ folder, so we can't mount a non-existing GPU from /dev/ into the container.
Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?
Yes, try Volcano v1.9 with this repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin
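A sketch of trying that, assuming the plugin manifest is applied from a local checkout of the repo (adjust the path to wherever the yml lives in that repo):

```bash
git clone https://github.com/Project-HAMi/volcano-vgpu-device-plugin
kubectl apply -f volcano-vgpu-device-plugin/volcano-vgpu-device-plugin.yml
```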
It is not working. Is there any possibility to interact with you directly to solve this problem?
Thanks, I will go and try.
@EswarS do you have WeChat? You can add my WeChat ID "xuanzong4493".
WeChat is not accessible. Is there any other app that would work: WhatsApp, Telegram, Facebook, LinkedIn, Zoom, etc.?
I just registered a LinkedIn account, linkedin.com/in/mengxuan-li-5862a9314, try this.
@archlitchi, is it possible to mount on a network file system, including the vgpu cache file and vgpulock?
Could you please let me know your available time so that we can schedule a meeting?
@archlitchi is there a way to check how many cores are allocated in the container? If we configure 50% of the cores, we want to make sure that only 50% is allocated.
GPU core limiting is supported in the latest version; is a percentage also supported? @archlitchi
Any progress here? @kunal642, is your problem solved?
What happened:
When running the vgpu example provided in the docs with a vgpu memory limit set, the container does not respect this limit, as shown by the nvidia-smi command (32GB of memory is shown in the output for a V100).
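For context, a pod spec of the kind discussed in this thread looks roughly like this (the image, names, and values are illustrative; the resource names volcano.sh/vgpu-number and volcano.sh/vgpu-memory are the ones mentioned elsewhere in the thread):

```yaml
# illustrative pod: requests one vGPU slice with a device-memory cap via the Volcano scheduler
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  schedulerName: volcano
  containers:
    - name: cuda-test
      image: nvidia/cuda:11.8.0-base-ubuntu18.04   # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          volcano.sh/vgpu-number: "1"      # number of vGPU slices
          volcano.sh/vgpu-memory: "3000"   # device memory cap in MiB (illustrative value)
```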
What you expected to happen:
The memory inside the container should be limited to the vgpu-memory configuration.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
nvidia-smi version: 545.23.08
MIG mode: N/A
Environment:
kubectl version: v1.28.x
uname -a: