vgpu not restricting memory in the container #3384

Open
kunal642 opened this issue Apr 3, 2024 · 45 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kunal642

kunal642 commented Apr 3, 2024

What happened:

When running the vgpu example provided in the docs with a vgpu memory limit set, the container does not respect this limit: nvidia-smi inside the container still shows the full 32 GB of the V100.

What you expected to happen:

The GPU memory visible inside the container should be limited to the vgpu-memory configuration.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
Nvidia-smi version: 545.23.08
MIG M: NA

Environment:

  • Volcano Version: 1.8.x
  • Kubernetes version (use kubectl version): v1.28.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kunal642 kunal642 added the kind/bug label Apr 3, 2024
@kunal642 kunal642 changed the title from "Using volcano vgpu not restricting memory in the container" to "vgpu not restricting memory in the container" Apr 3, 2024
@lowang-bh
Member

/assign @archlitchi

@kunal642
Author

Hey @archlitchi, Can you suggest something for this?

@archlitchi
Contributor

archlitchi commented Apr 10, 2024

Hey @archlitchi, Can you suggest something for this?

Could you provide the following information:

  1. The vgpu task YAML you submitted
  2. The "env" output inside the container

@kunal642
Author

kunal642 commented Apr 10, 2024

 cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 200
  nodeSelector: ...
  tolerations: ...
EOF

The nodeSelector and tolerations are private, so I can't show them here. Let me know if these properties can also affect the behavior of vgpu.

@archlitchi
Contributor

 cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 3000
  nodeSelector: ...
  tolerations: ...
EOF

NodeSelector and tolerations are private, therefore can't show them here. Let me know if these properties can also affect the behavior of vgpu

Could you provide the 'env' result inside container?

@kunal642
Author

I won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.

@archlitchi
Contributor

Okay, please list the env entries that contain the keyword 'CUDA' or 'NVIDIA'.
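
For example, a minimal sketch assuming the pod name gpu-pod12 from the YAML above and kubectl exec access:

kubectl exec gpu-pod12 -c ubuntu-container -- env | grep -E 'CUDA|NVIDIA'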

@kunal642
Author

kunal642 commented Apr 10, 2024

I did not include the output of NVIDIA_REQUIRE_CUDA because it's too long to type. Please bear with me.

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

@archlitchi
Contributor

Did not print output of NVIDIA_REQUIRE_CUDA because its too long to type. Please bear with me

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

Hmm, is this container the one in the YAML file? You allocated 3G in your YAML, but here it only gets 200M. Besides, this is probably a CUDA image, not a plain ubuntu:18.04.

@kunal642
Author

kunal642 commented Apr 10, 2024

Sorry, I ran a different YAML; everything else is the same except the memory is 200m. I have updated the earlier comment as well.

@archlitchi
Contributor

sorry, i ran a different yaml, everything else is same except memory is 200m, updated the earlier comment as well

Please check whether the following files exist inside the container, AND that the size of each file is NOT 0 (a quick check is sketched after the list):

  1. /usr/local/vgpu/libvgpu.so
  2. /etc/ld.so.preload
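
A minimal sketch of that check, assuming the example pod name gpu-pod12 (ls exits non-zero if a path is missing):

kubectl exec gpu-pod12 -- ls -l /usr/local/vgpu/libvgpu.so /etc/ld.so.preload
kubectl exec gpu-pod12 -- cat /etc/ld.so.preload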

@kunal642
Author

/usr/local/vgpu/libvgpu.so -> exists with non-zero size
/etc/ld.so.preload -> does not exist

@archlitchi
Contributor

/usr/local/vgpu/libvgpu.so -> Exists with non 0 size /etc/ld.so.preload > does not exist

Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.

@kunal642
Author

okay, let me try this!!!

@kunal642
Author

kunal642 commented Apr 16, 2024

Hey @archlitchi, the mentioned error occurs on that same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?

@kunal642
Author

Hey @archlitchi, any other suggestions to fix this?

@EswarS

EswarS commented Apr 29, 2024

Hi @archlitchi, I am also facing the same issue with the Volcano vGPU feature. Could you guide me on enabling this feature? Thanks in advance.

@archlitchi
Contributor

archlitchi commented Apr 30, 2024

Hi @archlitchi , i am also facing same issue with volcano vGPU feature. Could you guide me enable this feature. Thanks in advance.

@kunal642

OK, I'm looking into it now. Sorry I didn't see your replies over the last two weeks.

@archlitchi
Contributor

@EswarS @kunal642 please use the image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead; the volcano image will no longer provide hard device isolation among containers due to community policies.
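
A minimal sketch of swapping the image, assuming the plugin was deployed from the upstream volcano-vgpu-device-plugin.yml manifest (adjust the file name and image field to your setup):

sed -i 's|image: .*volcano-vgpu-device-plugin.*|image: projecthami/volcano-vgpu-device-plugin:v1.9.0|' volcano-vgpu-device-plugin.yml
kubectl apply -f volcano-vgpu-device-plugin.yml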

@kunal642
Author

kunal642 commented May 2, 2024

@archlitchi is the usage the same for the vgpu-memory and vgpu-number configurations?

@EswarS

EswarS commented May 3, 2024

@EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with the Volcano 1.8.2 release package?

I deployed the device plugin and am facing the following errors:

Initializing …..
Fail to open shrreg ***.cache (errorno:11)
Fail to init shrreg ****.cache (errorno:9)
Fail to write shrreg ***.cache (errorno:9)
Fail to reseek shrreg ***.cache (errorno:9)
Fail to lock shrreg ***.cache (errorno:9)

@archlitchi
Contributor

@archlitchi is the usage same for vgpu-memory and vgpu-number configurations?

Yes. Can you run your task now?

@archlitchi
Contributor

@EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with volcano 1.8.2 release package.

I deployed the device plugin Facing following error Initializing ….. Fail to open shrreg ***.cache (errorno:11) Fail to init shrreg ****.cache (errorno:9) Fail to write shrreg ***.cache (errorno:9) Fail to reseek shrreg ***.cache (errorno:9) Fail to lock shrreg ***.cache (errorno:9)

The vgpu-device-plugin mounts the hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu". Please check whether the corresponding hostPath exists.

@EswarS

EswarS commented May 8, 2024

volumeMounts:
  - mountPath: /var/lib/kubelet/device-plugins
    name: device-plugin
  - mountPath: /usr/local/vgpu
    name: lib
  - mountPath: /tmp
    name: hosttmp

These are the volumes configured in the device-plugin DaemonSet.
Do I need to make any changes?

@archlitchi
Contributor

@EswarS No, I mean: after you submit a vgpu task to Volcano, please check the following (a quick check is sketched after the list):

  1. Does the corresponding folder "/tmp/vgpu/containers/{containerUID}_{ctrName}" exist on the corresponding GPU node?
  2. Does the folder "/tmp/vgpu" exist inside the vgpu-task container?
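
A minimal sketch of both checks (the pod name is a placeholder):

# 1. On the GPU node:
ls -ld /tmp/vgpu/containers/
# 2. Inside the vgpu task container:
kubectl exec <vgpu-pod> -- ls -ld /tmp/vgpu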

@AshinWu

AshinWu commented Jun 11, 2024

I have the same problem in version 1.8.1.
The graphics card is an RTX 3090 24GB and the container is set to the limit volcano.sh/vgpu-memory: '10240'. When executing nvidia-smi in the container, the card is shown with 10240MiB of memory, but in reality a process can use more than that, for example 20480MiB.
What I expect is that a process cannot use memory beyond the limit, which is 10240MiB.
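
For reference, a quick way to see what the container is advertised (a sketch; nvidia-smi only reports the advertised total, so over-allocation like the one described above only shows up when a process actually allocates past the limit):

kubectl exec <gpu-pod> -- nvidia-smi --query-gpu=memory.total,memory.used --format=csv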

@EswarS

EswarS commented Jun 12, 2024

@archlitchi,
We submitted the pod with kubectl and a pod volume mount path of emptyDir: {}, and it works. Our observation in this case is that the pod owner is root.

Do we really need to add the emptyDir: {} volume path?

When the same pod is submitted by another non-root user (a namespace user), it is not able to access the folder.
/tmp/vgpu has 777 root:root permissions at the node level.

Here I have a use case where different namespace users share the same GPU and /tmp/vgpu needs write permission. I cannot set the group to the namespace group.

Could you suggest how to handle this problem?

@archlitchi
Contributor

archlitchi commented Jun 13, 2024

@archlitchi , we submitted pod with kubectl and pod volume mount path : emptydir {}, it works. Our observation in this case is “pod owner is root.”

do we really need to add emptydir{} volume path ?

when same pod submitted by another non root user ( namespace user) , it is not able to access the folder. /tmp/vgpu has 777 root:root permission at node level.

here I have a usecase ,where different namespace users share same gpu and /tmp/vgpu needs write permission. I cannot set group as namespace group.

Could you suggest how to handle this problem.

Okay, I got it. We will try to mount it into the '/usr/local/vgpu' folder inside the container in the next version.

@EswarS

EswarS commented Jun 13, 2024

@archlitchi,
I have one more question: why can't we allocate more vgpus than physical GPUs in a single container?

@EswarS

EswarS commented Jun 13, 2024

@archlitchi , we submitted pod with kubectl and pod volume mount path : emptydir {}, it works. Our observation in this case is “pod owner is root.”
do we really need to add emptydir{} volume path ?
when same pod submitted by another non root user ( namespace user) , it is not able to access the folder. /tmp/vgpu has 777 root:root permission at node level.
here I have a usecase ,where different namespace users share same gpu and /tmp/vgpu needs write permission. I cannot set group as namespace group.
Could you suggest how to handle this problem.

okay, i got it, we will try to mount it into '/usr/local/vgpu' folder inside container in next version

Could you suggest the changes? We may try to make the build ourselves.
Are there any workarounds for the above problem?

The node-level /tmp/vgpu folder has 777 permissions.
But the pod is still not able to access it when we try to access it with an org user.
Is there any configuration we can make at the pod volume level?

@archlitchi
Contributor

Yes, you could modify the repo here (https://github.com/Project-HAMi/volcano-vgpu-device-plugin), in the file https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/pkg/plugin/vgpu/plugin.go:
line #351: change the env to 'fmt.Sprintf("/usr/local/vgpu/%v.cache", uuid.NewUUID())'
line #364: change the mountPoint to /usr/local/vgpu
Then build the image and it should work fine. Please let me know if it works.
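
As a rough sketch of the rebuild step (the registry and tag are placeholders, and the Dockerfile at the repo root is an assumption):

git clone https://github.com/Project-HAMi/volcano-vgpu-device-plugin
cd volcano-vgpu-device-plugin
# edit pkg/plugin/vgpu/plugin.go as described above
docker build -t <your-registry>/volcano-vgpu-device-plugin:custom .
docker push <your-registry>/volcano-vgpu-device-plugin:custom
# then point the device-plugin DaemonSet at the new image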

@archlitchi
Contributor

@archlitchi , I have one more question, why can’t we allocate vgpus more than physical gpus in a single container.

Suppose you have a 4-GPU node: there are only 4 GPU devices in your /dev/ folder, and we can't mount a non-existent GPU from /dev/ into the container.
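
For illustration, on a 4-GPU node the device files that can be mounted into a container look roughly like this (a sketch; exact entries vary with the driver setup):

ls /dev/nvidia*
# /dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3  /dev/nvidiactl  /dev/nvidia-uvm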

@AshinWu

AshinWu commented Jun 14, 2024

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

@archlitchi
Contributor

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

Yes, try Volcano v1.9 with this repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin

@EswarS

EswarS commented Jun 14, 2024

yes, you could modify the repo here (https://github.com/Project-HAMi/volcano-vgpu-device-plugin), in file https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/pkg/plugin/vgpu/plugin.go,
line #351 change the env to 'fmt.Sprintf("/usr/local/vgpu/%v.cache", uuid.NewUUID())'
line #364 change the mountPoint to /use/local/vgpu
and build the image ,it should be working fine, plz let me know if it works

It is not working. Is there any possibility to interact with you directly to solve this problem?

@AshinWu

AshinWu commented Jun 14, 2024

Is there a solution currently for the issue of 'vgpu not restricting memory in the container'?

yes, try volcano v1.9 with this repo https://github.com/Project-HAMi/volcano-vgpu-device-plugin

Thanks, I will go and try.

@archlitchi
Contributor

@EswarS do you have a WeChat account? You can add my WeChat ID "xuanzong4493".

@EswarS

EswarS commented Jun 14, 2024

WeChat is not accessible for me. Is there any other app that would work: WhatsApp, Telegram, Facebook, LinkedIn, Zoom, etc.?

@archlitchi
Contributor

WeChat is not accessible, is there any other app helps WhatsApp, telegram, Fb, linkdin,zoom ..etc

I just registered a LinkedIn account, linkedin.com/in/mengxuan-li-5862a9314, try this.

@EswarS

EswarS commented Jun 19, 2024

@archlitchi, is it possible to mount on a network file system, including the vgpu cache file and vgpu lock?
Node-level permissions are controlled and writing is not allowed (SELinux enabled).

@EswarS

EswarS commented Jun 19, 2024

WeChat is not accessible, is there any other app helps WhatsApp, telegram, Fb, linkdin,zoom ..etc

i just registered a linkdin account, linkedin.com/in/mengxuan-li-5862a9314, try this

Could you please let me know your available time so that we can schedule a meeting?

@kunal642
Author

@archlitchi is there a way to check how many cores are allocated in the container? If we configure 50% of the cores, then we want to make sure that only 50% is allocated.
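
One rough way to inspect this is to look at the limit environment variables the device plugin injects, next to the memory limit shown earlier in this thread (a sketch; the exact variable name for the core/SM limit depends on the plugin version):

kubectl exec <gpu-pod> -- env | grep CUDA_DEVICE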

@Monokaix
Member

@archlitchi is there a way to check how many cores are allocated in the container? if we configure 50% cores, then we want to make sure that only 50% is allocated

GPU core limiting is supported in the latest version; is a percentage also supported? @archlitchi

@Monokaix
Member

Any progress here? @kunal642 Is your problem solved?
