untraceable GPU memory allocation #5
Comments
The GPU memory stores not only data but also code and state. The NVIDIA API of
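As an illustration of this point, the following sketch (not code from this project; it assumes the CUDA driver API and at least one visible device) shows that merely creating a context consumes GPU memory that no cuMemAlloc call accounts for:

```c
/* Sketch: observe that creating a CUDA context consumes GPU memory
 * even though no cuMemAlloc has been issued.
 * Build with: gcc ctx_overhead.c -o ctx_overhead -lcuda
 */
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext primary, ctx;
    size_t free_before, free_after, total;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* cuMemGetInfo needs a current context, so take the "before" reading
     * on the primary context, then create a second context. */
    cuDevicePrimaryCtxRetain(&primary, dev);
    cuCtxSetCurrent(primary);
    cuMemGetInfo(&free_before, &total);

    cuCtxCreate(&ctx, 0, dev);          /* allocates driver-internal memory */
    cuMemGetInfo(&free_after, &total);

    printf("context overhead: %zu bytes\n", free_before - free_after);

    cuCtxDestroy(ctx);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

The difference reported here is driver-internal context state, which is why it cannot be traced through the Memory Management API.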
@mYmNeo Thank you so much for addressing the question. I hope your fix will overcome the benchmark issue and be made public to the community.
Hi @mYmNeo, I also found this problem when I tried it in my own small project:
Can you provide the driver APIs that your program uses?
@mYmNeo I'm using the tensorflow/tensorflow:latest-gpu-py3 Docker image, which comes with
@mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load its libraries, setting the LD_PRELOAD environment variable and replacing symbols does not work.
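A simplified sketch of that replacement approach is shown below (the driver path and the two forwarded symbols are assumptions; an actual shim has to forward every symbol exported by the real libcuda.so.1):

```c
/* Sketch of a replacement libcuda.so.1 that forwards to the real driver
 * library. Only cuInit and cuMemAlloc_v2 are shown; the path to the
 * renamed vendor library is an assumption.
 * Build with: gcc -shared -fPIC -o libcuda.so.1 shim.c -ldl
 */
#include <dlfcn.h>
#include <stddef.h>

typedef int CUresult_t;                 /* stand-in for CUresult */
typedef unsigned long long CUdeviceptr_t;

static void *real_libcuda(void) {
    static void *handle;
    if (!handle)
        handle = dlopen("/usr/lib/x86_64-linux-gnu/libcuda.so.1.real",
                        RTLD_NOW | RTLD_LOCAL);
    return handle;
}

CUresult_t cuInit(unsigned int flags) {
    CUresult_t (*real)(unsigned int) =
        (CUresult_t (*)(unsigned int))dlsym(real_libcuda(), "cuInit");
    return real(flags);
}

CUresult_t cuMemAlloc_v2(CUdeviceptr_t *dptr, size_t bytesize) {
    /* A memory limiter would check and record bytesize here before forwarding. */
    CUresult_t (*real)(CUdeviceptr_t *, size_t) =
        (CUresult_t (*)(CUdeviceptr_t *, size_t))dlsym(real_libcuda(), "cuMemAlloc_v2");
    return real(dptr, bytesize);
}
```

Because TensorFlow loads the driver with dlopen and resolves symbols with dlsym on that handle, LD_PRELOAD-based interposition is bypassed, so the shim has to be installed under the libcuda.so.1 name itself.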
I came across a similar issue when I try to run 2 trainer workers on the same vGPU (this only happens when the requested CUDA cores are not 100%). Is this problem caused by the same reason, and is there any plan for a release, or even a pre-release, of this fix?

(pid=26034) create tf session
(pid=26030) create tf session
(pid=26034) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26034) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26034) PC: @ 0x0 (unknown)
(pid=26034) *** SIGABRT (@0x65b2) received by PID 26034 (TID 0x7f91c388d740) from PID 26034; stack trace: ***
(pid=26034) @ 0x7f91c346a8a0 (unknown)
(pid=26034) @ 0x7f91c30a5f47 gsignal
(pid=26034) @ 0x7f91c30a78b1 abort
(pid=26034) @ 0x7f91c1cb1441 google::LogMessage::Flush()
(pid=26034) @ 0x7f91c1cb1511 google::LogMessage::~LogMessage()
(pid=26034) @ 0x7f91c1c8ede9 ray::RayLog::~RayLog()
(pid=26034) @ 0x7f91c19f57c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034) @ 0x7f91c19f581a std::unique_ptr<>::~unique_ptr()
(pid=26034) @ 0x7f91c30aa0f1 (unknown)
(pid=26034) @ 0x7f91c30aa1ea exit
(pid=26034) @ 0x7f7e0c2ff497 initialization
(pid=26034) @ 0x7f91c3467827 __pthread_once_slow
(pid=26034) @ 0x7f7e0c300e3b cuInit
(pid=26030) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26030) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26030) PC: @ 0x0 (unknown)
(pid=26030) *** SIGABRT (@0x65ae) received by PID 26030 (TID 0x7eff49098740) from PID 26030; stack trace: ***
(pid=26030) @ 0x7eff48c758a0 (unknown)
(pid=26030) @ 0x7eff488b0f47 gsignal
(pid=26030) @ 0x7eff488b28b1 abort
(pid=26030) @ 0x7eff474bc441 google::LogMessage::Flush()
(pid=26030) @ 0x7eff474bc511 google::LogMessage::~LogMessage()
(pid=26030) @ 0x7eff47499de9 ray::RayLog::~RayLog()
(pid=26034) @ 0x7f7e97d55da0 cuInit
(pid=26030) @ 0x7eff472007c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034) @ 0x7f7e97c8f19f stream_executor::gpu::(anonymous namespace)::InternalInit()
(pid=26030) @ 0x7eff4720081a std::unique_ptr<>::~unique_ptr()
(pid=26030) @ 0x7eff488b50f1 (unknown)
Your problem is not the memory allocation. The log shows that cuInit failed with "no CUDA-capable device is detected".
Yes, but when I run only one trainer (one PID) or set the requested CUDA cores to 100%, this case runs normally. I think this error log may not exactly describe the root cause.
Did you try running 2 trainers on a single card? Did any error occur?
Yes, the error shows that cuInit fails with "no CUDA-capable device is detected". For more details you could contact me via WeChat: nlnjnj
@nlnjnj I had similar errors before. However, I believe such an issue may be caused merely by hijacking the CUDA API. You might run a test by
Hi, did you replace libcuda.so.1 for TensorFlow successfully? If so, can you share how you did it? Thanks!
Describe the bug
When I was testing Triton Inference Server 19.10, GPU memory usage increased when the following two functions were called:
It seems that when a CUDA module is loaded, some data is transferred into GPU memory without any of the function calls described under Memory Management.
Although any subsequent cuMemAlloc call will be blocked once the untraceable GPU memory allocation has surpassed the limit set by the user, it still seems a flaw that the actual GPU memory usage may exceed that limit.
Environment
OS: Linux kube-node-zw 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
GPU Info: NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2