untraceable GPU memory allocation #5
Comments
The GPU memory stores not only data but also code and state. The NVIDIA API of
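As an illustration of this point, the following sketch (not code from this project; it assumes the CUDA driver API and at least one visible device) shows that merely creating a context consumes GPU memory that no cuMemAlloc call accounts for:

```c
/* Sketch: observe that creating a CUDA context consumes GPU memory
 * even though no cuMemAlloc has been issued.
 * Build with: gcc ctx_overhead.c -o ctx_overhead -lcuda
 */
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext primary, ctx;
    size_t free_before, free_after, total;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* cuMemGetInfo needs a current context, so take the "before" reading
     * on the primary context, then create a second context. */
    cuDevicePrimaryCtxRetain(&primary, dev);
    cuCtxSetCurrent(primary);
    cuMemGetInfo(&free_before, &total);

    cuCtxCreate(&ctx, 0, dev);          /* allocates driver-internal memory */
    cuMemGetInfo(&free_after, &total);

    printf("context overhead: %zu bytes\n", free_before - free_after);

    cuCtxDestroy(ctx);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

The difference reported here is driver-internal context state, which is why it cannot be traced through the Memory Management API.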
@mYmNeo Thank you so much for addressing the question. I hope your fix will overcome the benchmark issue and be made public to the community.
Hi @mYmNeo, I also found this problem when I tried it in my own small project:
Can you provide the driver APIs that your program uses?
@mYmNeo I'm using the tensorflow/tensorflow:latest-gpu-py3 Docker image, which comes with
@mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load its libraries, setting the LD_PRELOAD environment variable and replacing symbols does not work.
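A simplified sketch of that replacement approach is shown below (the driver path and the two forwarded symbols are assumptions; an actual shim has to forward every symbol exported by the real libcuda.so.1):

```c
/* Sketch of a replacement libcuda.so.1 that forwards to the real driver
 * library. Only cuInit and cuMemAlloc_v2 are shown; the path to the
 * renamed vendor library is an assumption.
 * Build with: gcc -shared -fPIC -o libcuda.so.1 shim.c -ldl
 */
#include <dlfcn.h>
#include <stddef.h>

typedef int CUresult_t;                 /* stand-in for CUresult */
typedef unsigned long long CUdeviceptr_t;

static void *real_libcuda(void) {
    static void *handle;
    if (!handle)
        handle = dlopen("/usr/lib/x86_64-linux-gnu/libcuda.so.1.real",
                        RTLD_NOW | RTLD_LOCAL);
    return handle;
}

CUresult_t cuInit(unsigned int flags) {
    CUresult_t (*real)(unsigned int) =
        (CUresult_t (*)(unsigned int))dlsym(real_libcuda(), "cuInit");
    return real(flags);
}

CUresult_t cuMemAlloc_v2(CUdeviceptr_t *dptr, size_t bytesize) {
    /* A memory limiter would check and record bytesize here before forwarding. */
    CUresult_t (*real)(CUdeviceptr_t *, size_t) =
        (CUresult_t (*)(CUdeviceptr_t *, size_t))dlsym(real_libcuda(), "cuMemAlloc_v2");
    return real(dptr, bytesize);
}
```

Because TensorFlow loads the driver with dlopen and resolves symbols with dlsym on that handle, LD_PRELOAD-based interposition is bypassed, so the shim has to be installed under the libcuda.so.1 name itself.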
I came across a similar issue when I try to run 2 trainer workers on the same vGPU (this only happens when the requested CUDA cores are not 100%). Is this problem caused by the same reason, and is there any plan for a release, or even a pre-release, of this fix?

(pid=26034) create tf session
(pid=26030) create tf session
(pid=26034) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26034) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26034) PC: @ 0x0 (unknown)
(pid=26034) *** SIGABRT (@0x65b2) received by PID 26034 (TID 0x7f91c388d740) from PID 26034; stack trace: ***
(pid=26034) @ 0x7f91c346a8a0 (unknown)
(pid=26034) @ 0x7f91c30a5f47 gsignal
(pid=26034) @ 0x7f91c30a78b1 abort
(pid=26034) @ 0x7f91c1cb1441 google::LogMessage::Flush()
(pid=26034) @ 0x7f91c1cb1511 google::LogMessage::~LogMessage()
(pid=26034) @ 0x7f91c1c8ede9 ray::RayLog::~RayLog()
(pid=26034) @ 0x7f91c19f57c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034) @ 0x7f91c19f581a std::unique_ptr<>::~unique_ptr()
(pid=26034) @ 0x7f91c30aa0f1 (unknown)
(pid=26034) @ 0x7f91c30aa1ea exit
(pid=26034) @ 0x7f7e0c2ff497 initialization
(pid=26034) @ 0x7f91c3467827 __pthread_once_slow
(pid=26034) @ 0x7f7e0c300e3b cuInit
(pid=26030) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26030) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26030) PC: @ 0x0 (unknown)
(pid=26030) *** SIGABRT (@0x65ae) received by PID 26030 (TID 0x7eff49098740) from PID 26030; stack trace: ***
(pid=26030) @ 0x7eff48c758a0 (unknown)
(pid=26030) @ 0x7eff488b0f47 gsignal
(pid=26030) @ 0x7eff488b28b1 abort
(pid=26030) @ 0x7eff474bc441 google::LogMessage::Flush()
(pid=26030) @ 0x7eff474bc511 google::LogMessage::~LogMessage()
(pid=26030) @ 0x7eff47499de9 ray::RayLog::~RayLog()
(pid=26034) @ 0x7f7e97d55da0 cuInit
(pid=26030) @ 0x7eff472007c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034) @ 0x7f7e97c8f19f stream_executor::gpu::(anonymous namespace)::InternalInit()
(pid=26030) @ 0x7eff4720081a std::unique_ptr<>::~unique_ptr()
(pid=26030) @ 0x7eff488b50f1 (unknown)
Your problem is not the memory allocation. The log shows that cuInit failed with "no CUDA-capable device is detected".
Yes, but when I run only one trainer (one PID) or set the requested CUDA cores to 100%, this case runs normally. I think this error log may not exactly describe the root cause.
Did you try running 2 trainers on a single card? Did any error occur?
Yes, the error shows that cuInit fails with "no CUDA-capable device is detected". For more details you could contact me via WeChat: nlnjnj
@nlnjnj I had similar errors before. However, I believe such an issue may be caused merely by hijacking the CUDA API. You might run a test by
Hi, did you replace libcuda.so.1 for TensorFlow successfully? If so, can you share how you did it? Thanks!
Describe the bug
When I was testing Triton Inference Server 19.10, GPU memory usage increased when the following two functions were called:
It seems that when a CUDA module is loaded, some data is transferred into GPU memory without any of the function calls described under Memory Management.
Although any subsequent cuMemAlloc call will be blocked once the untraceable GPU memory allocation has surpassed the limit set by the user, it still seems a flaw that the actual GPU memory usage may exceed that limit.
Environment
OS: Linux kube-node-zw 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
GPU Info: NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2