untraceable GPU memory allocation #5

Open
zw0610 opened this issue Apr 2, 2020 · 13 comments

Comments

@zw0610

zw0610 commented Apr 2, 2020

Describe the bug

While testing Triton Inference Server 19.10, I noticed that GPU memory usage increases when the following two functions are called:

  1. cuCtxGetCurrent
  2. cuModuleGetFunction

It seems that when a CUDA module is loaded, some data is transferred into GPU memory without any of the function calls described under Memory Management being made.

Although any subsequent cuMemAlloc call will be rejected once this untraceable GPU memory allocation has already surpassed the limit set by the user, it still seems a flaw that actual GPU memory usage may exceed the limit.
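
Not part of the original report, but a minimal sketch of how this can be observed with the driver API alone ("kernel.ptx" and "my_kernel" below are hypothetical placeholders): cuMemGetInfo is sampled after context creation and module loading, with no Memory Management calls in between, and the reported used memory still grows.

```c
/* untracked_mem.c: observe GPU memory growth that happens without cuMemAlloc.
 * "kernel.ptx" and "my_kernel" are placeholder names for this sketch.
 * Build: gcc untracked_mem.c -o untracked_mem -lcuda
 */
#include <cuda.h>
#include <stdio.h>

static void report(const char *stage) {
    size_t free_b = 0, total_b = 0;
    cuMemGetInfo(&free_b, &total_b);      /* free/total device memory in bytes */
    printf("%-28s used = %zu MiB\n", stage, (total_b - free_b) >> 20);
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);            /* the context alone reserves device memory */
    report("after cuCtxCreate");

    cuModuleLoad(&mod, "kernel.ptx");     /* module code/constants also land in GPU memory */
    report("after cuModuleLoad");

    cuModuleGetFunction(&fn, mod, "my_kernel");
    report("after cuModuleGetFunction");

    /* No cuMemAlloc was called, yet the reported usage grows at each stage. */
    cuCtxDestroy(ctx);
    return 0;
}
```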

Environment
OS: Linux kube-node-zw 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

GPU Info: NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2

@mYmNeo
Contributor

mYmNeo commented Apr 3, 2020

GPU memory stores not only data but also code and state. The NVIDIA Memory Management API only covers the data part. If you dig deeper, you'll find that GPU memory usage increases right after the cuInit call. We do have a solution for this scenario, but its benchmark results are not very good, so the code hasn't been submitted to this repo.
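
Not from the thread itself, but one way to see the full per-process footprint (context, code, and state included) is to ask NVML, which reports what nvidia-smi shows rather than only what intercepted allocation calls add up to. A minimal sketch, assuming device 0 and at most 64 compute processes:

```c
/* nvml_procs.c: print actual per-process GPU memory usage, including the
 * context/module overhead that Memory Management hooks never see.
 * Build: gcc nvml_procs.c -o nvml_procs -lnvidia-ml
 */
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    nvmlProcessInfo_t procs[64];
    unsigned int count = 64;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

    if (nvmlDeviceGetComputeRunningProcesses(dev, &count, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i)
            printf("pid %u uses %llu MiB\n", procs[i].pid,
                   procs[i].usedGpuMemory >> 20);   /* usedGpuMemory is in bytes */
    }
    nvmlShutdown();
    return 0;
}
```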

@zw0610
Author

zw0610 commented Apr 7, 2020

@mYmNeo Thank you so much for answering the question. I hope your solution overcomes the benchmark issue and becomes available to the community.

@hyc3z

hyc3z commented May 20, 2020

Hi @mYmNeo, I also ran into this problem in my own small project:
https://github.com/hyc3z/cuda-w-mem-watcher
I set the limit to 2147483648 bytes, which is exactly 2 GB.
However, when I watch nvidia-smi on the real host while running the TensorFlow samples, the process uses more than 2.5 GB before OOM is triggered by returning CUDA_ERROR_OUT_OF_MEMORY.
I tried setting the limit to 1 GB, and usage was still about 500 MB over.
Then I tried disallowing any allocation through the memory driver API at all; after some initialization steps, the process had still consumed about 250 MB of memory before going down.
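
To make the overshoot concrete, here is a hedged sketch of the accounting a wrapped cuMemAlloc typically does (this is not the code of either project; the renamed driver path and the 2 GiB constant are made up). Only explicit allocations are summed, so the context and module memory visible in nvidia-smi sits on top of whatever limit is enforced here:

```c
/* Hypothetical interposer logic, built as a replacement libcuda.so.1:
 * only explicit allocations are counted, so actual device usage ends up
 * being tracked bytes + context/module overhead (untracked). */
#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stddef.h>

typedef CUresult (*cuMemAlloc_t)(CUdeviceptr *, size_t);

static size_t g_used  = 0;                 /* bytes handed out through cuMemAlloc */
static size_t g_limit = (size_t)2 << 30;   /* 2 GiB, matching the limit mentioned above */

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    static cuMemAlloc_t real = NULL;
    if (!real) {
        /* "libcuda.so.1.real" is a made-up name for the renamed real driver */
        void *h = dlopen("libcuda.so.1.real", RTLD_NOW | RTLD_LOCAL);
        if (!h) return CUDA_ERROR_NOT_INITIALIZED;
        real = (cuMemAlloc_t)dlsym(h, "cuMemAlloc");
        if (!real) return CUDA_ERROR_NOT_INITIALIZED;
    }

    if (g_used + bytesize > g_limit)
        return CUDA_ERROR_OUT_OF_MEMORY;   /* limit enforced only against tracked bytes */

    CUresult rc = real(dptr, bytesize);
    if (rc == CUDA_SUCCESS)
        g_used += bytesize;                /* context + module memory never passes through here */
    return rc;
}
```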

@mYmNeo
Contributor

mYmNeo commented May 20, 2020

https://github.com/hyc3z/cuda-w-mem-watcher

Can you list the driver APIs that your program uses?

@hyc3z

hyc3z commented May 20, 2020

@mYmNeo I'm using the tensorflow/tensorflow:latest-gpu-py3 Docker image, which comes with
Python 3.6.9 and tensorflow-gpu 2.1.0.
The test script I use is
https://github.com/tensorflow/benchmarks

@hyc3z

hyc3z commented May 20, 2020

@mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load the library, setting LD_PRELOAD and replacing symbols does not work.
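
As a hedged illustration of why LD_PRELOAD is bypassed here (the snippet mirrors the general dlopen pattern, not TensorFlow's actual loader code): symbols resolved with dlsym against an explicitly opened handle come from that library, so a preloaded wrapper is never consulted, whereas swapping the libcuda.so.1 file that gets opened does take effect.

```c
/* dlopen_lookup.c: why a preloaded wrapper is bypassed by explicit dlopen/dlsym.
 * Build: gcc dlopen_lookup.c -o dlopen_lookup -ldl
 */
#include <dlfcn.h>
#include <stdio.h>

/* CUresult is an int-sized enum; using int keeps this sketch free of cuda.h */
typedef int (*cuInit_t)(unsigned int);

int main(void) {
    /* The framework opens the driver itself instead of linking against it... */
    void *h = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
    if (!h) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    /* ...and resolves symbols from that handle, so the lookup never sees an
     * LD_PRELOADed cuInit. Replacing the libcuda.so.1 file on disk (or bind
     * mounting a wrapper over it) is what actually gets intercepted. */
    cuInit_t my_cuInit = (cuInit_t)dlsym(h, "cuInit");
    if (my_cuInit)
        printf("cuInit resolved from dlopen'd handle: rc=%d\n", my_cuInit(0));

    dlclose(h);
    return 0;
}
```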

@nlnjnj

nlnjnj commented Sep 3, 2020

@mYmNeo Regarding your explanation above: I came across a similar issue when I try to run 2 trainer workers on the same vGPU (it only happens when the requested CUDA cores are not 100%). Is this caused by the same reason, and is there any plan to release or even pre-release that solution?

(pid=26034) create tf session
(pid=26030) create tf session
(pid=26034) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26034) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26034) PC: @                0x0 (unknown)
(pid=26034) *** SIGABRT (@0x65b2) received by PID 26034 (TID 0x7f91c388d740) from PID 26034; stack trace: ***
(pid=26034)     @     0x7f91c346a8a0 (unknown)
(pid=26034)     @     0x7f91c30a5f47 gsignal
(pid=26034)     @     0x7f91c30a78b1 abort
(pid=26034)     @     0x7f91c1cb1441 google::LogMessage::Flush()
(pid=26034)     @     0x7f91c1cb1511 google::LogMessage::~LogMessage()
(pid=26034)     @     0x7f91c1c8ede9 ray::RayLog::~RayLog()
(pid=26034)     @     0x7f91c19f57c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034)     @     0x7f91c19f581a std::unique_ptr<>::~unique_ptr()
(pid=26034)     @     0x7f91c30aa0f1 (unknown)
(pid=26034)     @     0x7f91c30aa1ea exit
(pid=26034)     @     0x7f7e0c2ff497 initialization
(pid=26034)     @     0x7f91c3467827 __pthread_once_slow
(pid=26034)     @     0x7f7e0c300e3b cuInit
(pid=26030) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26030) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26030) PC: @                0x0 (unknown)
(pid=26030) *** SIGABRT (@0x65ae) received by PID 26030 (TID 0x7eff49098740) from PID 26030; stack trace: ***
(pid=26030)     @     0x7eff48c758a0 (unknown)
(pid=26030)     @     0x7eff488b0f47 gsignal
(pid=26030)     @     0x7eff488b28b1 abort
(pid=26030)     @     0x7eff474bc441 google::LogMessage::Flush()
(pid=26030)     @     0x7eff474bc511 google::LogMessage::~LogMessage()
(pid=26030)     @     0x7eff47499de9 ray::RayLog::~RayLog()
(pid=26034)     @     0x7f7e97d55da0 cuInit
(pid=26030)     @     0x7eff472007c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034)     @     0x7f7e97c8f19f stream_executor::gpu::(anonymous namespace)::InternalInit()
(pid=26030)     @     0x7eff4720081a std::unique_ptr<>::~unique_ptr()
(pid=26030)     @     0x7eff488b50f1 (unknown)

@mYmNeo
Contributor

mYmNeo commented Sep 4, 2020

Your problem is not the memory allocation. The log shows cuInit failing with "no CUDA-capable device is detected".

@nlnjnj

nlnjnj commented Sep 4, 2020

Yes, but when I run only one trainer (one pid), or set the requested CUDA cores to 100%, the case runs normally, so I think this error log may not describe the actual root cause.

@mYmNeo
Contributor

mYmNeo commented Sep 4, 2020

Did you try running 2 trainers on one single card? Did any error occur?

@nlnjnj

nlnjnj commented Sep 4, 2020

Yes, the error shows cuInit error no CUDA-capable device is detected, and recently I have hit this error even when running only one trainer.

For more details you can contact me via WeChat: nlnjnj

@zw0610
Author

zw0610 commented Sep 4, 2020

@nlnjnj I had similar errors before. However, I believe this issue may be caused merely by hijacking the CUDA API. You could run a test that cuMemAllocs only a small piece of data, so the program stays well below the memory limit; in my experience, the error would still occur.
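
A hedged sketch of such a test (a tiny standalone program, not from this repo; the 1 MiB size is arbitrary): initialize the driver, allocate a single small buffer far below any limit, and print the driver's error string for whichever call fails, to separate a limit-related failure from the cuInit failure seen in the log above.

```c
/* small_alloc_test.c: allocate one small buffer to see whether the failure is
 * the memory limit or cuInit itself.
 * Build: gcc small_alloc_test.c -o small_alloc_test -lcuda
 */
#include <cuda.h>
#include <stdio.h>

static void check(const char *what, CUresult rc) {
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
    }
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr buf;

    check("cuInit", cuInit(0));
    check("cuDeviceGet", cuDeviceGet(&dev, 0));
    check("cuCtxCreate", cuCtxCreate(&ctx, 0, dev));
    check("cuMemAlloc", cuMemAlloc(&buf, 1 << 20));   /* 1 MiB, far below any limit */
    puts("small allocation test done");
    return 0;
}
```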

@Huoyuan100861

> @mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load the library, setting LD_PRELOAD and replacing symbols does not work.

Hi, did you replace libcuda.so.1 for TensorFlow successfully? If so, can you share how you did it? Thanks!
