Low CPU and GPU utilization when running in the gpu mode. #15

wesley-stone · 2021-02-14T14:04:24Z

Here is my script:
python run.py --adhoc --cfg conf/c02_selfplay/liars_sp.yaml env.num_dice=1 env.num_faces=4 env.subgame_params.use_cfr=true selfplay.cpu_gen_threads=0 selfplay.threads_per_gpu=16

My computer configuration is 2 2080ti + 2 CPUs with 24 threads per CPU.
The log looks normal: it is just collecting experience and training. However, CPU utilization is only 5% and GPU utilization is 6%. Is this normal? And could you tell me how I can speed up the process? Thanks a lot!

Drazcmd · 2024-01-31T07:45:48Z

EDIT: ignore the first part of this, I misread 2 2080ti as "2080ti", i.e. 1 x 2080ti (sorry for the confusion!). That said, the other part about reserving one of them for the model might be relevant?

TL;DR: Based on the README, some of the closed issues, and parts of cfvpy/selfplay.py, I think the intended use case for a non-zero selfplay.threads_per_gpu with selfplay.cpu_gen_threads=0 is actually when you have at least TWO GPUs.

(And, if you only have one GPU, that's what the CPU-based data generation, i.e. setting selfplay.cpu_gen_threads=60 in the example, is for)

Having said all that, some comments by the authors do seem to suggest that it's probably OK to modify the relevant selfplay.py code to also use GPU 0 for data generation (see https://github.com/facebookresearch/rebel/blob/master/cfvpy/selfplay.py#L193). So it might be worth trying that out.

Longer explanation: I'm not 100% certain about any of this (I'm a complete newbie to CUDA / cuDNN), but after reading through some of the closed issues + the code, I think:

- The 'gpu mode' they talk about is currently intended for two or more GPUs. Specifically, if I understand their comments correctly, the code path you enter in that situation uses the 0th GPU for the model, then uses all remaining GPUs for the expensive "data generation" part that's actually running CFR. That said, based on their comments, it also seems pretty easy to change the code to NOT do that (note: I'm not sure of the consequences of doing so).
- The 'cpu mode' (setting cpu threads) is meant for when you have exactly one GPU: it reserves your entire GPU for the model (maybe? not sure!) and then uses your CPU for data generation. But since the data generation is the expensive part, this will probably be very slow.
- (If you don't have a CUDA-compatible GPU at all, I think you might just not be able to run this? Again, not certain!)
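To make the two modes a bit more concrete, here's a rough sketch of the device split as I understand it. This is NOT the actual selfplay.py code: `plan_devices` and the keys it returns are names I made up for illustration, and the "GPU 0 for the model, remaining GPUs for data generation" rule is my reading of the authors' comments, not something I've verified.

```python
from typing import Dict, List, Union


def plan_devices(
    num_gpus: int, cpu_gen_threads: int, threads_per_gpu: int
) -> Dict[str, Union[str, List[str], int]]:
    """Hypothetical illustration of the GPU/CPU split described above."""
    if num_gpus < 1:
        raise ValueError("need at least one CUDA GPU for the model")

    # Assumption: GPU 0 is always reserved for the model.
    train_device = "cuda:0"

    # Assumption: "gpu mode" generates data only on the remaining GPUs.
    gen_devices = (
        [f"cuda:{i}" for i in range(1, num_gpus)] if threads_per_gpu > 0 else []
    )

    return {
        "train_device": train_device,
        "gen_devices": gen_devices,
        "gpu_gen_threads": threads_per_gpu * len(gen_devices),
        "cpu_gen_threads": cpu_gen_threads,
    }


if __name__ == "__main__":
    # Two GPUs, gpu mode: only cuda:1 would generate data.
    print(plan_devices(num_gpus=2, cpu_gen_threads=0, threads_per_gpu=16))
    # One GPU, cpu mode: the GPU holds the model, the CPU generates data.
    print(plan_devices(num_gpus=1, cpu_gen_threads=60, threads_per_gpu=0))
```

If that reading is right, a single-GPU machine with selfplay.cpu_gen_threads=0 would end up with no data generation threads at all, and a two-GPU machine would generate only on the second GPU.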

On a side note, for anyone else poking around this stuff in 2024, here's a couple quick things I've noticed:

- The current selfplay.py file has a weird assert that I believe causes it to crash if there aren't >= 2 CUDA-compatible GPUs on the system, regardless of whether you set selfplay.cpu_gen_threads. Pretty sure that's a bug: it should only assert >= 1 GPU at that spot, and assert >= 2 only in the "gpu" code path below?
- A bunch of the install requirements are outdated and there are some strange bugs. Working with a relatively fresh Ubuntu 18.04 Pro install, I had to:
  1. install the last cuDNN 7.x (cuDNN 8.x had problems, I think?) and a corresponding version of CUDA (making sure CUDA was specifically in /usr/local/cuda, since otherwise one of the pytorch 14 dependencies couldn't find it),
  2. use specifically Python 3.7,
  3. temporarily remove all of the test files from liars_dice (since there was some weird issue with types being incompatible?... still need to figure out what exactly was going on there).
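For the assert issue, the relaxed check I have in mind would look roughly like the sketch below. `check_gpu_requirements` is a made-up name, not something in the repo, and it takes the GPU count as a plain argument (in practice you'd presumably pass in `torch.cuda.device_count()`). I haven't tested this against the real code path.

```python
def check_gpu_requirements(num_gpus: int, threads_per_gpu: int) -> None:
    """Hypothetical replacement for the unconditional ">= 2 GPUs" assert."""
    # The model always needs one GPU, whichever mode generates the data.
    if num_gpus < 1:
        raise RuntimeError("at least one CUDA GPU is required for the model")

    # Only the "gpu" data-generation path needs a second GPU
    # (assumption: GPU 0 stays reserved for the model).
    if threads_per_gpu > 0 and num_gpus < 2:
        raise RuntimeError(
            "selfplay.threads_per_gpu > 0 needs at least two GPUs; "
            "use selfplay.cpu_gen_threads on a single-GPU machine"
        )


if __name__ == "__main__":
    check_gpu_requirements(num_gpus=1, threads_per_gpu=0)   # cpu mode: passes
    check_gpu_requirements(num_gpus=2, threads_per_gpu=16)  # gpu mode: passes
```

With a check like this, a one-GPU machine that only sets selfplay.cpu_gen_threads would get past this point, while one-GPU setups that ask for GPU data generation would fail with an explicit message.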
