
GPU memory usage VS GPU utilization #348

Open
kirilllzaitsev opened this issue May 15, 2023 · 3 comments

Comments

@kirilllzaitsev

Hi, I observe the following system metrics:

[Screenshot: system metrics showing GPU memory usage near 100% of 24 GB while GPU utilization stays around 25%]

Since the memory is nearly full, I expect the GPU to be highly utilized. What is the correct intuition for this metric being that low? Does the ImagenTrainer that I use put lots of objects on the GPU that go unused during training?
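For reference, both numbers in the screenshot can be read programmatically. A minimal sketch, assuming PyTorch (`torch.cuda.utilization()` needs the `pynvml` package installed; all names here are illustrative, not from the original post):

```python
# Read the two metrics from the screenshot: compute utilization and memory.
# Requires a CUDA device and the pynvml package for utilization().
import torch

device = torch.device("cuda:0")
util = torch.cuda.utilization(device)                    # % of time kernels were running
used_gb = torch.cuda.memory_allocated(device) / 1024**3  # memory held by live tensors
total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
print(f"utilization: {util}%  memory: {used_gb:.1f}/{total_gb:.1f} GB")
```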

@TheFusion21
Contributor

Possible causes of the problem (a timing sketch follows this list):

  1. Loading of batches is slow
  2. The model is too small to utilize the entire GPU (increase the batch size)
  3. Some other bottleneck (CPU, PCIe link, etc.)
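A simple way to tell which of these applies is to time the data-loading wait separately from the GPU step. The sketch below uses dummy stand-ins (`dataset`, `model`, `optimizer`), not anything from ImagenTrainer; if `data` time dominates `step` time, cause 1 is the culprit:

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins; substitute your own dataset and model.
dataset = TensorDataset(torch.randn(512, 3, 64, 64), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=32, num_workers=2)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

end = time.perf_counter()
for images, labels in loader:
    data_time = time.perf_counter() - end        # time spent waiting on the loader

    images, labels = images.cuda(non_blocking=True), labels.cuda(non_blocking=True)
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                     # finish GPU work before timing
    step_time = time.perf_counter() - end - data_time

    print(f"data: {data_time * 1e3:.1f} ms  step: {step_time * 1e3:.1f} ms")
    end = time.perf_counter()
```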

@kirilllzaitsev
Author

@TheFusion21, thank you for the suggestions. But I still can't explain why GPU utilization stays at ~25% while GPU memory (which blocks a larger batch size, model, etc., due to out-of-memory errors) sits at almost 100% of the available 24 GB.
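One likely piece of the intuition, assuming PyTorch (which ImagenTrainer is built on): the memory figure reported by tools like `nvidia-smi` includes blocks held by PyTorch's caching allocator, so it can sit near 100% even while the compute units are mostly idle. Memory usage measures allocation, not activity. A small sketch:

```python
# Memory "used" as seen by the driver includes blocks cached by PyTorch's
# allocator, so near-full memory says nothing about how busy the GPU is.
import torch

x = torch.randn(4096, 4096, device="cuda")     # allocate a large tensor
print(torch.cuda.memory_allocated() / 1e6)     # bytes in live tensors
print(torch.cuda.memory_reserved() / 1e6)      # bytes held by the caching allocator

del x                                          # tensor is gone...
print(torch.cuda.memory_allocated() / 1e6)     # ...allocated drops to ~0
print(torch.cuda.memory_reserved() / 1e6)      # ...but reserved (what nvidia-smi sees) stays high

torch.cuda.empty_cache()                       # return cached blocks to the driver
print(torch.cuda.memory_reserved() / 1e6)
```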

@FriedRonaldo

In most cases, the major bottleneck is the data loader. If your input images are too large or require complex processing during the training phase, the GPU has to wait for the CPU to finish each batch.
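If the loader is indeed the bottleneck, the usual first step is to parallelize it. A sketch of common `torch.utils.data.DataLoader` knobs, with a dummy dataset standing in for the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 64, 64))  # placeholder dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # decode/augment in parallel CPU processes
    pin_memory=True,          # allows faster, async host-to-device copies
    prefetch_factor=4,        # batches each worker prepares in advance
    persistent_workers=True,  # keep workers alive between epochs
)
```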

To resolve this issue, you can pre-process all the images before training (e.g., saving a smaller copy of each training image, say 64x64, before you start).
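A sketch of that offline pre-processing step, with illustrative paths and Pillow doing the resizing:

```python
# Resize every image once, before training, so the loader only reads small
# files. Paths and the 64x64 size are illustrative, per the comment above.
from pathlib import Path
from PIL import Image

src, dst = Path("data/images"), Path("data/images_64")
dst.mkdir(parents=True, exist_ok=True)

for path in src.glob("*.jpg"):
    with Image.open(path) as img:
        img.convert("RGB").resize((64, 64), Image.LANCZOS).save(dst / path.name)
```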

Or, if you use multiple nodes to train the model, the communication between them might cause this issue (it could come from a slow network between the nodes).
