
GPU memory usage VS GPU utilization #348

Open
kirilllzaitsev opened this issue May 15, 2023 · 3 comments

Comments

@kirilllzaitsev

Hi, I observe the following system metrics:

[Screenshot: system metrics showing GPU memory usage near 100% of 24 GB while GPU utilization stays around 25%]

Since the memory is nearly full, I expect the GPU to be highly utilized. What is the correct intuition for this metric being that low? Does the ImagenTrainer that I use put lots of objects on the GPU that go unused during training?
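For reference, both numbers in the screenshot can be read programmatically. A minimal sketch, assuming PyTorch (`torch.cuda.utilization()` needs the `pynvml` package installed; all names here are illustrative, not from the original post):

```python
# Read the two metrics from the screenshot: compute utilization and memory.
# Requires a CUDA device and the pynvml package for utilization().
import torch

device = torch.device("cuda:0")
util = torch.cuda.utilization(device)                    # % of time kernels were running
used_gb = torch.cuda.memory_allocated(device) / 1024**3  # memory held by live tensors
total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
print(f"utilization: {util}%  memory: {used_gb:.1f}/{total_gb:.1f} GB")
```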

@TheFusion21
Contributor

Possible causes of the problem (a timing sketch follows this list):

  1. Loading of batches is slow
  2. The model is too small to utilize the entire GPU (increase the batch size)
  3. Some other bottleneck (CPU, PCIe link, etc.)
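A simple way to tell which of these applies is to time the data-loading wait separately from the GPU step. The sketch below uses dummy stand-ins (`dataset`, `model`, `optimizer`), not anything from ImagenTrainer; if `data` time dominates `step` time, cause 1 is the culprit:

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins; substitute your own dataset and model.
dataset = TensorDataset(torch.randn(512, 3, 64, 64), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=32, num_workers=2)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

end = time.perf_counter()
for images, labels in loader:
    data_time = time.perf_counter() - end        # time spent waiting on the loader

    images, labels = images.cuda(non_blocking=True), labels.cuda(non_blocking=True)
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                     # finish GPU work before timing
    step_time = time.perf_counter() - end - data_time

    print(f"data: {data_time * 1e3:.1f} ms  step: {step_time * 1e3:.1f} ms")
    end = time.perf_counter()
```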

@kirilllzaitsev
Author

@TheFusion21, thank you for the suggestions. But I still can't explain why GPU utilization stays at ~25% while GPU memory (which blocks a larger batch size, model, etc., due to out-of-memory errors) sits at almost 100% of the available 24 GB.
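One likely piece of the intuition, assuming PyTorch (which ImagenTrainer is built on): the memory figure reported by tools like `nvidia-smi` includes blocks held by PyTorch's caching allocator, so it can sit near 100% even while the compute units are mostly idle. Memory usage measures allocation, not activity. A small sketch:

```python
# Memory "used" as seen by the driver includes blocks cached by PyTorch's
# allocator, so near-full memory says nothing about how busy the GPU is.
import torch

x = torch.randn(4096, 4096, device="cuda")     # allocate a large tensor
print(torch.cuda.memory_allocated() / 1e6)     # bytes in live tensors
print(torch.cuda.memory_reserved() / 1e6)      # bytes held by the caching allocator

del x                                          # tensor is gone...
print(torch.cuda.memory_allocated() / 1e6)     # ...allocated drops to ~0
print(torch.cuda.memory_reserved() / 1e6)      # ...but reserved (what nvidia-smi sees) stays high

torch.cuda.empty_cache()                       # return cached blocks to the driver
print(torch.cuda.memory_reserved() / 1e6)
```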

@FriedRonaldo

In most cases, the major bottleneck is the data loader. If your input images are too large or require complex processing during the training phase, the GPU has to wait for the CPU to finish each batch.
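If the loader is indeed the bottleneck, the usual first step is to parallelize it. A sketch of common `torch.utils.data.DataLoader` knobs, with a dummy dataset standing in for the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 64, 64))  # placeholder dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # decode/augment in parallel CPU processes
    pin_memory=True,          # allows faster, async host-to-device copies
    prefetch_factor=4,        # batches each worker prepares in advance
    persistent_workers=True,  # keep workers alive between epochs
)
```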

To resolve this issue, you can pre-process all the images before training (e.g., saving a smaller copy of each training image, say 64x64, before you start).
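A sketch of that offline pre-processing step, with illustrative paths and Pillow doing the resizing:

```python
# Resize every image once, before training, so the loader only reads small
# files. Paths and the 64x64 size are illustrative, per the comment above.
from pathlib import Path
from PIL import Image

src, dst = Path("data/images"), Path("data/images_64")
dst.mkdir(parents=True, exist_ok=True)

for path in src.glob("*.jpg"):
    with Image.open(path) as img:
        img.convert("RGB").resize((64, 64), Image.LANCZOS).save(dst / path.name)
```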

Or, if you use multiple nodes to train the model, the communication between them might cause this issue (it could come from a slow network between the nodes).
