CUDA out of memory in validation #1432

Open
ChenDirk opened this issue Mar 30, 2022 · 8 comments

@ChenDirk

I use 8 GPUs to train one model. During training, CUDA memory usage is about 19K MiB (24K total), but during validation it needs more than 24K MiB, runs out of memory, and training stops. In training the random crop size is 512x512, and in validation and test the data is resized to 512x512 (keep_ratio=False). At the beginning I thought it was due to the softmax layer in inference, because the number of classes is very large (194), so I removed it, but that did not fix the problem. Can you tell me other possible reasons for the problem? Thanks a lot!
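For context, a validation/test pipeline matching the description above might look roughly like this in an mmsegmentation config; the img_norm_cfg values and the exact transform list are assumptions, not taken from the reporter's actual config:

```python
# Hypothetical test pipeline: images are resized to 512x512 with
# keep_ratio=False inside MultiScaleFlipAug, as described above.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```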

@MeowZheng
Collaborator

It would be better to provide the config file you run, so we can reproduce the problem you met.

@MeowZheng MeowZheng self-assigned this Mar 31, 2022
@linfangjian01
Contributor

Hi, during training you can set the crop size to 512x512 to ensure that the input image size is 512x512. However, during validation, if 'whole' inference is used, the size of the input image may change.
There are two things to check (see the config sketch below):

  1. whether test_cfg uses mode='whole' or mode='slide';
  2. print the size of the images fed into the network to check for size changes during validation.
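For reference, the test_cfg fragment mentioned in point 1 typically sits in the model config like this; the crop_size and stride values below are only placeholders, not a recommendation:

```python
model = dict(
    # ... backbone / decode_head as in your config ...
    # 'slide' crops overlapping windows and stitches the predictions, so peak
    # memory is bounded by one crop; 'whole' runs the full image in one pass.
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341))
    # test_cfg=dict(mode='whole')  # alternative: whole-image inference
)
```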

@ChenDirk
Author

Hi, I print the shape of img in the whole_inference function in mmseg/models/segmentors/encoder_decoder.py, and all the shapes are torch.Size([1, 3, 512, 512]), but I still hit the same OOM problem. During training, GPU memory usage is only about 11 GB, but during validation it is more than 24 GB. I don't think it is a problem of input size, because I set keep_ratio=False when resizing the test data. BTW, when I use tools/dist_test.sh to test with the last .pth file, there is no OOM problem; I don't know the reason...
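When the input shapes look identical but memory still climbs, logging the allocator state next to the shape can narrow things down. A small sketch of such a check; the helper name and where you call it (e.g. inside whole_inference) are up to you:

```python
import torch

def log_gpu_memory(img, tag=''):
    """Print the input shape together with current/peak GPU memory (in MiB)."""
    cur = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f'{tag} shape={tuple(img.shape)} allocated={cur:.0f}MiB peak={peak:.0f}MiB')

# e.g. call log_gpu_memory(img, tag='whole_inference') right before the forward
# pass, and torch.cuda.reset_peak_memory_stats() at the start of validation.
```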

@ChenDirk
Author

> It would be better to provide the config file you run, so we can reproduce the problem you met.

I use custom data and a custom model, so I think it is very hard to reproduce the problem. For now, my workaround is to set --no-validate during training and then test the performance with all the checkpoint files after training, but it looks a bit clumsy...

@YuriyKortev

YuriyKortev commented Apr 21, 2022

Same problem here. I am using mode='slide'; the current implementation of slide inference first moves the whole image onto the GPU and only then slices it into crops. I think combining this implementation with MultiScaleFlipAug gives crazy GPU memory consumption. Are there any ideas how to deal with this?
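One way to bound the peak, sketched below, is to keep the full image on the CPU and move only one crop at a time to the device. This is not the mmseg implementation, just a simplified illustration of the idea; model here is assumed to take a crop tensor and return per-crop logits directly, and stride is assumed to be no larger than crop_size:

```python
import torch

def slide_inference_cpu_crops(model, img, crop_size, stride, num_classes, device):
    """img: (1, 3, H, W) CPU tensor; only one crop at a time is moved to `device`."""
    _, _, h, w = img.shape
    h_crop, w_crop = crop_size
    h_stride, w_stride = stride
    h_grids = max(h - h_crop + h_stride - 1, 0) // h_stride + 1
    w_grids = max(w - w_crop + w_stride - 1, 0) // w_stride + 1
    preds = torch.zeros((1, num_classes, h, w))   # accumulated logits, kept on CPU
    count = torch.zeros((1, 1, h, w))             # how often each pixel was covered
    for i in range(h_grids):
        for j in range(w_grids):
            y1, x1 = i * h_stride, j * w_stride
            y2, x2 = min(y1 + h_crop, h), min(x1 + w_crop, w)
            y1, x1 = max(y2 - h_crop, 0), max(x2 - w_crop, 0)
            crop = img[:, :, y1:y2, x1:x2].to(device)   # only the crop lives on the GPU
            with torch.no_grad():
                logits = model(crop)                    # assumed shape: (1, C, y2-y1, x2-x1)
            preds[:, :, y1:y2, x1:x2] += logits.cpu()
            count[:, :, y1:y2, x1:x2] += 1
    return preds / count
```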

@MeowZheng MeowZheng added documentation Improvements or additions to documentation FAQ labels May 3, 2022
@hexuedong1117

I also had the same problem. I have three devices; two of them run the same code without any issues, but on the third device GPU memory grows abnormally during testing, and the GPU memory occupancy keeps changing. After observing the size of the input images, the memory changes do not seem to be related to image size. (The input sizes are the same as on the other machines, and the GPU memory occupied by the other machines during the test does not change and is not large.)

@ChenDirk
Author

Hi,
I also tried the Cityscapes dataset, and when I run my model on that dataset there is no OOM error in validation. So I guess the OOM error is due to the different image sizes in my dataset, and maybe the problem comes from the resize operation: the images in Cityscapes are very large, but they all have the same size, and it still works very well...

@hexuedong1117

Thank you very much. What you said is very useful, and your idea seems to be correct. After I used your approach, there are no more errors.
