CUDA out of memory in validation #1432

Open
ChenDirk opened this issue Mar 30, 2022 · 8 comments

@ChenDirk

I use 8 GPUs to train one model. During training, CUDA memory usage is about 19K MiB (24K total), but during validation it needs more than 24K MiB, runs out of memory, and training stops. In training the random crop size is 512x512, and in validation and test the data is resized to 512x512 (keep_ratio=False). At the beginning I thought it was due to the softmax layer in inference, because the number of classes is very large (194), so I removed it, but that did not fix the problem. Can you tell me other possible reasons for the problem? Thanks a lot!
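For context, a validation/test pipeline matching the description above might look roughly like this in an mmsegmentation config; the img_norm_cfg values and the exact transform list are assumptions, not taken from the reporter's actual config:

```python
# Hypothetical test pipeline: images are resized to 512x512 with
# keep_ratio=False inside MultiScaleFlipAug, as described above.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```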

@MeowZheng
Collaborator

It would be better to provide the config file you run, so we can reproduce the problem you met.

@MeowZheng MeowZheng self-assigned this Mar 31, 2022
@linfangjian01
Contributor

Hi, during training you can set the crop size to 512x512 to ensure that the input image size is 512x512. However, during validation, if 'whole' inference is used, the size of the input image may change.
There are two things to check (see the config sketch below):

  1. whether test_cfg uses mode='whole' or mode='slide';
  2. print the size of the images fed into the network to check for size changes during validation.
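For reference, the test_cfg fragment mentioned in point 1 typically sits in the model config like this; the crop_size and stride values below are only placeholders, not a recommendation:

```python
model = dict(
    # ... backbone / decode_head as in your config ...
    # 'slide' crops overlapping windows and stitches the predictions, so peak
    # memory is bounded by one crop; 'whole' runs the full image in one pass.
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341))
    # test_cfg=dict(mode='whole')  # alternative: whole-image inference
)
```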

@ChenDirk
Author

Hi, I print the shape of img in the whole_inference function in mmseg/models/segmentors/encoder_decoder.py, and all the shapes are torch.Size([1, 3, 512, 512]), but I still hit the same OOM problem. During training, GPU memory usage is only about 11 GB, but during validation it is more than 24 GB. I don't think it is a problem of input size, because I set keep_ratio=False when resizing the test data. BTW, when I use tools/dist_test.sh to test with the last .pth file, there is no OOM problem; I don't know the reason...
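When the input shapes look identical but memory still climbs, logging the allocator state next to the shape can narrow things down. A small sketch of such a check; the helper name and where you call it (e.g. inside whole_inference) are up to you:

```python
import torch

def log_gpu_memory(img, tag=''):
    """Print the input shape together with current/peak GPU memory (in MiB)."""
    cur = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f'{tag} shape={tuple(img.shape)} allocated={cur:.0f}MiB peak={peak:.0f}MiB')

# e.g. call log_gpu_memory(img, tag='whole_inference') right before the forward
# pass, and torch.cuda.reset_peak_memory_stats() at the start of validation.
```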

@ChenDirk
Author

> It would be better to provide the config file you run, so we can reproduce the problem you met.

I use custom data and a custom model, so I think it is very hard to reproduce the problem. For now, my workaround is to set --no-validate during training and then test the performance with all the checkpoint files after training, but it looks a bit clumsy...

@YuriyKortev

YuriyKortev commented Apr 21, 2022

Same problem here. I am using mode='slide'; the current implementation of slide inference first moves the whole image onto the GPU and only then slices it into crops. I think combining this implementation with MultiScaleFlipAug gives crazy GPU memory consumption. Are there any ideas how to deal with this?
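One way to bound the peak, sketched below, is to keep the full image on the CPU and move only one crop at a time to the device. This is not the mmseg implementation, just a simplified illustration of the idea; model here is assumed to take a crop tensor and return per-crop logits directly, and stride is assumed to be no larger than crop_size:

```python
import torch

def slide_inference_cpu_crops(model, img, crop_size, stride, num_classes, device):
    """img: (1, 3, H, W) CPU tensor; only one crop at a time is moved to `device`."""
    _, _, h, w = img.shape
    h_crop, w_crop = crop_size
    h_stride, w_stride = stride
    h_grids = max(h - h_crop + h_stride - 1, 0) // h_stride + 1
    w_grids = max(w - w_crop + w_stride - 1, 0) // w_stride + 1
    preds = torch.zeros((1, num_classes, h, w))   # accumulated logits, kept on CPU
    count = torch.zeros((1, 1, h, w))             # how often each pixel was covered
    for i in range(h_grids):
        for j in range(w_grids):
            y1, x1 = i * h_stride, j * w_stride
            y2, x2 = min(y1 + h_crop, h), min(x1 + w_crop, w)
            y1, x1 = max(y2 - h_crop, 0), max(x2 - w_crop, 0)
            crop = img[:, :, y1:y2, x1:x2].to(device)   # only the crop lives on the GPU
            with torch.no_grad():
                logits = model(crop)                    # assumed shape: (1, C, y2-y1, x2-x1)
            preds[:, :, y1:y2, x1:x2] += logits.cpu()
            count[:, :, y1:y2, x1:x2] += 1
    return preds / count
```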

@MeowZheng MeowZheng added documentation Improvements or additions to documentation FAQ labels May 3, 2022
@hexuedong1117

I also had the same problem. I have three devices; two of them run the same code without any issues, but on the third device GPU memory grows abnormally during testing, and the GPU memory occupancy keeps changing. After observing the size of the input images, the memory changes do not seem to be related to image size. (The input sizes are the same as on the other machines, and the GPU memory occupied by the other machines during the test does not change and is not large.)

@ChenDirk
Author

Hi,
I also tried the Cityscapes dataset, and when I run my model on that dataset there is no OOM error in validation. So I guess the OOM error is due to the different image sizes in my dataset, and maybe the problem comes from the resize operation: the images in Cityscapes are very large, but they all have the same size, and it still works very well...

@hexuedong1117

Thank you very much. What you said is very useful, and your idea seems to be correct. After I used your approach, there are no more errors.
