Cuda out of memory in validation #1432
It would be better to provide the config file you ran, so that we can reproduce the problem you met.
Hi, during training you can set the crop size to 512x512 to ensure that the input image size is 512x512. However, during validation, if 'whole' inference is used, the size of the input image may change.
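For reference, a minimal sketch of what a fixed-size test pipeline and a sliding-window `test_cfg` could look like in an mmseg-style config. The field names follow the example configs shipped with the repo, but the concrete values (crop size, stride, normalization) are assumptions to adapt to your own setup:

```python
# Sketch only: pin the validation input to 512x512 and/or switch to sliding-window
# inference so that peak memory is bounded by the crop, not the full image.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),  # force exactly 512x512
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

model = dict(
    # 'slide' evaluates fixed-size crops instead of the whole image at once.
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
```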
Hi, I printed the shape of img in the whole_inference function in mmseg/models/segmentors/encoder_decoder.py, and all the shapes are torch.Size([1, 3, 512, 512]), but I still hit the same OOM problem. During training, GPU memory usage is only about 11 GB, but during validation it exceeds 24 GB. I don't think it is a problem of input size, because I set keep_ratio=False when resizing the test data. By the way, when I use tools/dist_test.sh to test with the last .pth file, there is no OOM problem; I don't know the reason...
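To help pinpoint where the memory actually goes, one option is to log allocated and peak GPU memory around the inference call. A minimal sketch, assuming the call site mentioned above (any other call site works the same way):

```python
import torch

def log_gpu_memory(tag):
    # Current vs. peak allocation, in MiB, for the default CUDA device.
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f'[{tag}] allocated: {alloc:.0f} MiB, peak: {peak:.0f} MiB')

# Example usage around the suspect call:
#   torch.cuda.reset_peak_memory_stats()
#   log_gpu_memory('before whole_inference')
#   seg_logit = self.whole_inference(img, img_meta, rescale)  # existing call
#   log_gpu_memory('after whole_inference')
```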
I use custom data and a custom model, so I think it is very hard to reproduce the problem. For now, my workaround is to set --no-validate during training and then test the performance with all the checkpoint files after training, but it feels clumsy...
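A rough sketch of that workaround, assuming checkpoints land in a work_dirs directory and tools/test.py is used for evaluation; the paths and the --eval metric below are placeholders:

```python
# Hypothetical post-training evaluation loop: run tools/test.py on every checkpoint.
import glob
import subprocess

CONFIG = 'configs/my_config.py'       # placeholder config path
WORK_DIR = 'work_dirs/my_experiment'  # placeholder work dir

for ckpt in sorted(glob.glob(f'{WORK_DIR}/epoch_*.pth')):
    subprocess.run(
        ['python', 'tools/test.py', CONFIG, ckpt, '--eval', 'mIoU'],
        check=True)
```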
Same problem here. I'm using mode='slide'; the current implementation of slide inference first moves the whole image onto the GPU and only then slices it into crops. I think combining this implementation with MultiScaleFlipAug leads to very heavy GPU memory consumption. Any ideas how to deal with this?
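One way around that, sketched below, is to keep the full-resolution image on the CPU and move only one crop at a time onto the GPU. This is not the library's implementation, just an illustration of the idea; the model call and its output shape are assumptions:

```python
import torch

def slide_inference_cpu_crops(model, img, crop_size, stride, num_classes, device='cuda'):
    """Sliding-window inference that keeps the full image on the CPU.

    img: (1, 3, H, W) tensor on the CPU. Only one crop at a time is copied to
    the GPU, so peak GPU memory is bounded by the crop size rather than the
    full (possibly multi-scale / flipped) image.
    """
    _, _, h, w = img.shape
    ch, cw = crop_size
    sh, sw = stride
    logits = torch.zeros((1, num_classes, h, w))
    count = torch.zeros((1, 1, h, w))
    h_grids = max(h - ch + sh - 1, 0) // sh + 1
    w_grids = max(w - cw + sw - 1, 0) // sw + 1
    for hi in range(h_grids):
        for wi in range(w_grids):
            y1 = min(hi * sh, max(h - ch, 0))
            x1 = min(wi * sw, max(w - cw, 0))
            y2, x2 = min(y1 + ch, h), min(x1 + cw, w)
            crop = img[:, :, y1:y2, x1:x2].to(device)
            with torch.no_grad():
                out = model(crop)  # assumed to return (1, num_classes, y2-y1, x2-x1)
            logits[:, :, y1:y2, x1:x2] += out.cpu()
            count[:, :, y1:y2, x1:x2] += 1
    return logits / count  # average logits over overlapping crops
```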
I also had the same problem. I have three devices; two of them run the same code without any issues, but on the third device GPU memory grows abnormally during testing, and the occupied GPU memory keeps changing. After observing the size of the input images, the memory growth does not seem to be related to image size. (The input sizes on the other machines are the same, yet the GPU memory they occupy during testing does not change and stays low.)
Hi,
Thank you very much. What you said is very useful, and your idea seems to be correct. After I used your method, there are no more errors.
I use 8 GPUs to train one model. During training, CUDA occupies about 19K MiB (24K total), but in validation it needs more than 24K MiB, runs out of memory, and training stops. In training, the random crop size is 512x512, and in validation and test the data is resized to 512x512 (keep_ratio=False). At the beginning, I thought it was due to the softmax layer in inference, because the number of classes is very large (194). So I removed it, but that did not fix the problem. Can you tell me other possible reasons for the problem? Thanks a lot!
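On the softmax point: for pure class prediction the softmax can indeed be skipped at test time, since argmax over raw logits gives the same result as argmax over the probabilities. A minimal sketch, where the model(img) call and tensor shapes are assumptions:

```python
import torch

# Skipping softmax at test time: argmax over raw logits equals argmax over
# softmax probabilities, because softmax is monotone per pixel.
with torch.no_grad():              # no_grad also avoids keeping activations around
    logits = model(img)            # assumed shape (1, num_classes, H, W)
    pred = logits.argmax(dim=1)    # (1, H, W) class indices
```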