
Training memory requirement #121

Closed
SamedYalcin opened this issue Mar 16, 2024 · 7 comments
@SamedYalcin

Hi,

I'm trying to train your model on Kaggle with a P100 (16 GB VRAM), but I'm running out of memory. Can you share the memory requirements and, if possible, any tips to reduce the memory needed?

Attached below is the model I'm trying to train. Instead of train.sh, I'm using train.py.

[screenshot: model configuration]

@TempleX98
Collaborator

We use A100 80G GPUs to train the Swin-L and ViT-L models. You can reduce the image size to 1333x800 or freeze the backbone during training. Besides, techniques such as FSDP and FP16 training can help reduce memory consumption. Please refer to the latest mmdetection v3 for more details.
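These suggestions correspond to a few concrete overrides in an mmdetection-style config. A minimal sketch, assuming upstream mmdetection key names (`frozen_stages`, `with_cp`, `img_scale`); verify the exact names against this repo's configs:

```python
# Sketch of memory-saving overrides for an mmdetection-style config dict.
# Key names follow upstream mmdetection conventions and are assumptions here.

backbone_overrides = dict(
    frozen_stages=4,  # freeze all four Swin stages so the backbone is not trained
    with_cp=True,     # enable gradient (activation) checkpointing in the backbone
)

# Train at a smaller, fixed resolution instead of large multi-scale crops.
resize_step = dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)

def apply_overrides(cfg):
    """Merge the memory-saving overrides into a plain-dict config and return it."""
    cfg = dict(cfg)
    model = dict(cfg.get('model', {}))
    backbone = dict(model.get('backbone', {}))
    backbone.update(backbone_overrides)
    model['backbone'] = backbone
    cfg['model'] = model
    return cfg
```

Freezing all stages plus checkpointing trades backbone gradients and stored activations for recompute time, which is usually the biggest single saving on a 16 GB card.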

@SamedYalcin
Author

SamedYalcin commented Mar 17, 2024

Reducing the image size helps for a few batches, but after a while it fails again. I will try freezing the backbone. About your last suggestion: does this repo work with MMDetection v3?
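Freezing the backbone can also be done directly on the model object, independent of config keys. A sketch; the `model.backbone` attribute name is an mmdetection convention and an assumption here:

```python
def freeze_backbone(model):
    """Disable gradients for every backbone parameter and switch the backbone
    to eval mode so normalization statistics stop updating.

    Assumes the detector exposes its backbone as ``model.backbone``.
    """
    backbone = model.backbone
    backbone.eval()  # stop BatchNorm/LayerNorm running-stat updates
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone
```

Frozen parameters need no gradient buffers or optimizer state, which is where most of the saving comes from.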

@TempleX98
Collaborator

@SamedYalcin
Author

SamedYalcin commented Mar 17, 2024

I wasn't able to find it under the Model Zoo. Thanks for pointing it out, and thanks for the help.

[screenshot: MMDetection config entry]
The MMDetection repo seems to use co_dino_5scale_swin_large_16e_o365tococo.pth instead of co_dino_5scale_swin_large_1x_coco.pth. Is this a mistake? co_dino_5scale_swin_large_16e_o365tococo.pth seems to use Objects365 labels, whereas co_dino_5scale_swin_large_1x_coco.pth uses COCO labels. The config is for COCO.

@SamedYalcin
Author

Edited the comment.

@TempleX98
Collaborator

This config is used to finetune the Objects365-pretrained Swin-L on the COCO dataset.
If you want to train this model on your custom dataset, I recommend using co_dino_5scale_swin_large_16e_o365tococo.pth for better performance.

@SamedYalcin
Author

For newcomers:

  • Gradient checkpointing or reducing the image size alone is not enough to train with 16 GB of VRAM. I had to freeze the backbone completely, enable checkpointing, and reduce the image size to 1333x800. Memory usage peaked at around ~15 GB.
  • Automatic Mixed Precision training throws a runtime error. Maybe it's not supported?
  • I can't comment on FSDP, as I train on a single GPU, and from my understanding distributed training is required to enable FSDP.
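To confirm a setup actually fits (like the ~15 GB peak above), PyTorch's CUDA memory counters are handy. A small sketch; note that `torch.cuda.max_memory_allocated` counts only tensor allocations, so the true process footprint is somewhat higher:

```python
def bytes_to_gib(n):
    """Convert a byte count to GiB."""
    return n / 1024**3

def peak_vram_gib(device=0):
    """Peak GPU memory allocated by tensors since the last reset, in GiB."""
    import torch  # imported lazily so bytes_to_gib stays dependency-free
    return bytes_to_gib(torch.cuda.max_memory_allocated(device))

# Typical use around one training step:
#   torch.cuda.reset_peak_memory_stats()
#   ... run one forward/backward pass ...
#   print(f"peak allocated: {peak_vram_gib():.2f} GiB")
```

Resetting the peak counter before a representative step and reading it afterwards gives a quick check of whether a config change actually reduced memory.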

Thanks for the help @TempleX98. Feel free to close the issue at your convenience.
