
Training memory requirement #121

Closed
SamedYalcin opened this issue Mar 16, 2024 · 7 comments
@SamedYalcin

Hi,

I'm trying to train your model on Kaggle with a P100 (16 GB VRAM), but I'm running out of memory. Can you share the memory requirements and, if possible, any tips to reduce the memory needed?

Attached below is the model I'm trying to train. Instead of train.sh, I'm using train.py.

[screenshot: model configuration]

@TempleX98
Collaborator

We use A100 80G GPUs to train the Swin-L and ViT-L models. You can reduce the image size to 1333x800 or freeze the backbone during training. Besides, techniques such as FSDP and FP16 training can help reduce memory consumption. Please refer to the latest mmdetection v3 for more details.
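These suggestions correspond to a few concrete overrides in an mmdetection-style config. A minimal sketch, assuming upstream mmdetection key names (`frozen_stages`, `with_cp`, `img_scale`); verify the exact names against this repo's configs:

```python
# Sketch of memory-saving overrides for an mmdetection-style config dict.
# Key names follow upstream mmdetection conventions and are assumptions here.

backbone_overrides = dict(
    frozen_stages=4,  # freeze all four Swin stages so the backbone is not trained
    with_cp=True,     # enable gradient (activation) checkpointing in the backbone
)

# Train at a smaller, fixed resolution instead of large multi-scale crops.
resize_step = dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)

def apply_overrides(cfg):
    """Merge the memory-saving overrides into a plain-dict config and return it."""
    cfg = dict(cfg)
    model = dict(cfg.get('model', {}))
    backbone = dict(model.get('backbone', {}))
    backbone.update(backbone_overrides)
    model['backbone'] = backbone
    cfg['model'] = model
    return cfg
```

Freezing all stages plus checkpointing trades backbone gradients and stored activations for recompute time, which is usually the biggest single saving on a 16 GB card.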

@SamedYalcin
Author

SamedYalcin commented Mar 17, 2024

Reducing the image size helps for a few batches, but after a while it fails again. I will try freezing the backbone. About your last suggestion: does this repo work with MMDetection v3?
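Freezing the backbone can also be done directly on the model object, independent of config keys. A sketch; the `model.backbone` attribute name is an mmdetection convention and an assumption here:

```python
def freeze_backbone(model):
    """Disable gradients for every backbone parameter and switch the backbone
    to eval mode so normalization statistics stop updating.

    Assumes the detector exposes its backbone as ``model.backbone``.
    """
    backbone = model.backbone
    backbone.eval()  # stop BatchNorm/LayerNorm running-stat updates
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone
```

Frozen parameters need no gradient buffers or optimizer state, which is where most of the saving comes from.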

@TempleX98
Collaborator

@SamedYalcin
Author

SamedYalcin commented Mar 17, 2024

I wasn't able to find it under the Model Zoo. Thanks for pointing it out, and thanks for the help.

[screenshot: MMDetection config entry]
The MMDetection repo seems to use co_dino_5scale_swin_large_16e_o365tococo.pth instead of co_dino_5scale_swin_large_1x_coco.pth. Is this a mistake? co_dino_5scale_swin_large_16e_o365tococo.pth seems to use Objects365 labels, whereas co_dino_5scale_swin_large_1x_coco.pth uses COCO labels. The config is for COCO.

@SamedYalcin
Author

Edited the comment.

@TempleX98
Collaborator

This config is used to finetune the Objects365-pretrained Swin-L on the COCO dataset.
If you want to train this model on your custom dataset, I recommend using co_dino_5scale_swin_large_16e_o365tococo.pth for better performance.

@SamedYalcin
Author

For newcomers:

  • Gradient checkpointing or reducing the image size alone is not enough to train with 16 GB of VRAM. I had to freeze the backbone completely, enable checkpointing, and reduce the image size to 1333x800. Memory usage peaked at around ~15 GB.
  • Automatic Mixed Precision training throws a runtime error. Maybe it's not supported?
  • I can't comment on FSDP, as I train on a single GPU, and from my understanding distributed training is required to enable FSDP.
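To confirm a setup actually fits (like the ~15 GB peak above), PyTorch's CUDA memory counters are handy. A small sketch; note that `torch.cuda.max_memory_allocated` counts only tensor allocations, so the true process footprint is somewhat higher:

```python
def bytes_to_gib(n):
    """Convert a byte count to GiB."""
    return n / 1024**3

def peak_vram_gib(device=0):
    """Peak GPU memory allocated by tensors since the last reset, in GiB."""
    import torch  # imported lazily so bytes_to_gib stays dependency-free
    return bytes_to_gib(torch.cuda.max_memory_allocated(device))

# Typical use around one training step:
#   torch.cuda.reset_peak_memory_stats()
#   ... run one forward/backward pass ...
#   print(f"peak allocated: {peak_vram_gib():.2f} GiB")
```

Resetting the peak counter before a representative step and reading it afterwards gives a quick check of whether a config change actually reduced memory.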

Thanks for the help @TempleX98. Feel free to close the issue at your convenience.
