Description
I'm running the command provided here for training a ViT twice: once with batch size 288 on two GPUs (as in the link), and once with batch size 576 on a single GPU. As you can see in the plot below, the training loss for the single-GPU run is much smoother, while the two-GPU run oscillates heavily (though it follows a similar decreasing trend) and sometimes makes training unstable.
Is this behaviour expected? If not, I suspect there may be an error in the multi-GPU code path, but I couldn't track it down. Could you please take a look?
Thanks!
To Reproduce
./distributed_train.sh {1 or 2} --data-dir /path/to/100class/data --num-classes 100 --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b {576 or 288}
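To be explicit about how the placeholders pair up (only the GPU count and per-GPU batch size differ; all other flags are identical):
- Two GPUs: ./distributed_train.sh 2 ... -b 288 (global batch size 576)
- One GPU: ./distributed_train.sh 1 ... -b 576 (global batch size 576)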
Expected behavior
Fairly similar training loss curves for both experiments.
Environment
- OS: Ubuntu 22.04
- Using timm v1.0.11
- PyTorch 2.4.1 with CUDA 12.2
Additional context
I also ran an experiment with batch size 288 on one GPU but with --grad-accum-steps 2 (for an effective global batch size of 576, matching the other experiments) and saw no extreme oscillation; the loss curve looked fine, just like the other single-GPU run.
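For reference, that run was launched roughly like this (assuming the same launch script with a single process; all other flags identical to the command above):
./distributed_train.sh 1 --data-dir /path/to/100class/data ... -b 288 --grad-accum-steps 2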