
Oscillation in the train loss on multiple GPUs #2339

Closed
@sinahmr

Description

I'm running the training command provided here for a ViT, once with batch size 288 per GPU on two GPUs (as in the link), and once with batch size 576 on a single GPU. As the plot below shows, the training loss for the single-GPU run is much smoother than the two-GPU run, which oscillates heavily (although it follows a similar decreasing trend) and sometimes makes training unstable.
Is this behaviour expected? If not, I suspect there may be an error in the multi-GPU code path, but I couldn't track it down. Can you please have a look?
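For what it's worth, here is a minimal sketch (my own assumption, not timm's actual logging code, which I haven't checked) of how I would average the loss across ranks before logging, just to rule out the possibility that the oscillation is only per-rank logging noise from each GPU seeing 288 samples instead of 576:

import torch
import torch.distributed as dist

def global_mean_loss(loss: torch.Tensor) -> torch.Tensor:
    # Average the local (per-rank) loss over all DDP ranks so the logged
    # value reflects the full global batch rather than one GPU's 288 samples.
    if dist.is_available() and dist.is_initialized():
        reduced = loss.detach().clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        reduced /= dist.get_world_size()
        return reduced
    return loss.detach()

If a curve logged this way matched the single-GPU run, the difference would just be logging granularity rather than an actual training problem.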
Thanks!

To Reproduce
./distributed_train.sh {1 or 2} --data-dir /path/to/100class/data --num-classes 100 --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b {576 or 288}

Expected behavior
Fairly similar train loss for the experiments.

Screenshots
[W&B chart, 2024-11-19: train loss comparison of the single-GPU and two-GPU runs]

Environment

  • OS: Ubuntu 22.04
  • Using timm v1.0.11
  • PyTorch 2.4.1 with CUDA 12.2

Additional context
I also ran an experiment with batch size 288 on a single GPU but with --grad-accum-steps 2 (to get a global batch size of 576 like the other experiments) and saw no extreme oscillation in the loss plot; it looked just like the other single-GPU run.
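For clarity, here is a rough, self-contained sketch (with a dummy linear model standing in for the ViT, not timm's actual implementation) of what I understand --grad-accum-steps 2 with batch size 288 to do, i.e. one optimizer step over an effective batch of 576:

import torch
from torch import nn

model = nn.Linear(16, 100)                      # stand-in for vit_small_patch16_224
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()
accum_steps = 2                                 # mirrors --grad-accum-steps 2

optimizer.zero_grad()
for _ in range(accum_steps):
    images = torch.randn(288, 16)               # micro-batch of 288 samples
    targets = torch.randint(0, 100, (288,))
    # divide by accum_steps so the summed gradients equal the average over 576 samples
    loss = criterion(model(images), targets) / accum_steps
    loss.backward()
optimizer.step()                                # single update, effective batch size 576

Since this run behaved like the plain single-GPU one, the oscillation seems specific to the multi-GPU path rather than to the per-step batch size of 288.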

Labels: bug