Description
I'm running the command provided here for training a ViT twice: once with batch size 288 on two GPUs (as in the link), and once with batch size 576 on a single GPU. As you can see in the plot below, the training loss for the single-GPU run is much smoother, while the two-GPU run oscillates heavily (though it follows a similar decreasing trend) and sometimes makes training unstable.
Is this behaviour expected? If not, I suspect there may be an error in the multi-GPU code path, but I couldn't track it down. Could you please take a look?
Thanks!
To Reproduce
./distributed_train.sh {1 or 2} --data-dir /path/to/100class/data --num-classes 100 --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b {576 or 288}
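To be explicit about how the placeholders pair up (only the GPU count and per-GPU batch size differ; all other flags are identical):
- Two GPUs: ./distributed_train.sh 2 ... -b 288 (global batch size 576)
- One GPU: ./distributed_train.sh 1 ... -b 576 (global batch size 576)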
Expected behavior
Fairly similar training loss curves for both experiments.
Environment
- OS: Ubuntu 22.04
- Using timm v1.0.11
- PyTorch 2.4.1 with CUDA 12.2
Additional context
I also ran an experiment with batch size 288 on one GPU but with --grad-accum-steps 2 (for an effective global batch size of 576, matching the other experiments) and saw no extreme oscillation; the loss curve looked fine, just like the other single-GPU run.
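For reference, that run was launched roughly like this (assuming the same launch script with a single process; all other flags identical to the command above):
./distributed_train.sh 1 --data-dir /path/to/100class/data ... -b 288 --grad-accum-steps 2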