
Conversation

@KaparthyReddy

Fix: Correct loss normalization in training_step for multi-GPU training

Description

Fixes #37474

This PR corrects the loss aggregation logic in Trainer.training_step when training with multiple GPUs.

Problem

When num_items_in_batch is provided (e.g., for token-level loss normalization), each device computes its loss as:

per_device_loss = sum_of_losses / total_items_across_all_devices

Because each per-device loss is already normalized by the total number of items across all devices, aggregating these values with .mean() divides by n_gpu a second time, so the reported mean token loss ends up at 1/n_gpu of its expected value.

Fix

- When num_items_in_batch is provided, aggregate the already-normalized per-device losses with .sum() instead of .mean() in Trainer.training_step.

With 4 GPUs, for example, the loss was reported as 2.7 instead of the expected 10.8. The change restores consistent loss reporting across different GPU configurations (see the sketch below).
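To illustrate the arithmetic, here is a minimal sketch of the aggregation issue. It is not the actual Trainer code; aggregate_loss, per_device_loss_terms, and the concrete numbers are illustrative. It shows why .sum() reconstructs the true mean token loss once each per-device term has already been divided by the global num_items_in_batch, while .mean() shrinks the result by a factor of n_gpu.

```python
import torch

def aggregate_loss(per_device_loss_terms):
    """Aggregate per-device loss terms that are ALREADY divided by the
    global num_items_in_batch (total tokens across all devices)."""
    stacked = torch.stack(per_device_loss_terms)
    # Correct: each term is a partial sum of one global mean, so adding
    # them reconstructs the true mean token loss.
    correct = stacked.sum()
    # Buggy: .mean() divides the already-normalized terms by n_gpu again,
    # shrinking the reported loss by a factor of n_gpu.
    buggy = stacked.mean()
    return correct, buggy

# Illustrative numbers: 4 devices, each with a per-device token-loss sum of
# 675.0, and a global token count (num_items_in_batch) of 250,
# so each term is 675 / 250 = 2.7.
num_items_in_batch = 250
terms = [torch.tensor(675.0) / num_items_in_batch for _ in range(4)]
correct, buggy = aggregate_loss(terms)
print(f"sum: {correct.item():.1f}, mean: {buggy.item():.1f}")  # sum: 10.8, mean: 2.7
```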

@Rocketknight1
Member

This should already be fixed in #40799

@KaparthyReddy can you cool it down with these PRs, or at least read issues carefully and only make PRs when you understand the issue and you've checked they're not duplicates? You've made like 6 of them and they're all obviously written by Copilot, so it's very hard to tell if any of them actually do anything useful!

@KaparthyReddy
Author

Noted. I’ll proceed as I see fit.



Development

Successfully merging this pull request may close these issues.

Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1
