
Conversation

@KaparthyReddy

Fix: Correct loss normalization in training_step for multi-GPU training

Description

Fixes #37474

This PR corrects the loss aggregation logic in Trainer.training_step when training with multiple GPUs.

Problem

When num_items_in_batch is provided (e.g., for token-level loss normalization), each device computes its loss as:

per_device_loss = sum_of_losses / total_items_across_all_devices

Because each per-device loss is already normalized by the total number of items across all devices, aggregating these values with .mean() divides by n_gpu a second time, so the reported mean token loss ends up at 1/n_gpu of its expected value.

Fix

- When num_items_in_batch is provided, aggregate the already-normalized per-device losses with .sum() instead of .mean() in Trainer.training_step.

With 4 GPUs, for example, the loss was reported as 2.7 instead of the expected 10.8. The change restores consistent loss reporting across different GPU configurations (see the sketch below).
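To illustrate the arithmetic, here is a minimal sketch of the aggregation issue. It is not the actual Trainer code; aggregate_loss, per_device_loss_terms, and the concrete numbers are illustrative. It shows why .sum() reconstructs the true mean token loss once each per-device term has already been divided by the global num_items_in_batch, while .mean() shrinks the result by a factor of n_gpu.

```python
import torch

def aggregate_loss(per_device_loss_terms):
    """Aggregate per-device loss terms that are ALREADY divided by the
    global num_items_in_batch (total tokens across all devices)."""
    stacked = torch.stack(per_device_loss_terms)
    # Correct: each term is a partial sum of one global mean, so adding
    # them reconstructs the true mean token loss.
    correct = stacked.sum()
    # Buggy: .mean() divides the already-normalized terms by n_gpu again,
    # shrinking the reported loss by a factor of n_gpu.
    buggy = stacked.mean()
    return correct, buggy

# Illustrative numbers: 4 devices, each with a per-device token-loss sum of
# 675.0, and a global token count (num_items_in_batch) of 250,
# so each term is 675 / 250 = 2.7.
num_items_in_batch = 250
terms = [torch.tensor(675.0) / num_items_in_batch for _ in range(4)]
correct, buggy = aggregate_loss(terms)
print(f"sum: {correct.item():.1f}, mean: {buggy.item():.1f}")  # sum: 10.8, mean: 2.7
```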

@Rocketknight1
Member

This should already be fixed in #40799

@KaparthyReddy can you cool it down with these PRs, or at least read issues carefully and only make PRs when you understand the issue and you've checked they're not duplicates? You've made like 6 of them and they're all obviously written by Copilot, so it's very hard to tell if any of them actually do anything useful!

@KaparthyReddy
Author

Noted. I’ll proceed as I see fit.



Development

Successfully merging this pull request may close these issues.

Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1
