
Conversation

@ChingTsai (Collaborator) commented Jan 29, 2026

Description

This PR resolves discrepancies in loss calculation and training step counts when running SFT with gradient accumulation enabled. Currently, MaxText.sft.sft_trainer (which uses the Tunix trainer) behaves incorrectly when gradient accumulation is turned on, diverging significantly from the native implementation in MaxText.sft_trainer. This change aligns the Tunix-based SFT logic with the native behavior.

Problem Statement

  • Loss Disparity: With GA enabled, MaxText-Tunix shows a massive loss-scale disparity compared to the native implementation. MaxText-Tunix reuses the native loss_fn, and the native logic deliberately skips dividing by total_weights inside that function, deferring normalization to a later stage of its own training step. Tunix inherited the un-normalized loss but never applied the deferred division, so the reported loss was inflated (see the sketch after this list).

[Image: pr_fix_vs_original_vs_native — loss curves comparing the PR fix, the original Tunix path, and the native trainer]

  • Step Count Mismatch: MaxText-Native handles micro-batching internally by reshaping the full global batch, whereas Tunix relies on the input pipeline to provide pre-sized micro-batches. Without this adjustment, Tunix ingested a full global batch at every step, which skewed the epoch calculation and caused the run to terminate prematurely compared to the native implementation.
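
For context, here is a minimal sketch of the normalization behavior described above. The names (`loss_fn`, `total_weights`) follow the description in this PR; the snippet is illustrative and is not the actual MaxText/Tunix code.

```python
import jax.numpy as jnp
import optax

def loss_fn(logits, targets, weights):
  # Weighted per-token cross-entropy, summed but NOT divided by
  # total_weights -- the native trainer defers that normalization to a
  # later stage of its training step.
  token_losses = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
  total_loss = jnp.sum(token_losses * weights)
  total_weights = jnp.sum(weights)
  return total_loss, total_weights

# To match the native behavior, the Tunix path must apply the deferred
# normalization once per optimizer step, over all accumulated micro-batches.
def normalized_loss(total_loss, total_weights):
  return total_loss / jnp.maximum(total_weights, 1.0)
```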

FIXES: b/478823561

Tests

python3 -m MaxText.sft_trainer \
    src/MaxText/configs/sft.yml \
    run_name=$RUN_NAME \
    base_output_directory=..../qwen3-4b \
    model_name=qwen3-4b \
    load_parameters_path=..../qwen3-4b/0/items \
    tokenizer_path=Qwen/qwen3-4b \
    steps=$train_step \
    profiler=xplane \
    hf_path=arrow \
    dataset_type=hf \
    train_split=train \
    hf_train_files=..../data-00000-of-00001.arrow \
    hf_eval_files=..../data-00000-of-00001.arrow \
    per_device_batch_size=4 \
    gradient_accumulation_steps=4 \
    max_target_length=1024 \
    learning_rate=1.3e-5 \
    warmup_steps_fraction=0.05 \
    data_shuffle_seed=42 \
    gradient_clipping_threshold=1 \
    learning_rate_final_fraction=0 \
    weight_dtype=bfloat16
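
For reference, a rough sketch of the batch arithmetic behind the step-count fix, using the config above. This is illustrative only: `num_devices` and `examples_in_dataset` are hypothetical values, and it assumes the global batch of one optimizer step is split evenly into `gradient_accumulation_steps` micro-batches (exact MaxText config semantics may differ).

```python
per_device_batch_size = 4         # from the command above
gradient_accumulation_steps = 4   # from the command above
num_devices = 8                   # hypothetical
examples_in_dataset = 12_800      # hypothetical

global_batch = per_device_batch_size * num_devices          # 32 examples per optimizer step
micro_batch = global_batch // gradient_accumulation_steps   # 8 examples per micro-step

# If the input pipeline hands Tunix a full global batch at every micro-step
# instead of a micro-batch, each optimizer step consumes
# gradient_accumulation_steps times more data than intended, so the dataset
# is exhausted (and the run ends) prematurely:
steps_per_epoch_expected = examples_in_dataset // global_batch                                   # 400
steps_per_epoch_observed = examples_in_dataset // (global_batch * gradient_accumulation_steps)   # 100
```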

After applying the changes, the loss graphs of both versions are now almost identical.

[Image: graph_2026-01-29_15-52-15 — loss curves of both versions after the fix]

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

ChingTsai changed the title from "Fix loss and batching when using tunix" to "Fix gradient accumulation in post training" Jan 29, 2026
ChingTsai changed the title from "Fix gradient accumulation in post training" to "Fix gradient accumulation in post training sft" Jan 29, 2026
@codecov (bot) commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/MaxText/input_pipeline/_hf_data_processing.py | 33.33% | 2 Missing ⚠️ |
| src/MaxText/train.py | 0.00% | 0 Missing and 1 partial ⚠️ |


ChingTsai force-pushed the jimmytsai/fix-ga-in-sft-trainer branch from 69e7031 to f36a364 on January 29, 2026 08:44
ChingTsai self-assigned this Jan 29, 2026
ChingTsai force-pushed the jimmytsai/fix-ga-in-sft-trainer branch from f36a364 to b891b70 on January 29, 2026 14:20