Fix cp 7807 #7878

Open

nathon-lee wants to merge 6 commits into deepspeedai:master from nathon-lee:fix_cp_7807

Conversation

@nathon-lee
Contributor

fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter

This commit addresses the issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer. The Newton-Schulz orthogonalization in Muon requires complete gradient information, which wasn't available when reduce_scatter was enabled.

The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.

Closes #7807
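To make the decision concrete, here is a minimal illustrative sketch of the reduction-path choice this PR describes. It is not the actual DeepSpeed code; the function and return values are hypothetical, standing in for the logic inside the ZeRO gradient-reduction path.

```python
# Illustrative sketch only -- names and return values are hypothetical,
# not DeepSpeed's real API.

def choose_reduction(reduce_scatter: bool, uses_muon: bool) -> str:
    """Pick the collective used to reduce gradients across ranks."""
    if uses_muon:
        # Muon's Newton-Schulz orthogonalization needs the complete
        # gradient on every rank, so force a full all-reduce even when
        # reduce_scatter is enabled in the config.
        return "all_reduce"
    # Otherwise honor the config: reduce-scatter leaves each rank with
    # only its own partition of the reduced gradient.
    return "reduce_scatter" if reduce_scatter else "all_reduce"
```

With this shape, ZeRO-1/2 keeps its bandwidth-saving reduce-scatter path for ordinary parameters and only pays the all-reduce cost where Muon needs full gradients.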

nathon-lee and others added 6 commits January 24, 2026 05:01
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Use ZeRO stage 1 to enable the BF16 optimizer.
(We should have switched to ZeRO-1 in deepspeedai#7788, but I missed the change.
@sfc-gh-truwase)

- deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad
accumulation without ZeRO, so that combo now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by
setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now
invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4665aa9345


Comment on lines +297 to +298
if self.reduce_scatter and self.uses_muon:
assert False, f"{self.zero_stage_string} with reduce_scatter=True is incompatible with Muon optimizer. Please disable reduce_scatter or use a different optimizer."


P1: Allow Muon fallback instead of aborting on reduce_scatter

This new assert False makes ZeRO-1/2 initialization fail whenever reduce_scatter=True and any parameter has use_muon=True, which blocks the exact training configuration this change is trying to handle. The later average_tensor change already adds a fallback path (all-reduce when Muon is present), so this assertion prevents that fix from ever running in normal Python execution and turns a correctness bug into a hard runtime failure for Muon users.
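The alternative the review suggests can be sketched as a check that warns and falls back rather than aborting. This is an illustrative helper, not the PR's actual code; the function name and call site are hypothetical.

```python
# Hypothetical sketch of the reviewer's suggestion: instead of
# `assert False` at init time, log a warning and signal the caller
# to fall back to full all-reduce so Muon training can proceed.
import logging

logger = logging.getLogger(__name__)

def resolve_reduce_scatter(reduce_scatter: bool, uses_muon: bool) -> bool:
    """Return the effective reduce_scatter setting (illustrative only)."""
    if reduce_scatter and uses_muon:
        logger.warning(
            "reduce_scatter=True is incompatible with the Muon optimizer; "
            "falling back to full all-reduce for gradient reduction.")
        return False  # disable reduce-scatter, keep training running
    return reduce_scatter
```

Note also that a bare `assert False` is skipped entirely under `python -O`, so a hard failure should be raised as an exception (e.g. `ValueError`) if aborting is really the intended behavior.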




Development

Successfully merging this pull request may close these issues.

[BUG] Cross-partition parameters incorrectly updated when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer
