Conversation
Use ZeRO stage 1 so the BF16 optimizer is used. (We should have switched to ZeRO stage 1 in deepspeedai#7788, but I missed the change. @sfc-gh-truwase)

- deepspeedai#7790 removed the fallback that allowed a bf16 model with fp32 gradient accumulation without ZeRO, so that combination now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4665aa9345
```python
if self.reduce_scatter and self.uses_muon:
    assert False, f"{self.zero_stage_string} with reduce_scatter=True is incompatible with Muon optimizer. Please disable reduce_scatter or use a different optimizer."
```
Allow Muon fallback instead of aborting on reduce_scatter
This new `assert False` makes ZeRO-1/2 initialization fail whenever `reduce_scatter=True` and any parameter has `use_muon=True`, which blocks the exact training configuration this change is trying to handle. The later `average_tensor` change already adds a fallback path (all-reduce when Muon is present), so this assertion prevents that fix from ever running under normal Python execution and turns a correctness bug into a hard runtime failure for Muon users.
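A minimal sketch of the suggested alternative: downgrade the hard assert to a warning and switch off `reduce_scatter` so the all-reduce fallback can run. The attribute names `reduce_scatter`, `uses_muon`, and `zero_stage_string` come from the snippet above; the class itself and the `logging` usage are illustrative, not DeepSpeed's actual implementation.

```python
import logging

logger = logging.getLogger(__name__)


class ZeroStageGradReducer:
    """Illustrative stand-in for the ZeRO-1/2 optimizer wrapper in the diff.

    Only `reduce_scatter`, `uses_muon`, and `zero_stage_string` mirror the
    snippet above; everything else is a sketch.
    """

    def __init__(self, reduce_scatter: bool, uses_muon: bool,
                 zero_stage_string: str = "ZeRO-2"):
        self.reduce_scatter = reduce_scatter
        self.uses_muon = uses_muon
        self.zero_stage_string = zero_stage_string

        # Instead of `assert False`, warn and fall back to all-reduce so the
        # Muon-aware path added in `average_tensor` can actually run.
        if self.reduce_scatter and self.uses_muon:
            logger.warning(
                "%s with reduce_scatter=True cannot partition Muon gradients; "
                "falling back to all-reduce for Muon parameters.",
                self.zero_stage_string,
            )
            self.reduce_scatter = False  # force the all-reduce path


reducer = ZeroStageGradReducer(reduce_scatter=True, uses_muon=True)
print(reducer.reduce_scatter)  # → False
```

This keeps the misconfiguration visible in the logs without hard-failing initialization for Muon users.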
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter
This commit addresses an issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with `reduce_scatter=True` and the Muon optimizer. Muon's Newton-Schulz orthogonalization requires the complete gradient, which is not available when reduce_scatter leaves each rank with only its own shard.
The fix introduces a check for Muon parameters and forces a full all-reduce gradient reduction for them, ensuring consistent parameter updates across all ranks.
Closes #7807
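The reduction choice described above can be sketched in a single process by simulating ranks as plain lists: Muon parameters get all-reduce semantics (every rank sees the full summed gradient, as Newton-Schulz needs the whole matrix), while other parameters keep reduce-scatter semantics (each rank retains only its own shard). The `Param` class, the per-parameter `use_muon` flag's placement, and the even-shard layout are simplifying assumptions for illustration, not DeepSpeed's actual data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Param:
    grad: List[float]
    use_muon: bool = False  # mirrors the per-parameter flag from the PR


def reduce_gradients(params_per_rank: List[List[Param]]) -> None:
    """Simulate the ZeRO-1/2 reduction choice on a single process.

    Muon parameters: emulate all-reduce, so every rank ends up with the
    complete summed gradient. Other parameters: emulate reduce-scatter,
    so rank r keeps only its shard of the sum (zeros elsewhere).
    """
    world_size = len(params_per_rank)
    n_params = len(params_per_rank[0])
    for i in range(n_params):
        grads = [ranks[i].grad for ranks in params_per_rank]
        summed = [sum(vals) for vals in zip(*grads)]
        if params_per_rank[0][i].use_muon:
            # all-reduce: full gradient everywhere
            for ranks in params_per_rank:
                ranks[i].grad = list(summed)
        else:
            # reduce-scatter: each rank keeps only its own contiguous shard
            shard = len(summed) // world_size
            for r, ranks in enumerate(params_per_rank):
                g = [0.0] * len(summed)
                g[r * shard:(r + 1) * shard] = summed[r * shard:(r + 1) * shard]
                ranks[i].grad = g


# Two ranks, one Muon parameter: both ranks receive the full sum.
ranks = [
    [Param([1.0, 2.0], use_muon=True)],
    [Param([3.0, 4.0], use_muon=True)],
]
reduce_gradients(ranks)
print(ranks[0][0].grad, ranks[1][0].grad)  # → [4.0, 6.0] [4.0, 6.0]
```

With `use_muon=False` the same call would leave rank 0 holding only the first half of the summed gradient and rank 1 the second half, which is exactly the state that broke Muon's update before this fix.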