[BUG] ZeRO optimizer with MoE Expert Parallelism #5618
Closed
opened on Jun 5, 2024
Describe the bug
Just like PR #5259, the ZeRO optimizer also needs to be fixed in two places (see the sketch after this list):
- the partitioning logic for expert parameters;
- `average_tensor`, which is used for gradient reduction in ZeRO stage 2.
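For reference, here is a minimal sketch of the reduce step in plain `torch.distributed` (not DeepSpeed's actual `average_tensor`; `is_expert_param`, `dp_group` and `expert_dp_group` are hypothetical placeholders for the bookkeeping the optimizer already has). It illustrates one plausible normalization in which the reduce group and the averaging divisor are chosen separately for expert parameters; if the expert-data-parallel group size were used as the divisor instead, expert gradients would be inflated by exactly `ep_size`, matching the 4x discrepancy reported below.

```python
# Minimal sketch, NOT DeepSpeed's implementation: how the reduce group and the
# averaging divisor could differ for expert vs. non-expert gradients in a
# ZeRO-2 style reduce. All helper names here are hypothetical placeholders.
import torch
import torch.distributed as dist

def average_gradient(param, dp_group, expert_dp_group, is_expert_param):
    """All-reduce a gradient in the group that actually holds replicas of it."""
    if is_expert_param:
        # Expert weights are replicated only across the expert-data-parallel
        # group (dp_world_size / ep_size ranks), so the all-reduce must run
        # in that group, not in the full data-parallel group.
        reduce_group = expert_dp_group
    else:
        # Dense (non-expert) weights are replicated across the full DP group.
        reduce_group = dp_group

    # To keep the gradient scale consistent with an ep=1 run, the divisor
    # arguably has to stay the full data-parallel world size: with ep>1 each
    # expert instance already accumulates the token gradients from ep_size
    # ranks via all-to-all, so dividing by the smaller expert-dp group size
    # would scale expert gradients up by ep_size.
    divisor = dist.get_world_size(group=dp_group)
    param.grad.div_(divisor)
    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=reduce_group)
```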
To Reproduce
Steps to reproduce the behavior:
Use `ep=4` and the AdamW optimizer to train an LLM (a minimal sketch follows).
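A minimal repro sketch of that setup (placeholder model, data, and hyperparameters; only `ep_size=4`, ZeRO stage 2, and AdamW come from this report), launched with e.g. `deepspeed --num_gpus 4 repro.py`:

```python
# Repro sketch: tiny MoE model with expert parallelism ep_size=4,
# trained with ZeRO stage 2 + AdamW. Model/data are placeholders.
import torch
import deepspeed
from deepspeed.moe.layer import MoE
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

class TinyMoEModel(torch.nn.Module):
    def __init__(self, hidden=256, num_experts=8, ep_size=4):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)
        # One expert is a small FFN; DeepSpeed replicates it num_experts times
        # and shards the expert instances across ep_size ranks.
        expert = torch.nn.Sequential(
            torch.nn.Linear(hidden, 4 * hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(4 * hidden, hidden),
        )
        self.moe = MoE(hidden_size=hidden, expert=expert,
                       num_experts=num_experts, ep_size=ep_size, k=1)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        x = self.proj(x)
        x, l_aux, _ = self.moe(x)
        return self.head(x).mean() + l_aux

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model = TinyMoEModel(ep_size=4)  # compare against an otherwise-identical run with ep_size=1
# Split expert params into their own optimizer group so ZeRO can treat them separately.
param_groups = split_params_into_different_moe_groups_for_optimizer(
    {"params": list(model.parameters()), "name": "parameters"})
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=param_groups,
                                       config=ds_config)

x = torch.randn(4, 16, 256, device=engine.device, dtype=torch.half)
loss = engine(x)
engine.backward(loss)
engine.step()
```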
Expected behavior
Expert gradients should be identical under `ep=4` and `ep=1`, but currently they are 4x larger under `ep=4` than under `ep=1` (see the logging sketch below).
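One way to observe the discrepancy (a heuristic sketch, not a test from the DeepSpeed suite): log the gradient norms of expert parameters right after `engine.backward()` and compare two otherwise-identical runs with `ep_size=1` and `ep_size=4`. Matching parameter names on `.experts.` is an assumption about the module naming inside `deepspeed.moe.layer.MoE`.

```python
# Heuristic check: print expert-parameter gradient norms after backward.
from deepspeed.utils import safe_get_full_grad

def log_expert_grad_norms(engine):
    for name, param in engine.module.named_parameters():
        if ".experts." not in name:  # assumed naming of MoE expert submodules
            continue
        # safe_get_full_grad retrieves the gradient even when ZeRO has
        # partitioned it, unlike reading param.grad directly.
        grad = safe_get_full_grad(param)
        if grad is not None:
            print(f"rank={engine.global_rank} {name}: grad_norm={grad.norm().item():.6f}")

# Usage inside the training loop:
#   engine.backward(loss)
#   log_expert_grad_norms(engine)   # norms come out ~4x larger when ep_size=4
#   engine.step()
```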