[BUG] ZeRO optimizer with MoE Expert Parallelism #5618
Closed
opened on Jun 5, 2024
Describe the bug
Just like PR #5259, the ZeRO optimizer also needs to be fixed in two places (see the sketch after this list):
- the partitioning logic for expert parameters;
- `average_tensor`, which is used for gradient reduction in ZeRO stage 2.
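For reference, here is a minimal sketch of the reduce step in plain `torch.distributed` (not DeepSpeed's actual `average_tensor`; `is_expert_param`, `dp_group` and `expert_dp_group` are hypothetical placeholders for the bookkeeping the optimizer already has). It illustrates one plausible normalization in which the reduce group and the averaging divisor are chosen separately for expert parameters; if the expert-data-parallel group size were used as the divisor instead, expert gradients would be inflated by exactly `ep_size`, matching the 4x discrepancy reported below.

```python
# Minimal sketch, NOT DeepSpeed's implementation: how the reduce group and the
# averaging divisor could differ for expert vs. non-expert gradients in a
# ZeRO-2 style reduce. All helper names here are hypothetical placeholders.
import torch
import torch.distributed as dist

def average_gradient(param, dp_group, expert_dp_group, is_expert_param):
    """All-reduce a gradient in the group that actually holds replicas of it."""
    if is_expert_param:
        # Expert weights are replicated only across the expert-data-parallel
        # group (dp_world_size / ep_size ranks), so the all-reduce must run
        # in that group, not in the full data-parallel group.
        reduce_group = expert_dp_group
    else:
        # Dense (non-expert) weights are replicated across the full DP group.
        reduce_group = dp_group

    # To keep the gradient scale consistent with an ep=1 run, the divisor
    # arguably has to stay the full data-parallel world size: with ep>1 each
    # expert instance already accumulates the token gradients from ep_size
    # ranks via all-to-all, so dividing by the smaller expert-dp group size
    # would scale expert gradients up by ep_size.
    divisor = dist.get_world_size(group=dp_group)
    param.grad.div_(divisor)
    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=reduce_group)
```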
To Reproduce
Steps to reproduce the behavior:
Use `ep=4` and the AdamW optimizer to train an LLM (a minimal sketch follows).
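A minimal repro sketch of that setup (placeholder model, data, and hyperparameters; only `ep_size=4`, ZeRO stage 2, and AdamW come from this report), launched with e.g. `deepspeed --num_gpus 4 repro.py`:

```python
# Repro sketch: tiny MoE model with expert parallelism ep_size=4,
# trained with ZeRO stage 2 + AdamW. Model/data are placeholders.
import torch
import deepspeed
from deepspeed.moe.layer import MoE
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

class TinyMoEModel(torch.nn.Module):
    def __init__(self, hidden=256, num_experts=8, ep_size=4):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)
        # One expert is a small FFN; DeepSpeed replicates it num_experts times
        # and shards the expert instances across ep_size ranks.
        expert = torch.nn.Sequential(
            torch.nn.Linear(hidden, 4 * hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(4 * hidden, hidden),
        )
        self.moe = MoE(hidden_size=hidden, expert=expert,
                       num_experts=num_experts, ep_size=ep_size, k=1)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        x = self.proj(x)
        x, l_aux, _ = self.moe(x)
        return self.head(x).mean() + l_aux

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model = TinyMoEModel(ep_size=4)  # compare against an otherwise-identical run with ep_size=1
# Split expert params into their own optimizer group so ZeRO can treat them separately.
param_groups = split_params_into_different_moe_groups_for_optimizer(
    {"params": list(model.parameters()), "name": "parameters"})
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=param_groups,
                                       config=ds_config)

x = torch.randn(4, 16, 256, device=engine.device, dtype=torch.half)
loss = engine(x)
engine.backward(loss)
engine.step()
```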
Expected behavior
Expert gradients should be identical under `ep=4` and `ep=1`, but currently they are 4x larger under `ep=4` than under `ep=1` (see the logging sketch below).
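One way to observe the discrepancy (a heuristic sketch, not a test from the DeepSpeed suite): log the gradient norms of expert parameters right after `engine.backward()` and compare two otherwise-identical runs with `ep_size=1` and `ep_size=4`. Matching parameter names on `.experts.` is an assumption about the module naming inside `deepspeed.moe.layer.MoE`.

```python
# Heuristic check: print expert-parameter gradient norms after backward.
from deepspeed.utils import safe_get_full_grad

def log_expert_grad_norms(engine):
    for name, param in engine.module.named_parameters():
        if ".experts." not in name:  # assumed naming of MoE expert submodules
            continue
        # safe_get_full_grad retrieves the gradient even when ZeRO has
        # partitioned it, unlike reading param.grad directly.
        grad = safe_get_full_grad(param)
        if grad is not None:
            print(f"rank={engine.global_rank} {name}: grad_norm={grad.norm().item():.6f}")

# Usage inside the training loop:
#   engine.backward(loss)
#   log_expert_grad_norms(engine)   # norms come out ~4x larger when ep_size=4
#   engine.step()
```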