
[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

Closed
@Jack47

Description

Describe the bug
Just like in PR #5259, the ZeRO optimizer also needs to be fixed in two places:

  1. The partition logic of expert params (screenshot of the relevant code attached in the original issue).
  2. `average_tensor`, used in gradient reduction in ZeRO stage 2 (screenshot attached); a sketch of group-aware averaging follows this list.
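
For reference, here is a minimal sketch (not DeepSpeed's actual `average_tensor`) of group-aware gradient averaging: expert gradients are reduced over the expert data-parallel group and divided by that group's size, while dense gradients use the full data-parallel group. The `is_expert` flag and the two process-group handles are assumptions for illustration, not DeepSpeed APIs.

```python
import torch.distributed as dist

def average_gradients(params, dp_group, expert_dp_group):
    """Sketch: all-reduce and average each gradient over the group that
    actually replicates the parameter."""
    for p in params:
        if p.grad is None:
            continue
        # Expert params are replicated only across the expert data-parallel
        # group (world_size / ep_size ranks), so the divisor must be that
        # group's size; mixing up the group or the divisor skews expert
        # gradients by a factor of ep_size (4x in this report).
        group = expert_dp_group if getattr(p, "is_expert", False) else dp_group
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
        p.grad.div_(dist.get_world_size(group=group))
```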

To Reproduce
Steps to reproduce the behavior:

Use expert parallelism ep=4 with the AdamW optimizer (ZeRO stage 2) to train an MoE LLM; a rough setup is sketched below.
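
A rough sketch of the setup (config values and model wiring are assumptions, not taken from the issue): ZeRO stage 2 with AdamW in the DeepSpeed config, plus an MoE layer built with `ep_size=4`.

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

# Inside the model, an MoE block replaces a dense FFN and splits its
# experts across 4 ranks, e.g. (illustrative only):
#   from deepspeed.moe.layer import MoE
#   moe = MoE(hidden_size=h, expert=ffn, num_experts=8, ep_size=4)
#
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```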

Expected behavior
Expert gradients should be identical under ep=4 and ep=1, but currently they come out 4 times larger under ep=4 than under ep=1; a quick check is sketched below.
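
One hedged way to observe this (not from the issue): log the gradient norm of the expert parameters after one identical backward step under ep=1 and ep=4; with the bug, the ep=4 value comes out roughly 4x the ep=1 value. `is_expert` is an assumed marker for expert parameters, not an actual DeepSpeed attribute name.

```python
import torch

def expert_grad_norm(model):
    # Collect gradients of parameters tagged as experts (hypothetical flag).
    grads = [p.grad.detach() for p in model.parameters()
             if getattr(p, "is_expert", False) and p.grad is not None]
    return torch.norm(torch.stack([g.norm() for g in grads])).item()

# Call after loss.backward() in both the ep=1 and ep=4 configurations;
# the two values should match rather than differ by a factor of ep.
```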


Labels: bug (Something isn't working), training
