
Bug about ZeRO optimizer #156

Closed
@gbxu

Description

In Section 6 of the paper Rajbhandari, Samyam, et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," the authors explain that P_os and P_g introduce no additional communication overhead.

My understanding of the correct algorithm is: for each weight, ZeRO-Adam reads the gradient shard produced by a reduce-scatter op, then computes the updated 1/n portion of that weight using its 1/n portion of the optimizer states. This 1/n-sized output of ZeRO-Adam is then passed to all_gather() or broadcast() to sync it with the other processes.
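
To make the question concrete, here is a minimal sketch of the step I have in mind, written with plain torch.distributed collectives. The function `zero_style_step`, the flat tensors it takes, and the shard-sized `exp_avg`/`exp_avg_sq` states are my own illustration (not DeepSpeed's API), and Adam bias correction is omitted for brevity:

```python
import torch
import torch.distributed as dist

def zero_style_step(flat_param, flat_grad, exp_avg, exp_avg_sq,
                    lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Hypothetical P_os + P_g step: reduce-scatter the gradients, run Adam
    locally on a 1/n shard of the weight with 1/n of the optimizer states,
    then all-gather the updated weight shards.
    Assumes flat_param.numel() is divisible by the world size and that
    exp_avg / exp_avg_sq are shard-sized (1/n) tensors."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_param.numel() // world_size

    # Reduce-scatter: each rank receives only the averaged gradient of its own shard.
    grad_shards = list(flat_grad.chunk(world_size))
    my_grad = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(my_grad, grad_shards, op=dist.ReduceOp.SUM)
    my_grad /= world_size

    # Local Adam update on the 1/n weight shard with 1/n of the optimizer states.
    my_param = flat_param.narrow(0, rank * shard_size, shard_size).clone()
    exp_avg.mul_(betas[0]).add_(my_grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(my_grad, my_grad, value=1 - betas[1])
    my_param.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-lr)

    # All-gather the updated 1/n shards so every rank ends up with the full
    # updated weight; no full-gradient all_reduce is needed.
    gathered = list(torch.empty_like(flat_param).chunk(world_size))
    dist.all_gather(gathered, my_param)
    flat_param.copy_(torch.cat(gathered))
```

With this flow each rank only ever holds 1/n of the reduced gradient and 1/n of the optimizer states, and the total communication volume (reduce-scatter + all-gather) matches that of a single all_reduce, which is how I read the "no extra communication overhead" claim in Section 6.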

However, in deepspeed/pt/deepspeed_light.py#L666 the gradients go through an all_reduce before being passed to the optimizer. Is that right?
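
For comparison, the pattern at that line looks to me like classic data parallelism, roughly equivalent to the sketch below (again just an illustration, not the actual DeepSpeed code), where every rank all-reduces the full gradients and then runs Adam over the complete weights and optimizer states:

```python
import torch.distributed as dist

def allreduce_style_step(model, optimizer):
    # Classic data-parallel step: average the full gradients on every rank,
    # then run the full optimizer update everywhere (no state partitioning).
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
```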
