In Section 6 of the ZeRO paper (Rajbhandari, Samyam, et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models"), the authors explain that P_os and P_g introduce no additional communication overhead.
My understanding of the intended algorithm is: for each weight, the ZeRO-aware Adam optimizer reads the gradient shard produced by a reduce_scatter op, then computes the updated 1/N partition of that weight together with its 1/N partition of the optimizer states. That 1/N-sized output is then passed to all_gather() (or broadcast()) so the full updated weight is synced across all processes.
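To make that data flow concrete, here is a minimal sketch using plain torch.distributed collectives. This is not DeepSpeed's actual implementation; the function name `sharded_adam_step`, the flat even 1/N partitioning, and the shard-sized `exp_avg` / `exp_avg_sq` state tensors are all illustrative assumptions.

```python
import torch
import torch.distributed as dist

def sharded_adam_step(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
                      betas=(0.9, 0.999), eps=1e-8):
    """Reduce-scatter the gradient, run Adam on this rank's 1/N shard only,
    then all-gather the updated shards to rebuild the full weight.
    exp_avg / exp_avg_sq hold only this rank's 1/N shard of optimizer state."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # 1) reduce-scatter: each rank ends up with the summed gradient of its own 1/N slice.
    grad_chunks = list(grad.detach().view(world_size, -1))  # assumes numel divisible by world_size
    grad_shard = torch.empty_like(grad_chunks[0])
    dist.reduce_scatter(grad_shard, grad_chunks)

    # 2) local Adam update on the 1/N shard of the weight and optimizer states.
    param_shards = list(param.data.view(world_size, -1))
    my_shard = param_shards[rank]
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad_shard, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad_shard, grad_shard, value=1 - beta2)
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    my_shard.addcdiv_(exp_avg / bias_c1, denom, value=-lr)

    # 3) all-gather: every rank receives all updated shards, syncing the full weight.
    dist.all_gather(param_shards, my_shard)
```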
However, in deepspeed/pt/deepspeed_light.py#L666 the gradients go through an all_reduce before being passed to the optimizer. Is that correct?
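For contrast, this is roughly the pattern I mean (an illustrative sketch, not the code at deepspeed_light.py#L666): the full gradient is all-reduced on every rank before an unpartitioned optimizer step.

```python
import torch
import torch.distributed as dist

def allreduce_then_step(param, grad, optimizer):
    # Every rank receives the full averaged gradient...
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad.div_(dist.get_world_size())
    # ...and then the optimizer updates the full parameter on every rank,
    # which is the behaviour the question above is about.
    param.grad = grad
    optimizer.step()
```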