Fix overlap communication of ZeRO stage 1 and 2 #5606

penn513 · 2024-06-03T09:21:33Z

deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor only makes the reduction stream wait for the default stream. This is fine when computation takes longer than communication, but when communication takes longer, the ipg_buffer may be overwritten before the communication has completed.

The simplest fix is to also make the default stream wait for the reduction stream at the same point. For example, at point 1, the reduction stream needs to wait for '2', so we add a wait_stream that makes the reduction stream wait for the default stream. Likewise, the default stream needs to wait for 'A', so we add a wait_stream that makes the default stream wait for the reduction stream before 'B'.
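The hazard and the fix can be sketched with a small timeline model. This is a simplified pure-Python simulation, not DeepSpeed code: the function name `simulate`, the fixed per-bucket times, and the two alternating buffers are illustrative assumptions made for this sketch.

```python
def simulate(compute_time, comm_time, num_buckets, two_way_wait):
    """Return the buckets whose fill starts while the buffer is still being reduced.

    Hypothetical model: bucket i is written into buffer i % 2 on the default
    stream, then reduced on the reduction stream.
    """
    default_free = 0   # time at which the default stream finishes its queued work
    reduce_free = 0    # time at which the reduction stream finishes its queued work
    reduce_end = []    # completion time of each bucket's reduction
    hazards = []

    for i in range(num_buckets):
        # The default stream fills buffer i % 2, last read by reduction i - 2.
        fill_start = default_free
        if i >= 2 and fill_start < reduce_end[i - 2]:
            hazards.append(i)  # buffer rewritten before its reduction finished
        default_free = fill_start + compute_time

        # Sync point. The extra wait from the fix: the default stream waits
        # for all reductions queued so far.
        if two_way_wait:
            default_free = max(default_free, reduce_free)

        # Existing wait: the reduction stream waits for the default stream
        # (the fill of bucket i) before reducing bucket i.
        start = max(reduce_free, fill_start + compute_time)
        reduce_free = start + comm_time
        reduce_end.append(reduce_free)

    return hazards


if __name__ == "__main__":
    print(simulate(compute_time=3, comm_time=1, num_buckets=4, two_way_wait=False))
    print(simulate(compute_time=1, comm_time=3, num_buckets=4, two_way_wait=False))
    print(simulate(compute_time=1, comm_time=3, num_buckets=4, two_way_wait=True))
```

In this model the one-way wait is safe only while computation is the slower side. Once communication dominates, later buckets start writing into a buffer that is still being reduced, and the second wait removes those overlaps.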

Compared with the modification of #5523, wait_stream does not cause host synchronization.
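The difference between a host-side synchronization and a device-side stream wait can be illustrated with a toy timeline. This is a sketch under stated assumptions, not a measurement or DeepSpeed code: `run`, `launch_cost`, and the fixed kernel durations are hypothetical, and the host-synchronizing variant stands in for any approach that blocks the host until the reduction stream drains.

```python
def run(num_steps, compute_time, comm_time, host_sync, launch_cost=1):
    """Return (host_blocked_time, device_finish_time) for a toy two-stream schedule."""
    host = 0           # host clock
    default_free = 0   # when the default stream finishes its queued work
    reduce_free = 0    # when the reduction stream finishes its queued work
    host_blocked = 0

    for _ in range(num_steps):
        # The host launches a compute kernel asynchronously on the default stream.
        host += launch_cost
        start = max(host, default_free)
        default_free = start + compute_time

        if host_sync:
            # Host-side synchronization: the host stalls until earlier
            # reductions have completed.
            if reduce_free > host:
                host_blocked += reduce_free - host
                host = reduce_free
        else:
            # Device-side wait (the wait_stream style): only the default
            # stream's future work is delayed; the host continues immediately.
            default_free = max(default_free, reduce_free)

        # The host launches the reduction, which waits for the default stream.
        host += launch_cost
        r_start = max(host, reduce_free, default_free)
        reduce_free = r_start + comm_time

    return host_blocked, max(default_free, reduce_free)


if __name__ == "__main__":
    print(run(3, compute_time=2, comm_time=10, host_sync=True))
    print(run(3, compute_time=2, comm_time=10, host_sync=False))
```

In the stream-wait variant the host never stalls, so it can keep queueing work while the ordering is enforced on the device.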

Compared with the modification of #5545, this change is simpler and the logic is the same: wait only for what needs to be waited for.

With this modification, the losses of Qwen-1.5 with and without overlap_comm are identical.

By contrast, without the modification there is an obvious gap when the sequence length is small, that is, when the computation time is short.

penn513 · 2024-06-03T14:12:42Z

@microsoft-github-policy-service agree company="Huawei"

CurryRice233 · 2024-06-04T02:15:55Z

Hi @tjruwase @GuanhuaWang , would you please help to review this PR?

GuanhuaWang · 2024-06-05T14:56:50Z

Hi @penn513, thanks for the nice figure and PR.

I think your fix, making the compute stream wait back on the reduce stream, makes sense to me, especially when compute is shorter.

To make the PR more concise, could you remove your modification to the NPU fused Adam from the current PR and open a new PR for fused Adam? (Mainly because it is unrelated to this PR's title, "fix overlap communication...")

penn513 · 2024-06-06T07:26:08Z

Thanks for your reply. It's been updated.

`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor` only makes the reduction stream wait for the default stream. This is fine when computation takes longer than communication, but when communication takes longer, the ipg_buffer may be overwritten before the communication has completed.

![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)

The simplest fix is to also make the default stream wait for the reduction stream at the **same point**. For example, at point 1, the `reduction stream` needs to wait for '2', so we add a wait_stream that makes the `reduction stream` wait for the `default stream`. Likewise, the `default stream` needs to wait for 'A', so we add a wait_stream that makes the `default stream` wait for the `reduction stream` before 'B'.

![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)

Compared with the modification of deepspeedai#5523, wait_stream does not cause host synchronization.

Compared with the modification of deepspeedai#5545, this change is simpler and the logic is the same: wait only for what needs to be waited for.

---

With this modification, the losses of Qwen-1.5 with and without overlap_comm are identical.

![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)

---

By contrast, without the modification there is an obvious gap when the sequence length is small, that is, when the computation time is short.

![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)

Co-authored-by: gp513 <guopeng34@huawei.com>
Co-authored-by: CurryRice233 <nmeia@qq.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

penn513 requested review from tjruwase and mrwyattii as code owners June 3, 2024 09:21

penn513 force-pushed the master branch 2 times, most recently from 6a001f9 to 1319870 Compare June 3, 2024 14:09

tjruwase requested review from GuanhuaWang and removed request for mrwyattii June 4, 2024 07:54

fix overlap comm bug for ZeRO stage 1&2

6c75381

Co-authored-by: CurryRice233 <nmeia@qq.com>

penn513 force-pushed the master branch from 1319870 to 6c75381 Compare June 6, 2024 07:19

Merge branch 'master' into master

4074eb4

jomayeri approved these changes Jun 7, 2024

jomayeri added this pull request to the merge queue Jun 7, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 7, 2024

jomayeri added this pull request to the merge queue Jun 7, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 7, 2024

Merge branch 'master' into master

604e9d3

loadams enabled auto-merge June 7, 2024 22:21

loadams added this pull request to the merge queue Jun 9, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 9, 2024

loadams added this pull request to the merge queue Jun 10, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 10, 2024

tjruwase added this pull request to the merge queue Jun 10, 2024

Merged via the queue into deepspeedai:master with commit a41729f Jun 10, 2024
15 checks passed

efsotr mentioned this pull request Jul 12, 2024

[BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2 #5242

Open

tjruwase mentioned this pull request Aug 3, 2024

[BUG] deepspeed overlap_comm data race #5545

Open
