reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626
Conversation
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged.
Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?
This is a comparison of the loss curves before and after the modification; they are consistent.
…t are tensor-parallel #5626 Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution. @siddharth9820, thanks for helping to review. Approved.
The first failed test was due to HTTP 429: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626
@tjruwase or someone else working at DeepSpeed might be able to help you with CI.
@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.
Thanks a lot. There are still failed tests this time: https://github.com/microsoft/DeepSpeed/actions/runs/9728828772/job/26861161513?pr=5626
The CI failed again. Please help trigger it again. Thanks @tjruwase @siddharth9820
It failed again due to HTTP 429. Can I modify this test code to catch the 429 exception or add a sleep? @tjruwase @siddharth9820
@tjruwase is there any way I can help out with this?
@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we are working to resolve it. @taozhiwei, thanks for your patience.
@tjruwase it failed again.
Hi @taozhiwei, we are still working on this; as soon as we can get things running in CI again (infrastructure issues), I'll make sure this PR is merged.
Tests passed, re-merging develop, then this should be merged. Thanks for your patience, @taozhiwei
Lessgo!
Thank you very much for your help! @loadams @tjruwase @siddharth9820
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
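For reference, here is a minimal sketch of how the group layout above could be constructed with `torch.distributed` (assuming 8 already-initialized ranks). DeepSpeed builds these groups internally, so the names `mp_groups` and `ep_groups` are purely illustrative.

```python
import torch.distributed as dist

# Illustrative only: DeepSpeed creates its parallel groups internally;
# this just reproduces the layout from the example above.
world_size = 8
model_degree = 2    # tensor (model) parallel degree
expert_degree = 4   # expert parallel degree
assert world_size == model_degree * expert_degree

# mp_group: consecutive ranks form one tensor-parallel group, e.g. [0, 1], [2, 3], ...
mp_groups = [dist.new_group(list(range(start, start + model_degree)))
             for start in range(0, world_size, model_degree)]

# expert_parallel_group: one rank from each mp_group, e.g. [0, 2, 4, 6], [1, 3, 5, 7]
ep_groups = [dist.new_group(list(range(offset, world_size, model_degree)))
             for offset in range(model_degree)]
```

Each mp_group holds one tensor-parallel shard set of the model, and each expert_parallel_group takes exactly one rank from every mp_group.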
In the original execution order there was no drop before the expert layer: the two expert-parallel groups each ran their all-to-all on the full data. Both ended up with the complete result, but ranks 0 and 1 received exactly the same data (and likewise ranks 2 and 3, and so on), because ranks in the same tensor-parallel group hold duplicate activations.
Therefore, we can drop the duplicated data before the all-to-all, and then run an allgather after the all-to-all to reconstruct the complete data, which shrinks the all-to-all volume by the tensor-parallel degree.
After the expert layer, the data on ranks 0 and 1 is again exactly the same, so we can apply the same pattern: drop, then all-to-all, then allgather to recover the complete data.
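To make the optimized order concrete, here is a minimal PyTorch sketch, not DeepSpeed's actual implementation: the helper names `drop_tokens`, `gather_tokens`, `all_to_all`, and `expert_fn`, as well as the `mp_group`/`ep_group` handles, are assumptions for illustration. It drops the tensor-parallel duplicates before each all-to-all and allgathers afterwards.

```python
import torch
import torch.distributed as dist

def drop_tokens(x: torch.Tensor, mp_group, dim: int = 0) -> torch.Tensor:
    """Keep only this rank's slice along `dim`; ranks in the same tensor-parallel
    group hold identical copies, so dropping loses no information."""
    tp_rank = dist.get_rank(group=mp_group)
    tp_size = dist.get_world_size(group=mp_group)
    chunk = x.shape[dim] // tp_size      # assumes the dim divides evenly
    return x.narrow(dim, tp_rank * chunk, chunk).contiguous()

def gather_tokens(x: torch.Tensor, mp_group, dim: int = 0) -> torch.Tensor:
    """Inverse of drop_tokens: allgather the slices back into the full tensor."""
    tp_size = dist.get_world_size(group=mp_group)
    parts = [torch.empty_like(x) for _ in range(tp_size)]
    dist.all_gather(parts, x, group=mp_group)
    return torch.cat(parts, dim=dim)

def all_to_all(x: torch.Tensor, ep_group) -> torch.Tensor:
    """Even all-to-all over the expert-parallel group along dim 0."""
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=ep_group)
    return out

# Optimized execution order around the expert layer (sketch):
#   drop -> all-to-all -> allgather -> expert -> drop -> all-to-all -> allgather
def moe_dispatch_combine(tokens, expert_fn, mp_group, ep_group):
    x = drop_tokens(tokens, mp_group)    # send only 1/TP of the original volume
    x = all_to_all(x, ep_group)          # dispatch to experts at reduced volume
    x = gather_tokens(x, mp_group)       # reconstruct the complete expert input
    x = expert_fn(x)                     # expert computation (tensor-parallel)
    x = drop_tokens(x, mp_group)         # outputs are duplicated across TP ranks
    x = all_to_all(x, ep_group)          # combine at reduced volume
    return gather_tokens(x, mp_group)    # reconstruct the complete output
```

In this sketch each rank sends only 1/model_degree of the tokens through each all-to-all, which is the communication saving this PR targets; the allgather that restores the full data happens within the tensor-parallel group.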