Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626

Merged
merged 9 commits into from
Jul 23, 2024

Conversation

taozhiwei
Copy link
Contributor

@taozhiwei taozhiwei commented Jun 7, 2024

Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2,3], [4,5],[6,7]
expert_parallel_group = [0,2,4,6], [1,3,5,7]

The original execution method was that before executing Expert, there was no drop operation, and two EPs did all-to-all separately. In the end, they both obtained complete data, but 0 and 1 obtained exactly the same data. Similarly, 2, 3, and so on all obtained the same data.
Therefore, we can drop the data before executing all-to-all, and then execute allgather after all-to-all to obtain the complete data.

After executing Expert, the data on 0 and 1 is exactly the same, so we can drop it and then execute all-to-all , and then execute allgather to obtain the complete data.

  1. non-expert use TP, expert not use TP: drop -> alltoall -> exe MOE -> alltoall -> allgather
  2. both non-expert and expert all use TP:
    • the original execution order: alltoall -> exe MOE -> alltoall
    • optimized execution order: drop -> alltoall -> allgather -> exe MOE -> drop ->alltoall -> allgather

@tjruwase
Copy link
Contributor

tjruwase commented Jun 9, 2024

@siddharth9820, can you help review?

@tjruwase tjruwase requested review from tohtana and removed request for arashb and mrwyattii June 9, 2024 22:37
@siddharth9820
Copy link
Contributor

At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!

@taozhiwei
Copy link
Contributor Author

At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
Thank you for your review,I just updated the mainline code and I need you to perform another test. Thank you.

@taozhiwei taozhiwei closed this Jun 17, 2024
@taozhiwei taozhiwei reopened this Jun 17, 2024
@taozhiwei
Copy link
Contributor Author

@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.

@siddharth9820
Copy link
Contributor

siddharth9820 commented Jun 18, 2024

@taozhiwei still lgtm. Do you have some convergence curves for your changes?

@taozhiwei
Copy link
Contributor Author

@taozhiwei still lgtm. Do you have some convergence curves for your changes?

I ran a test locally and it still converged

@siddharth9820
Copy link
Contributor

Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Jun 21, 2024

@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
This test failed due to network issues and needs to be triggered CI test again. Thank you

@taozhiwei
Copy link
Contributor Author

Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.
How do I put the pictures in? I tried a few times but was unsuccessful.

@siddharth9820
Copy link
Contributor

@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Jun 25, 2024

@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?

This is a comparison of the loss curve before and after modification, which is consistent.
https://imgur.com/Nhj7c1m
Please help review again @siddharth9820 @tjruwase

…t are tensor-parallel #5626

Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
@siddharth9820
Copy link
Contributor

siddharth9820 commented Jun 27, 2024

Thanks for doing this. LGTM. @tjruwase do we need any other tests?

@tjruwase
Copy link
Contributor

Thanks for doing this. LGTM. @tjruwase do we need any other tests?

@taozhiwei, thanks for the PR. This is really a great contribution.

@siddharth9820, thanks for helping to review.

Approved.

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Jun 28, 2024

Thanks for doing this. LGTM. @tjruwase do we need any other tests?

@taozhiwei, thanks for the PR. This is really a great contribution.

@siddharth9820, thanks for helping to review.

Approved.

The first failed test was due to http 429,https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626.
The second failed test I tested locally was passed,https://imgur.com/v2eMEox,meanwhile, my RP should not have any impact on this failed test
Can you help run the CI test again? thank you! @siddharth9820

@siddharth9820
Copy link
Contributor

siddharth9820 commented Jun 28, 2024

@tjruwase or someone else working at Deepspeed might be able to help you with CI

@tjruwase
Copy link
Contributor

tjruwase commented Jul 1, 2024

@tjruwase or someone else working at Deepspeed might be able to help you with CI

@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.

@taozhiwei
Copy link
Contributor Author

@tjruwase or someone else working at Deepspeed might be able to help you with CI

@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.

thanks a lot,There are still failed tests this time.https://github.com/microsoft/DeepSpeed/actions/runs/9728828772/job/26861161513?pr=5626
I am unable to reproduce these failed tests locally, CI see detailed logs of failures?

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Jul 3, 2024

@tjruwase or someone else working at Deepspeed might be able to help you with CI

@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.

The CI failed again, Please help trigger it again. thanks @tjruwase @siddharth9820

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Jul 10, 2024

due http 429 failed again. Can I modify this test code to catch 429 exception or add sleep ? @tjruwase @siddharth9820

def sleepy_iterator(numbers, sleep_time):
    for number in numbers:
        yield number
        time.sleep(sleep_time)

model_data["model_list"] = [ModelInfo(modelId=m.modelId, pipeline_tag=m.pipeline_tag, tags=m.tags) for m in sleepy_iterator(api.list_models(), 1))]

@siddharth9820
Copy link
Contributor

@tjruwase anyway I can help out with this?

@tjruwase
Copy link
Contributor

@tjruwase anyway I can help out with this?

@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we working to resolve.

@taozhiwei, thanks for your patience.

@taozhiwei
Copy link
Contributor Author

@tjruwase anyway I can help out with this?

@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we working to resolve.

@taozhiwei, thanks for your patience.

@tjruwase failed again.

@loadams
Copy link
Contributor

loadams commented Jul 19, 2024

@tjruwase anyway I can help out with this?

@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we working to resolve.
@taozhiwei, thanks for your patience.

@tjruwase failed again.

Hi @taozhiwei, we are still working on this, as soon as we can get things running in CI again (infrastructure issues) I'll make sure this PR is merged.

@loadams
Copy link
Contributor

loadams commented Jul 20, 2024

@tjruwase anyway I can help out with this?

@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we working to resolve.
@taozhiwei, thanks for your patience.

@tjruwase failed again.

Hi @taozhiwei, we are still working on this, as soon as we can get things running in CI again (infrastructure issues) I'll make sure this PR is merged.

Tests passed, re-merging develop then this should be merged. Thanks for your patience, @taozhiwei

@loadams loadams enabled auto-merge July 20, 2024 23:18
@loadams loadams added this pull request to the merge queue Jul 22, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 22, 2024
@loadams loadams added this pull request to the merge queue Jul 22, 2024
Merged via the queue into microsoft:master with commit f5d6c63 Jul 23, 2024
14 checks passed
@siddharth9820
Copy link
Contributor

Lessgo!

@taozhiwei
Copy link
Contributor Author

Thank you very much for your help! @loadams @tjruwase @siddharth9820

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants