reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626
Conversation
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged.
Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?
This is a comparison of the loss curves before and after the modification; they are consistent.
…t are tensor-parallel #5626 Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution. @siddharth9820, thanks for helping to review. Approved.
The first failed test was due to HTTP 429: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626
@tjruwase or someone else working at DeepSpeed might be able to help you with CI.
@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.
Thanks a lot. There are still failed tests this time: https://github.com/microsoft/DeepSpeed/actions/runs/9728828772/job/26861161513?pr=5626
The CI failed again. Please help trigger it again. Thanks @tjruwase @siddharth9820
It failed again due to HTTP 429. Can I modify this test code to catch the 429 exception or add a sleep? @tjruwase @siddharth9820
@tjruwase is there any way I can help out with this?
@siddharth9820, thanks for offering. The problem is due to our flaky CI, and we are working to resolve it. @taozhiwei, thanks for your patience.
@tjruwase it failed again.
Hi @taozhiwei, we are still working on this; as soon as we can get things running in CI again (infrastructure issues), I'll make sure this PR is merged.
Tests passed, re-merging develop, then this should be merged. Thanks for your patience, @taozhiwei
Lessgo!
Thank you very much for your help! @loadams @tjruwase @siddharth9820
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
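For reference, here is a minimal sketch of how the group layout above could be constructed with `torch.distributed` (assuming 8 already-initialized ranks). DeepSpeed builds these groups internally, so the names `mp_groups` and `ep_groups` are purely illustrative.

```python
import torch.distributed as dist

# Illustrative only: DeepSpeed creates its parallel groups internally;
# this just reproduces the layout from the example above.
world_size = 8
model_degree = 2    # tensor (model) parallel degree
expert_degree = 4   # expert parallel degree
assert world_size == model_degree * expert_degree

# mp_group: consecutive ranks form one tensor-parallel group, e.g. [0, 1], [2, 3], ...
mp_groups = [dist.new_group(list(range(start, start + model_degree)))
             for start in range(0, world_size, model_degree)]

# expert_parallel_group: one rank from each mp_group, e.g. [0, 2, 4, 6], [1, 3, 5, 7]
ep_groups = [dist.new_group(list(range(offset, world_size, model_degree)))
             for offset in range(model_degree)]
```

Each mp_group holds one tensor-parallel shard set of the model, and each expert_parallel_group takes exactly one rank from every mp_group.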
In the original execution order there was no drop before the expert layer: the two expert-parallel groups each ran their all-to-all on the full data. Both ended up with the complete result, but ranks 0 and 1 received exactly the same data (and likewise ranks 2 and 3, and so on), because ranks in the same tensor-parallel group hold duplicate activations.
Therefore, we can drop the duplicated data before the all-to-all, and then run an allgather after the all-to-all to reconstruct the complete data, which shrinks the all-to-all volume by the tensor-parallel degree.
After the expert layer, the data on ranks 0 and 1 is again exactly the same, so we can apply the same pattern: drop, then all-to-all, then allgather to recover the complete data.
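To make the optimized order concrete, here is a minimal PyTorch sketch, not DeepSpeed's actual implementation: the helper names `drop_tokens`, `gather_tokens`, `all_to_all`, and `expert_fn`, as well as the `mp_group`/`ep_group` handles, are assumptions for illustration. It drops the tensor-parallel duplicates before each all-to-all and allgathers afterwards.

```python
import torch
import torch.distributed as dist

def drop_tokens(x: torch.Tensor, mp_group, dim: int = 0) -> torch.Tensor:
    """Keep only this rank's slice along `dim`; ranks in the same tensor-parallel
    group hold identical copies, so dropping loses no information."""
    tp_rank = dist.get_rank(group=mp_group)
    tp_size = dist.get_world_size(group=mp_group)
    chunk = x.shape[dim] // tp_size      # assumes the dim divides evenly
    return x.narrow(dim, tp_rank * chunk, chunk).contiguous()

def gather_tokens(x: torch.Tensor, mp_group, dim: int = 0) -> torch.Tensor:
    """Inverse of drop_tokens: allgather the slices back into the full tensor."""
    tp_size = dist.get_world_size(group=mp_group)
    parts = [torch.empty_like(x) for _ in range(tp_size)]
    dist.all_gather(parts, x, group=mp_group)
    return torch.cat(parts, dim=dim)

def all_to_all(x: torch.Tensor, ep_group) -> torch.Tensor:
    """Even all-to-all over the expert-parallel group along dim 0."""
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=ep_group)
    return out

# Optimized execution order around the expert layer (sketch):
#   drop -> all-to-all -> allgather -> expert -> drop -> all-to-all -> allgather
def moe_dispatch_combine(tokens, expert_fn, mp_group, ep_group):
    x = drop_tokens(tokens, mp_group)    # send only 1/TP of the original volume
    x = all_to_all(x, ep_group)          # dispatch to experts at reduced volume
    x = gather_tokens(x, mp_group)       # reconstruct the complete expert input
    x = expert_fn(x)                     # expert computation (tensor-parallel)
    x = drop_tokens(x, mp_group)         # outputs are duplicated across TP ranks
    x = all_to_all(x, ep_group)          # combine at reduced volume
    return gather_tokens(x, mp_group)    # reconstruct the complete output
```

In this sketch each rank sends only 1/model_degree of the tokens through each all-to-all, which is the communication saving this PR targets; the allgather that restores the full data happens within the tensor-parallel group.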