Enabled Qwen2-MoE Tensor Parallelism (TP) inference #6551
Conversation
Hi @Yejing-Lai, do you want to provide some comments on this PR for Qwen2-MoE AutoTP support?

Could you try to modify this line if it can meet your needs? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/auto_tp.py#L336

Yes. It can provide the same function and result if properly coded.

Thank you for your comments.
…() for uniform code management. Both have the same function and the same result.
Hi @gyou2021, can you also add …

Added. Thank you for your comment.

Hi @tjruwase, this PR adds AutoTP support for Qwen2-MoE. @Yejing-Lai and I have reviewed this change. Thanks!
Modified `_replace_module` in `auto_tp.py`:

The modification keeps the Qwen2-MoE layers `shared_expert_gate` and `gate` as their original type, `torch.nn.Linear`, instead of converting them to `LinearLayer`. This way their weights are not split across multiple HPU/GPU cards, and Qwen2-MoE can run on multiple HPU/GPU cards.

Since the weights of `gate` are not split across cards, no all-gather operation is needed on its output, which may improve performance.
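For illustration, here is a minimal sketch of the idea. The names `replace_linears` and `KEEP_AS_NN_LINEAR` are hypothetical, and the `LinearLayer` class below is only a stub standing in for DeepSpeed's real tensor-parallel wrapper; this is not the actual `_replace_module` implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for DeepSpeed's tensor-parallel linear wrapper
# (the real class shards the weight across ranks; this stub only mimics
# the interface for the sake of a runnable example).
class LinearLayer(nn.Module):
    def __init__(self, orig: nn.Linear):
        super().__init__()
        self.weight = orig.weight  # a real implementation would shard this
        self.bias = orig.bias

    def forward(self, x):
        return F.linear(x, self.weight, self.bias)

# Module names whose weights should stay replicated on every HPU/GPU card.
KEEP_AS_NN_LINEAR = ("shared_expert_gate", "gate")

def replace_linears(module: nn.Module) -> nn.Module:
    """Recursively swap nn.Linear children for LinearLayer, except the
    Qwen2-MoE routing gates, which keep their original torch.nn.Linear type."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            if name in KEEP_AS_NN_LINEAR:
                continue  # leave the gate unsharded on each card
            setattr(module, name, LinearLayer(child))
        else:
            replace_linears(child)
    return module
```

Because every card keeps the full gate weight, each rank computes identical routing logits locally, so the all-gather that a sharded gate output would require is avoided.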