-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][distributed] fix custom allreduce in pytorch 2.5 #9815
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
cc @hanzhi713 can you please improve the code if you have time? ideally we should use pytorch's user-facing api, rather than this private api. example of user-facing api: # producer:
import torch
from torch.multiprocessing.reductions import reduce_tensor
inp = torch.randn(5, 5).cuda()
out = reduce_tensor(inp)
# send `out` to consumer
# consumer
func = out[0]
tensor = func(*out[1]) this way, we won't suffer from too many pytorch internal details. and once we share the tensor from python side, c++ side code will be much simpler, and we can also benefit from expandable segment in the future. |
FYI - can confirm this fixes the issue with TP=2 on NVLink A6000's that was introduced with the upgrade to pytorch 2.5. Nice catch on the handle format change. I thought I was going crazy, but hadn't noticed the 2-byte change in length of the handle itself when I compared the data doing into the ipc functions in the two versions. |
@cedonley thanks for your report and investigation! |
@youkaichao There're no user facing API for getting a shareable handle. To avoid using internal Pytorch API, I think we can just call vllm/csrc/custom_all_reduce.cuh Line 359 in 04a3ae0
The downside would be that we will lose pytorch's safeguards against leaks, but I think that might not be a problem since allocations in custom allreduce are one-time allocations. |
I will have some time this weekend to work on this. |
do you really need a handle? we can get an ipc-shared tensor directly: # producer:
import torch
from torch.multiprocessing.reductions import reduce_tensor
inp = torch.randn(5, 5).cuda()
out = reduce_tensor(inp)
# send `out` to consumer
# consumer
func = out[0]
tensor = func(*out[1]) |
@hanzhi713 welcome to join our new slack https://slack.vllm.ai for chatting and collaborating! |
How do you share such a handle to other processes? I think vllm still runs one process per GPU right? Is it just sending pickled data through torch.distributed? IMHO torch.multiprocessing is designed for sharing handles through multiprocessing.Process. It may work with generic processes, but I'm not sure if there are any caveats. |
yes.
it applies to general processes, no matter how the process is launched. |
…#9815) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com>
…#9815) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com>
…#9815) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com>
fixes #9774
pytorch changes the binary format of the ipc handle.