[Core] avoid too many cuda context by caching p2p test #4021
Conversation
QQ: why do we write the cache to a file? Is it so that all the workers can access the cache info?
In that case, what happens if there are 2 vLLM instances running at the same time? (Do the cache files overwrite each other, or does the check only measure GPU-to-GPU accessibility, so that both instances can share the same result?)
If we don't cache it into a file, every master process still needs to initialize a CUDA context on every GPU, which again leads to the $n * (n-1)$ CUDA contexts this PR is trying to avoid.
Note that the cache file name is suffixed by `CUDA_VISIBLE_DEVICES`, so concurrent instances running on different GPUs won't overwrite each other's cache, while instances seeing the same GPUs share the same result.
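As an illustration of that naming scheme, here is a minimal sketch of a per-instance cache path; the directory, file name pattern, and helper name are assumptions for the example, not necessarily what the PR uses.

```python
import os


def p2p_cache_path() -> str:
    """Hypothetical helper: build a cache path that is unique per set of visible GPUs."""
    # Two instances pinned to different GPU sets get different files and never
    # clobber each other; instances seeing the same GPUs deliberately share one result.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "all")
    cache_dir = os.path.expanduser("~/.config/vllm")
    os.makedirs(cache_dir, exist_ok=True)
    return os.path.join(cache_dir, f"gpu_p2p_access_cache_for_{visible}.json")
```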
QQ: Can we directly use cudaIpcOpenMemHandle instead?
We should seek help from @hanzhi713, I'm not familiar with this :(
We can use
Makes sense! Maybe it'd be great to have a short comment in the function explaining the motivation for writing the cache to a file!
Technically many functions can be used to detect p2p access, but the details of which of them create an additional CUDA context are extremely vague. cudaIpcOpenMemHandle or cudaDeviceCanAccessPeer might work, but it is hard to say whether they will place an extra CUDA context on the GPU. That said, I think caching the result is a universal fix, regardless of how we detect p2p access.
Can anyone stamp my PR? Or is any additional modification required?
Multi-node is not working with vLLM anyway, so I'm not sure if we should handle it in this PR.
This is not correct: vLLM does support multi-node. @esmeetu I added the case for multi-node by letting only one process per node (the local master) create the cache.
Technically either cudaDeviceCanAccessPeer or cudaIpcOpenMemHandle would suffice. cudaDeviceCanAccessPeer is what torch.cuda.can_device_access_peer calls under the hood. However, the CUDA driver is sometimes buggy and might report p2p as supported even though it isn't; this can occur on 3090 and 4090. Thus, we need to perform actual p2p copies and check that the result is correct to work around the driver bug.
I can confirm this. Either way, the caching in this PR is reasonable: the p2p access pattern between GPUs seldom changes.
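A rough sketch of the copy-and-verify idea described above, assuming PyTorch's `torch.cuda.can_device_access_peer` for the driver query. Whether a plain tensor copy is enough to exercise the buggy peer path is itself an assumption, so the real check may need to force a direct peer copy (e.g. via IPC handles); the function name and sizes here are illustrative.

```python
import torch


def can_actually_p2p(src: int, tgt: int, num_elems: int = 1024) -> bool:
    """Verify p2p with a real round-trip copy, not just the driver's claim."""
    # The driver (cudaDeviceCanAccessPeer) can erroneously report support on
    # some GPUs (e.g. 3090/4090), so a positive answer alone is not trusted.
    if not torch.cuda.can_device_access_peer(src, tgt):
        return False
    a = torch.randn(num_elems, device=f"cuda:{src}")
    b = a.to(f"cuda:{tgt}")   # src -> tgt copy
    c = b.to(f"cuda:{src}")   # tgt -> src copy back
    torch.cuda.synchronize()
    return bool(torch.equal(a, c))  # the data must survive the round trip
```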
Multi-node was not supported when using custom-all-reduce. So this PR looks good to me. |
It is observed in #3821 that every worker takes memory in every GPU, because of the p2p test in `_can_p2p`: the test makes every process allocate a CUDA context on every GPU, in total leading to $n * (n-1)$ CUDA contexts. To avoid this, we can cache the result of the p2p test.
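A condensed sketch of that caching flow, under the same assumptions as the snippets above; `gpu_p2p_access_check` and the JSON layout are illustrative, and the plain driver query stands in for the full copy-and-verify test.

```python
import json
import os

import torch


def gpu_p2p_access_check(src: int, tgt: int, cache_path: str) -> bool:
    """Answer "can GPU `src` reach GPU `tgt` via p2p?" from a cached file."""
    if not os.path.exists(cache_path):
        # Run the expensive pairwise test once and persist it; later callers
        # (and other worker processes) only read the JSON, so they never
        # allocate CUDA contexts on GPUs they do not own.
        n = torch.cuda.device_count()
        result = {
            f"{i}->{j}": torch.cuda.can_device_access_peer(i, j)
            for i in range(n)
            for j in range(n)
            if i != j
        }
        # In the real check, the copy-and-verify test sketched earlier would
        # replace the plain driver query to dodge the driver bug.
        with open(cache_path, "w") as f:
            json.dump(result, f)
    with open(cache_path) as f:
        return json.load(f)[f"{src}->{tgt}"]
```

Combined with the path helper sketched earlier, a worker would call something like `gpu_p2p_access_check(rank, peer, p2p_cache_path())`. In a multi-node setup, only one process per node (the local master) would build the file while the other local ranks wait for it.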
Before this PR (tp=4):
GPU blocks: 11762, CPU blocks: 2048
Throughput: 8.17 requests/s, 3932.45 tokens/s
After this PR (tp=4):
GPU blocks: 12222, CPU blocks: 2048
Throughput: 8.21 requests/s, 3953.50 tokens/s
Conclusion
We fought really hard to save 306 MiB * tp * tp of memory.
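For a rough sense of scale, a back-of-the-envelope check of that figure (the 306 MiB per redundant CUDA context and the tp * tp count are taken at face value from this PR):

```python
ctx_mib = 306                         # assumed size of one redundant CUDA context
tp = 4                                # tensor-parallel size from the benchmark
total_gib = ctx_mib * tp * tp / 1024  # the PR's tp * tp estimate -> ~4.8 GiB
per_gpu_mib = ctx_mib * (tp - 1)      # with n*(n-1) contexts in total, (tp-1) land on each GPU
print(f"~{total_gib:.1f} GiB in total, ~{per_gpu_mib} MiB per GPU")
```

Whether that lines up exactly with the ~460 extra GPU blocks reported above depends on the model's KV-cache block size.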