[V1] Fix Compilation config & Enable CUDA graph by default #10528
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
```python
_, total_gpu_memory = torch.cuda.mem_get_info()
```
Is there a reason why this doesn't just use `total_gpu_memory` from after the profile run (like it was done before)?
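For context, a minimal sketch of the API in question: `torch.cuda.mem_get_info()` returns a `(free, total)` tuple for the current device, so the total is a fixed device property while the free portion reflects allocations from all processes.

```python
import torch

# mem_get_info() returns (free_bytes, total_bytes) for the current device.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")

# total_bytes does not change across a profile run; free_bytes does,
# including allocations made by other processes on the same GPU.
```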
```python
torch.cuda.empty_cache()
torch_allocated_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
total_allocated_bytes = torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]
```
I think this is too pessimistic: if anything else is allocated on the GPU (like when we want to run two LLM instances in tests), this will count all that memory as if it were allocated in the forward pass. I think we should instead just subtract two values here: #18974
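A hypothetical sketch of the suggested two-snapshot approach (`run_profile` is a placeholder, not a vLLM API): difference the free memory before and after the profile run, so memory held by other processes or a second LLM instance is not attributed to the forward pass.

```python
import torch

def run_profile() -> None:
    # Placeholder for the actual profile run.
    pass

# Snapshot free memory before profiling.
free_before, _ = torch.cuda.mem_get_info()

run_profile()

torch.cuda.synchronize()
torch.cuda.empty_cache()

# Snapshot free memory after profiling.
free_after, _ = torch.cuda.mem_get_info()

torch_allocated = torch.cuda.memory_stats()["allocated_bytes.all.current"]

# Only memory consumed between the two snapshots is counted; subtracting
# torch's live allocations approximates non-torch usage (NCCL buffers,
# cuBLAS workspaces, etc.).
non_torch_allocations = (free_before - free_after) - torch_allocated
```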
This PR overrides the compilation level that the user provides. Is that expected? The level will be overridden by this code when VLLM_USE_V1 is on and go back to PIECEWISE.
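A paraphrased sketch of the override being described (the constants mirror vLLM's `CompilationLevel` enum, but the code path is simplified; `resolve_compilation_level` is a hypothetical name):

```python
import os

# Values mirror vLLM's CompilationLevel enum.
NO_COMPILATION, DYNAMO_AS_IS, DYNAMO_ONCE, PIECEWISE = 0, 1, 2, 3

def resolve_compilation_level(user_level: int) -> int:
    # When VLLM_USE_V1 is set, the user-provided level is discarded
    # and the config falls back to PIECEWISE.
    if os.environ.get("VLLM_USE_V1") == "1":
        return PIECEWISE
    return user_level
```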
Yes, I believe #19340 is supposed to address this.
This PR fixes a performance bug on V1 introduced by #10437, which disabled custom ops even when `torch.compile` was not used. Also, this PR enables piecewise CUDA graphs by default, now that #10237 has been merged.