[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint #5074
Conversation
Performance-wise, this adds one memory copy operation into the cudagraph. This should be negligible: the copy is also recorded in the cudagraph, and the copy size is quite small. I ran the benchmark and found that the throughput difference is within run-to-run variation.
Before this PR:
After this PR:
I observe that when enabling cuda_graph, the pp branch uses significantly more memory. Might this issue be related?
Yes, it is possible. To use cudagraph efficiently with multiple sizes, we need a technique like sharing buffers, just as we already do for inputs and as this PR does for outputs.
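For readers unfamiliar with the buffer-sharing idea, here is a minimal hypothetical sketch in plain PyTorch (not vLLM's actual code; the model, sizes, and names are invented for illustration):

```python
import torch

# Hypothetical sizes, for illustration only.
HIDDEN_SIZE = 4096
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
MAX_BATCH = max(BATCH_SIZES)

model = torch.nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE).half().cuda()

# One shared input buffer and one shared output buffer, sized for the largest
# batch; every captured graph reads from and writes into slices of them.
input_buffer = torch.zeros(MAX_BATCH, HIDDEN_SIZE, dtype=torch.half, device="cuda")
output_buffer = torch.zeros(MAX_BATCH, HIDDEN_SIZE, dtype=torch.half, device="cuda")

# Warm up once outside of capture (lazy cuBLAS/cuDNN init must not happen
# during graph capture).
model(input_buffer)
torch.cuda.synchronize()

graphs = {}
pool = None
for bs in sorted(BATCH_SIZES, reverse=True):  # capture the largest size first
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool):
        out = model(input_buffer[:bs])
        # The copy below is recorded inside the graph, so each replay writes
        # its result into the shared buffer instead of keeping a separate
        # per-graph output tensor alive.
        output_buffer[:bs].copy_(out)
    pool = g.pool()  # share one memory pool across all captured graphs
    graphs[bs] = g

# Replay: fill the input slice, replay the graph, read back the output slice.
bs = 8
input_buffer[:bs].copy_(torch.randn(bs, HIDDEN_SIZE, dtype=torch.half, device="cuda"))
graphs[bs].replay()
result = output_buffer[:bs].clone()
```

Sharing one pool across captures is also why capture order (largest first) matters: later, smaller graphs can reuse memory already reserved by earlier, larger ones.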
If this PR is merged, more cuda graphs can be captured at no additional memory cost, which leads to further questions:
(See vllm/benchmarks/kernels/benchmark_mixtral_moe.py, lines 19 to 22, at fbdb7b3.)
cc @WoosukKwon for cudagraph.
Update: I realize that cudagraph will cost some memory anyway, so we don't need to actively capture many cudagraphs.
For llama2-7b:
Before this PR: the first graph takes 11MB of memory, the second takes 2MB, and in total the cudagraphs take 45MB.
After this PR: the first graph takes 8MB of memory, the second takes 0MB, and in total the cudagraphs take 8MB.
We can see that the memory used after capturing all graphs (19464.05 MB) is almost the same as the memory used previously after capturing just one graph (19464.02 MB).
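For reference, one way to obtain per-graph numbers like these (a sketch, not necessarily how the numbers above were measured) is to compare driver-reported GPU memory usage around each capture:

```python
import torch

def gpu_mem_used_mb() -> float:
    # Driver-level view of used memory; this also reflects memory reserved
    # for CUDA graph pools, which allocator counters may not show clearly.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024 / 1024

before = gpu_mem_used_mb()
# ... capture one cudagraph here ...
after = gpu_mem_used_mb()
print(f"capture cost: {after - before:.2f} MB")
```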
@rkooo567 I can take this PR if you are OK with that. Just please let me know!
Oh yeah, please go ahead!
@youkaichao Thanks for the PR! This is a good finding. Please check my comments.
LGTM!
@youkaichao Is the CI failure related to this PR?
Investigating.
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (vllm-project#5074)
Currently, each cudagraph for one batch size incurs an additional memory cost of TP * BATCH_SIZE * HIDDEN_SIZE, so the total memory footprint of all cudagraphs is TP * SUM(BATCH_SIZES) * HIDDEN_SIZE. If we create an output buffer too, the total memory footprint of all cudagraphs is TP * MAX(BATCH_SIZES) * HIDDEN_SIZE, which stays constant regardless of how many graphs we capture.
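As a rough worked example of these formulas (hypothetical sizes; the formulas above omit the element size, assumed fp16 here, and the example uses a single TP rank):

```python
# Hypothetical capture configuration, for illustration only.
HIDDEN_SIZE = 4096
DTYPE_BYTES = 2  # fp16
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def bytes_per_graph(bs: int) -> int:
    return bs * HIDDEN_SIZE * DTYPE_BYTES

before = sum(bytes_per_graph(bs) for bs in BATCH_SIZES)  # one output per graph
after = bytes_per_graph(max(BATCH_SIZES))                # one shared output buffer

# With these sizes: before is about 4.0 MiB, after is about 2.0 MiB per TP rank.
print(f"before: {before / 2**20:.1f} MiB, after: {after / 2**20:.1f} MiB")
```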