
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint #5074

Merged: 10 commits merged into vllm-project:main from cudagraph_save_memory on Jun 9, 2024

Conversation

youkaichao
Member

Currently, each cudagraph captured for one batch size incurs an additional memory cost of TP * BATCH_SIZE * HIDDEN_SIZE for its output tensor, so the total memory footprint across all cudagraphs is TP * SUM(BATCH_SIZES) * HIDDEN_SIZE.

If we share an output buffer across graphs as well (as we already do for input buffers), the total memory footprint for all cudagraphs becomes TP * MAX(BATCH_SIZES) * HIDDEN_SIZE, which is constant regardless of how many graphs we capture.
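A minimal sketch of the idea, assuming a toy model and illustrative sizes (this is not the PR's actual code): every graph is captured into a shared memory pool and copies its result into one output buffer sized for the largest batch.

```python
import torch

hidden_size = 4096                        # illustrative
batch_sizes = [8, 4, 2, 1]                # capture largest first
model = torch.nn.Linear(hidden_size, hidden_size, device="cuda").half()

# One output buffer sized for the largest batch, shared by every graph.
output_buffer = torch.empty(
    max(batch_sizes), hidden_size, dtype=torch.half, device="cuda")
pool = torch.cuda.graph_pool_handle()     # share intermediates across graphs
input_buffers, graphs = {}, {}

# Warm up on a side stream before capture, as the PyTorch docs recommend.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for bs in batch_sizes:
        input_buffers[bs] = torch.zeros(
            bs, hidden_size, dtype=torch.half, device="cuda")
        model(input_buffers[bs])
torch.cuda.current_stream().wait_stream(side)

for bs in batch_sizes:
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool):
        out = model(input_buffers[bs])
        # The copy this PR adds: the result lands in the shared buffer,
        # so no per-graph output tensor has to stay alive.
        output_buffer[:bs].copy_(out)
    graphs[bs] = g

# Replay for batch size 4: fill the input, replay, read the shared buffer.
input_buffers[4].copy_(
    torch.randn(4, hidden_size, dtype=torch.half, device="cuda"))
graphs[4].replay()
result = output_buffer[:4]
```

Without that final copy, each graph's own output tensor stays alive and pins its allocation, which is where the per-graph TP * BATCH_SIZE * HIDDEN_SIZE cost comes from.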

@youkaichao
Member Author

Performance-wise, this adds one extra memory-copy operation to the cudagraph. The overhead should be negligible: the copy is recorded inside the cudagraph itself, and it is small (roughly BATCH_SIZE * HIDDEN_SIZE elements per rank, e.g. about 2 MB for batch size 256 and hidden size 4096 in fp16).

I ran the benchmark and found that the throughput difference is within run-to-run variation.

Before this PR:
Throughput: 18.57 requests/s, 9510.07 tokens/s

After this PR:
Throughput: 18.59 requests/s, 9516.03 tokens/s

@youkaichao youkaichao changed the title [Core][CUDA Graph] add output buffer for cudagraph [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint May 27, 2024
@sfc-gh-hazhang
Copy link
Contributor

I observe that when cuda_graph is enabled, the pp branch uses significantly more memory. Might this issue be related?

@youkaichao
Copy link
Member Author

> I observe that when cuda_graph is enabled, the pp branch uses significantly more memory. Might this issue be related?

Yes, that is possible. To use cudagraphs with multiple sizes without blowing up memory, we need buffer-sharing techniques, just as we already do for inputs and as this PR does for outputs. A sketch of the input side is shown below.
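A hedged sketch of that input-side sharing (variable names are illustrative, not vLLM's): every graph is captured reading from a slice of one max-size tensor, so smaller batch sizes reuse the same storage.

```python
import torch

hidden_size = 4096                  # illustrative
batch_sizes = [8, 4, 2, 1]

# One max-size input tensor; each graph is captured on a slice of it,
# so every batch size shares the same underlying storage.
shared_input = torch.zeros(max(batch_sizes), hidden_size, device="cuda")
views = {bs: shared_input[:bs] for bs in batch_sizes}

# At runtime, write the real batch into the shared storage, then replay
# the graph that was captured on that view.
batch = torch.randn(4, hidden_size, device="cuda")
views[4].copy_(batch)
# graphs[4].replay()  # graph previously captured reading views[4]
```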

@youkaichao
Copy link
Member Author

youkaichao commented May 27, 2024

If this PR is merged, more cuda graphs can be captured at no additional memory cost, which raises further questions:

  • Should we capture as many cuda graphs as we can (all batch sizes up to the maximum), so that we no longer need padding?
  • Should we reduce _BATCH_SIZE_ALIGNMENT from 8 to 4 or 2, so that less computation is wasted on padding? (See the padding sketch at the end of this comment.)
  • Should we align the batch sizes used for tuning MoE kernels with the cudagraph batch sizes? The tuning currently iterates over:

for bs in [
        1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536,
        2048, 3072, 4096
]:

cc @WoosukKwon for cudagraph
cc @pcmoritz for moe kernels

Update: I realize that each cudagraph costs some memory anyway, so we don't need to aggressively capture many cudagraphs.
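For context, the padding mentioned in the first two bullets works roughly as follows. In this hedged sketch, `_BATCH_SIZE_ALIGNMENT` and the capture list mirror vLLM's values at the time, while the helper name and the asserts are illustrative.

```python
_BATCH_SIZE_ALIGNMENT = 8
# Graphs are captured for 1, 2, 4, then multiples of the alignment.
_BATCH_SIZES_TO_CAPTURE = [1, 2, 4] + [
    _BATCH_SIZE_ALIGNMENT * i for i in range(1, 33)
]

def pad_for_cudagraph(batch_size: int) -> int:
    """Round a runtime batch size up to the nearest captured size."""
    if batch_size <= 2:
        return batch_size
    if batch_size <= 4:
        return 4
    return ((batch_size + _BATCH_SIZE_ALIGNMENT - 1)
            // _BATCH_SIZE_ALIGNMENT * _BATCH_SIZE_ALIGNMENT)

assert pad_for_cudagraph(3) == 4    # 1 padded row of wasted compute
assert pad_for_cudagraph(9) == 16   # 7 padded rows of wasted compute
```

A smaller alignment wastes fewer padded rows but requires capturing more graphs, which is exactly why the per-graph memory cost matters.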

@youkaichao
Member Author

For llama2-7b:

before this PR:

before capture graph with batch size 256 Used Memory: 19453.89 MB
after capture graph with batch size 256 Used Memory: 19464.02 MB
before capture graph with batch size 248 Used Memory: 19464.02 MB
after capture graph with batch size 248 Used Memory: 19465.95 MB
before capture graph with batch size 1 Used Memory: 19498.69 MB
after capture graph with batch size 1 Used Memory: 19498.70 MB
INFO 05-27 23:21:09 model_runner.py:920] Graph capturing finished in 6 secs.

The first graph takes about 10 MB (19464.02 - 19453.89), the second about 2 MB, and in total the cudagraphs take about 45 MB (19498.70 - 19453.89).

after this PR:

before capture graph with batch size 256 Used Memory: 19455.89 MB
after capture graph with batch size 256 Used Memory: 19464.02 MB
before capture graph with batch size 248 Used Memory: 19464.02 MB
after capture graph with batch size 248 Used Memory: 19464.02 MB
...
before capture graph with batch size 1 Used Memory: 19464.05 MB
after capture graph with batch size 1 Used Memory: 19464.05 MB
INFO 05-27 23:18:21 model_runner.py:927] Graph capturing finished in 8 secs.

The first graph takes about 8 MB (19464.02 - 19455.89), the second takes essentially 0 MB, and in total the cudagraphs take only about 8 MB. The memory used after capturing all graphs (19464.05 MB) is almost the same as the memory used after capturing just one graph before this PR (19464.02 MB).
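The "Used Memory" numbers above can be obtained with a query like the one below; this is a hedged sketch, and vLLM's actual logging code may differ.

```python
import torch

def used_memory_mb() -> float:
    # mem_get_info reports device-wide (free, total) bytes.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024 / 1024

bs = 256  # illustrative
print(f"before capture graph with batch size {bs} "
      f"Used Memory: {used_memory_mb():.2f} MB")
# ... capture the cudagraph for batch size `bs` here ...
print(f"after capture graph with batch size {bs} "
      f"Used Memory: {used_memory_mb():.2f} MB")
```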

@rkooo567 rkooo567 self-assigned this Jun 4, 2024
@WoosukKwon
Collaborator

@rkooo567 I can take this PR if you're OK with that. Please let me know!

@rkooo567
Collaborator

rkooo567 commented Jun 4, 2024

oh yeah please go ahead!

@WoosukKwon WoosukKwon self-assigned this Jun 6, 2024
@WoosukKwon WoosukKwon left a comment (Collaborator)

@youkaichao Thanks for the PR! This is a good finding. Please check my comments.

vllm/worker/model_runner.py (review comment, resolved)
@WoosukKwon WoosukKwon left a comment (Collaborator)

LGTM!

@WoosukKwon
Collaborator

@youkaichao Is the CI failure related to this PR?

@youkaichao
Member Author

> @youkaichao Is the CI failure related to this PR?

investigating.

@youkaichao youkaichao merged commit 0373e18 into vllm-project:main Jun 9, 2024
103 checks passed
@youkaichao youkaichao deleted the cudagraph_save_memory branch June 9, 2024 02:14
This pull request was later referenced by commits in downstream repositories, all carrying the title "[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (vllm-project#5074)":

  • dtrifiro pushed a commit to opendatahub-io/vllm (Jun 10, 2024)
  • robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm (Jun 11, 2024)
  • joerunde pushed a commit to joerunde/vllm (Jun 17, 2024)
  • xjpang pushed commits to xjpang/vllm (Jun 27, Jul 8, and Jul 24, 2024)
  • Temirulan pushed a commit to Temirulan/vllm-whisper (Sep 6, 2024)