[Core][Bugfix] new way for full cudagraph, add support for FA2 and FlashInfer; Two minor bugs fixed #20050
Conversation
Summary of Changes
Hello @fhl2000, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refines the full CUDA graph implementation within the system, enhancing performance and broadening compatibility with modern attention backends like FlashAttention 2 and FlashInfer. The core change involves capturing the full CUDA graph externally to the piecewise FX graph, which mitigates CPU overhead and allows for specialized graph handling for different attention stages (prefill-decode vs. pure decode). Additionally, it includes crucial bug fixes related to Marlin kernel compilation for specific GPU architectures and Pydantic type hinting.
Highlights
- Enhanced Full CUDA Graph Implementation: A new strategy is introduced where the full CUDA graph is captured outside the piecewise compiled FX graph, allowing for separate CUDA graph sets for pure decode and mixed prefill-decode stages. This design aims to reduce CPU overhead during eager execution and offers greater flexibility compared to previous flattened graph approaches.
- Improved Attention Backend Compatibility: Full CUDA graph support is extended to FlashAttention 2 (FA2) and FlashInfer. For FA2, separate CUDA graph captures are enabled for both prefill-decode and pure decode attention routines. For FlashInfer, persistent buffers and per-batch-size decode wrappers are utilized for pure decode stages, with a fallback to piecewise CUDA graphs for mixed prefill-decode.
- Marlin Kernel Compilation Fix: Resolved an issue where Marlin kernels were incorrectly compiled for incompatible GPU architectures (e.g., 8.7 instead of 8.9 for RTX 4090), preventing 'RuntimeError: CUDA error: no kernel image is available for execution on the device' during execution.
- Pydantic Type Hint Correction: Addressed Pydantic type checking errors by replacing `list[int]` with `List[int]` in various MoE-related files for improved type compatibility.
- Refined CUDA Graph Control Flags: Introduced `separate_attention_routine` in `CompilationConfig` to enable distinct attention routines for full CUDA graph capturing. The `ForwardContext` now uses `skip_attention_cuda_graphs` and `is_pure_decoding` to precisely control when and how CUDA graphs are applied, allowing for more granular optimization.
Code Review
This pull request significantly enhances vLLM's CUDA graph capabilities by introducing a new full CUDA graph implementation that wraps piecewise FX graphs, aiming to reduce CPU overhead and improve flexibility. It specifically adds support for FlashAttention v2 (FA2) and FlashInfer by enabling distinct CUDA graph captures for mixed prefill-decode and pure decode stages. Additionally, it includes important bug fixes for Pydantic type hint compatibility and Marlin kernel compilation on RTX 4090 GPUs. The changes improve correctness, efficiency, and maintainability of the codebase.
```diff
 for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes[start_idx:]),
                        desc="Capturing CUDA graphs (mix prefill-decode)",
                        total=len(self.cudagraph_batch_sizes)):
     for _ in range(
             self.compilation_config.cudagraph_num_of_warmups):
-        self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
-    self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
+        self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
+                        is_pure_decoding=False)
+    self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
+                    is_pure_decoding=False)
```
This loop now explicitly captures CUDA graphs for the "mix prefill-decode (general)" usage, using `capture_attn_cudagraph_general` and `is_pure_decoding=False`. This is a clear separation of concerns in the CUDA graph capturing process, aligning with the new `separate_attention_routine` flag.
Suggested change:

```python
# Capture the mix prefill-decode (general usage) cudagraphs
for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes[start_idx:]),
                       desc="Capturing CUDA graphs (mix prefill-decode)",
                       total=len(self.cudagraph_batch_sizes)):
    for _ in range(
            self.compilation_config.cudagraph_num_of_warmups):
        self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
                        is_pure_decoding=False)
    self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
                    is_pure_decoding=False)
```
```python
@classmethod
def get_fullgraph_wrapper_cls(cls) -> str:
    return "vllm.compilation.cuda_piecewise_backend.FullCudagraphWrapper"  # noqa
```
This new class method `get_fullgraph_wrapper_cls` is added to the `CudaPlatform`. It returns the string path to the `FullCudagraphWrapper` class, enabling the system to dynamically load the correct full CUDA graph wrapper implementation for CUDA devices. This is essential for the new full CUDA graph strategy.
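For context, here is a hedged sketch of how such a dotted class path is typically resolved at runtime in vLLM; `resolve_obj_by_qualname` exists in `vllm.utils`, but the exact call site for the fullgraph wrapper in this PR is an assumption.

```python
# Sketch only: turning the qualified class-name string returned above into a class.
# The actual wiring inside the compilation backend may differ.
from vllm.platforms import current_platform
from vllm.utils import resolve_obj_by_qualname

wrapper_cls = resolve_obj_by_qualname(current_platform.get_fullgraph_wrapper_cls())
# On CUDA this resolves to FullCudagraphWrapper, which wraps the piecewise-compiled
# callable so the full cudagraph can be captured outside the FX graph.
```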
Purpose
1. This PR introduces a new implementation for full CUDA graph and adds support for FA2 and FlashInfer.
Previous limitations
The original design in PR #16072 sets `compilation_config.splitting_ops` to an empty list and captures the full cudagraph inside the flattened fx graph, which supports FA3 only. The later PR #18581 adds full cudagraph support for FlashMLA, but it only captures the pure decode stage and bypasses the mixed prefill-decode stages, i.e., it runs the eager code of the compiled flattened fx graph in those stages. However, the profiling results below show that this flattened graph has performance issues when called eagerly: it is about 2x slower on the CPU side than running the compiled piecewise fx graph (possibly a Python-level issue). This can lead to performance degradation when the prefill stage has a small batch size. Also, attention backends like FA2, FlashInfer, and FlashMLA have two distinct attention routines for the prefill-decode and pure decode stages, which makes it difficult to cover both in one unified graph while keeping only a single set of captured cudagraphs.
Solution of this PR.
The new approach keeps the piecewise compiled fx graph structure overall, but captures the full cudagraph outside the fx graph via a wrapper. With this in place, we can dispatch to two sets of cudagraphs. For the pure decode stage, we use full cudagraphs directly, since this stage is compatible with most attention backends. For mixed prefill-decode stages, we either fall back to piecewise cudagraphs for incompatible routines in backends like FlashMLA and FlashInfer, or use another set of full cudagraphs for compatible backends (varlen support in FA2).
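A minimal sketch of the dispatch idea (not the actual `FullCudagraphWrapper` implementation; class and argument names here are illustrative):

```python
import torch

class FullGraphDispatcher:
    """Sketch only: dispatch between full cudagraphs and the piecewise fallback."""

    def __init__(self, piecewise_runnable, capture_mixed: bool):
        self.runnable = piecewise_runnable   # piecewise cudagraphs live inside this callable
        self.capture_mixed = capture_mixed   # e.g. True for FA2 varlen, False for FlashInfer/FlashMLA
        self.graphs = {}                     # (num_tokens, pure_decode) -> (graph, static output)

    def __call__(self, inputs, num_tokens: int, pure_decode: bool):
        if not pure_decode and not self.capture_mixed:
            # Mixed prefill-decode with an incompatible backend routine:
            # fall back to the piecewise cudagraphs inside the compiled fx graph.
            return self.runnable(inputs)
        key = (num_tokens, pure_decode)
        if key not in self.graphs:
            # First time at this key: capture a full graph around the piecewise
            # fx graph (warmup runs and stream handling omitted for brevity).
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                output = self.runnable(inputs)
            self.graphs[key] = (graph, output)
            return output
        graph, output = self.graphs[key]
        # Inputs must already be written into the persistent buffers captured above.
        graph.replay()
        return output
```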
Note that keeping the piecewise compiled fx graph is better for reducing CPU overhead than a full but flattened one, even if we do not capture the mixed prefill-decode stage. It also keeps the flexibility to switch between full and piecewise cudagraphs for future extensions, for example a seamless fallback to piecewise cudagraphs when cascade attention is needed.
The limitation is increased startup time and additional GPU memory for the extra cudagraph captures. This could be mitigated by shrinking the list of batch sizes captured for the prefill-decode stage, as sketched below.
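One possible mitigation, sketched with vLLM's existing `cudagraph_capture_sizes` knob; restricting it only for the prefill-decode set is an idea rather than something this PR implements:

```python
# Sketch: fewer capture sizes means less capture time and GPU memory, at the cost
# of more padding or piecewise fallback for uncaptured sizes. full_cuda_graph and
# separate_attention_routine are the flags used/introduced by this PR.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    compilation_config={
        "full_cuda_graph": True,
        "separate_attention_routine": True,
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128],
    },
)
```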
Profile of the compiled flattened fx graph (eager execution, mixed prefill-decode stage):
It takes roughly 56 ms to fully launch the model, with an additional 5 ms of latency spent on safety checks before the first kernel is launched. It seems Python is slow at executing a flattened, large module without submodules.
Note: the only way to use the flattened fx graph in this PR is to hardcode `splitting_ops = []` in `set_splitting_ops_for_v1` (around line 4200 in vllm/config.py).
Profile of the compiled piecewise fx graph (eager execution, mixed prefill-decode stage):
It takes 28 ms to fully launch, and the latency above almost disappears; in fact, it is hidden inside each submodule.
The patterns above were verified on two different machines (the GPU difference can be ignored, since this is purely a CPU-side effect), tested on Qwen2.5-7B-Instruct-GPTQ-Int4 while profiling benchmark_serving (ShareGPT, unlimited request rate).
So, if a prefill batch size is slightly larger than the max capture size (say 512) but not too large, the lower bound of model forward time is likely set by the CPU side: around 56 ms when running the flattened graph versus 28 ms for the piecewise one.
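For reference, a minimal sketch of the kind of CPU-side profiling behind the 56 ms vs. 28 ms comparison; the toy module only stands in for vLLM's model forward so the snippet is self-contained:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for one eager forward pass of the (flattened or piecewise) graph.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# The Chrome trace shows per-op CPU launch latency, which is where the flattened
# graph's extra CPU overhead shows up compared to the piecewise graph.
prof.export_chrome_trace("eager_forward_trace.json")
```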
Details for supporting FA2:
The previous code did not distinguish the two routines inside FA2: it launches a standard varlen fwd kernel on mixed prefill-decode batches, and a separate routine on pure decode batches that includes an optimization for GQA/MQA and potential flash-decode kernels (split_kv > 1). By setting `max_query_len = 1` or `> 1` during the CUDA graph capturing phase, we activate the desired attention routine so that it is captured correctly. (Strictly speaking, the prefill-decode kernel is of course compatible with pure decode, just not fully optimized for the decode phase. The actual reason PR #16072 did not support FA2 is a bug where `seq_lens` was a zero tensor in `dummy_run` in the earlier code, which bypassed launching any attention kernel during capture and led to zero-tensor outputs.)
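A hedged sketch of this routine-selection idea at capture time; the helper below is illustrative only and not vLLM's actual attention metadata builder:

```python
# Illustrative only: how max_query_len steers FA2 between its two routines when
# capturing cudagraphs. Real capture goes through vLLM's attention metadata builders.
def dummy_capture_metadata(num_reqs: int, is_pure_decoding: bool) -> dict:
    # Pure decode: one query token per request, so max_query_len == 1 activates the
    # decode-optimized routine (GQA/MQA packing, possible flash-decode split_kv > 1).
    # Mixed prefill-decode: max_query_len > 1 keeps the general varlen kernel.
    max_query_len = 1 if is_pure_decoding else 64
    # seq_lens must be non-zero; a zero tensor here was the FA2 bug in PR #16072 that
    # skipped launching any attention kernel during capture.
    seq_lens = [max(1, max_query_len)] * num_reqs
    return {"max_query_len": max_query_len, "seq_lens": seq_lens}
```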
Details for supporting FlashInfer:
For the pure decode stage, persistent buffers and per-batch-size decode wrappers are used so the decode graphs can be captured and replayed safely; mixed prefill-decode batches fall back to piecewise cudagraphs.
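A hedged sketch of the "per-batch-size decode wrappers with persistent buffers" idea using FlashInfer's public API; buffer sizes and variable names are illustrative, not vLLM's actual implementation:

```python
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
max_num_pages = 8192

decode_wrappers = {}
for batch_size in (1, 2, 4, 8):  # illustrative capture sizes for pure decode
    decode_wrappers[batch_size] = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace,
        "NHD",
        use_cuda_graph=True,
        # Persistent buffers: the captured graph reads from these fixed addresses,
        # so each decode step only copies fresh metadata into them before replay.
        paged_kv_indptr_buffer=torch.zeros(batch_size + 1, dtype=torch.int32, device="cuda"),
        paged_kv_indices_buffer=torch.zeros(max_num_pages, dtype=torch.int32, device="cuda"),
        paged_kv_last_page_len_buffer=torch.zeros(batch_size, dtype=torch.int32, device="cuda"),
    )
# Mixed prefill-decode batches bypass these wrappers and fall back to the
# piecewise cudagraphs, as described above.
```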
Launching command examples (same as the benchmark serving commands in the Test Result section below):
For FA2: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true, "separate_attention_routine": true}'
For FlashInfer: VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true, "separate_attention_routine": true}'
Others:
- FlashMLA: the compilation-config is `'{"full_cuda_graph": true, "separate_attention_routine": true}'`
- FA3: set the env var `VLLM_FLASH_ATTN_VERSION=3` and use the compilation-config `'{"full_cuda_graph": true}'`
2. Two minor fixes include:
(a) Pydantic raises type checking errors for `list[int]` and `Optional[list[int]]`. Replacing them with `List[int]` and `Optional[List[int]]` fixes this.
(b) When compiling from source on an RTX 4090, incorrect Marlin kernels were compiled for arch 8.7, which is incompatible with the 4090 (arch 8.9). This raises "RuntimeError: CUDA error: no kernel image is available for execution on the device" when the Marlin-related kernels are used. This PR fixes this problem; see #18835 for similar issues.
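A quick way to check which compute capability your GPU actually reports when diagnosing this kind of "no kernel image" error (standard PyTorch API, unrelated to the fix itself):

```python
import torch

# An RTX 4090 reports (8, 9); if the kernels were built only for 8.7, Marlin kernels
# fail with "no kernel image is available for execution on the device".
print(torch.cuda.get_device_capability(0))
```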
Test Plan
Benchmark serving and lm_eval performance of FA2 and FlashInfer.
I have no plan to test FlashMLA and FA3 since I have no Hopper GPU at hand, but they should be fine, as the current design is compatible with them. It would be very nice if somebody could help test them.
Test Result
Summary of results
Output token throughput is improved by 5% for FA2 and 2% for FlashInfer on Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4. **TPOT is reduced by 2.9% and 3.1%, respectively.** The lm_eval results are unchanged for both.
Details
Machine: A100 40G, torch 2.6, CUDA 12.4
Benchmark serving command:
FA2 benchmark serving:
piecewise cudagraph before this PR
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9
full cudagraph + piecewise fx graph in this PR
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true,"separate_attention_routine": true}'
FA2 lm_eval
piecewise cudagraph before this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
full cudagraph + piecewise fx graph after this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9, 'compilation_config': {'full_cuda_graph': True, 'separate_attention_routine': True}}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
FlashInfer benchmark serving
piecewise cudagraph before this PR
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9
full cudagraph + piecewise fx graph after this PR
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true,"separate_attention_routine": true}'
FlashInfer lm_eval
piecewise cudagraph before this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
full cudagraph + piecewise fx graph after this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9, 'compilation_config': {'full_cuda_graph': True, 'separate_attention_routine': True}}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
One more thing: after merging recent code from the main branch, I ran into a potential deadlock while testing this PR. This appears to be caused by earlier merged code, and the draft PR #19927 seems to solve the problem.