[Core][Bugfix] new way for full cudagraph, add support for FA2 and FlashInfer; Two minor bugs fixed #20050
Conversation
Summary of Changes
Hello @fhl2000, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refines the full CUDA graph implementation within the system, enhancing performance and broadening compatibility with modern attention backends like FlashAttention 2 and FlashInfer. The core change involves capturing the full CUDA graph externally to the piecewise FX graph, which mitigates CPU overhead and allows for specialized graph handling for different attention stages (prefill-decode vs. pure decode). Additionally, it includes crucial bug fixes related to Marlin kernel compilation for specific GPU architectures and Pydantic type hinting.
Highlights
- Enhanced Full CUDA Graph Implementation: A new strategy is introduced where the full CUDA graph is captured outside the piecewise compiled FX graph, allowing for separate CUDA graph sets for pure decode and mixed prefill-decode stages. This design aims to reduce CPU overhead during eager execution and offers greater flexibility compared to previous flattened graph approaches.
- Improved Attention Backend Compatibility: Full CUDA graph support is extended to FlashAttention 2 (FA2) and FlashInfer. For FA2, separate CUDA graph captures are enabled for both prefill-decode and pure decode attention routines. For FlashInfer, persistent buffers and per-batch-size decode wrappers are utilized for pure decode stages, with a fallback to piecewise CUDA graphs for mixed prefill-decode.
- Marlin Kernel Compilation Fix: Resolved an issue where Marlin kernels were incorrectly compiled for incompatible GPU architectures (e.g., 8.7 instead of 8.9 for RTX 4090), preventing 'RuntimeError: CUDA error: no kernel image is available for execution on the device' during execution.
- Pydantic Type Hint Correction: Addressed Pydantic type checking errors by replacing `list[int]` with `List[int]` in various MoE-related files for improved type compatibility.
- Refined CUDA Graph Control Flags: Introduced `separate_attention_routine` in `CompilationConfig` to enable distinct attention routines for full CUDA graph capturing. The `ForwardContext` now uses `skip_attention_cuda_graphs` and `is_pure_decoding` to precisely control when and how CUDA graphs are applied, allowing for more granular optimization.
Code Review
This pull request significantly enhances vLLM's CUDA graph capabilities by introducing a new full CUDA graph implementation that wraps piecewise FX graphs, aiming to reduce CPU overhead and improve flexibility. It specifically adds support for FlashAttention v2 (FA2) and FlashInfer by enabling distinct CUDA graph captures for mixed prefill-decode and pure decode stages. Additionally, it includes important bug fixes for Pydantic type hint compatibility and Marlin kernel compilation on RTX 4090 GPUs. The changes improve correctness, efficiency, and maintainability of the codebase.
```diff
 for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes[start_idx:]),
                        desc="Capturing CUDA graphs (mix prefill-decode)",
                        total=len(self.cudagraph_batch_sizes)):
     for _ in range(
             self.compilation_config.cudagraph_num_of_warmups):
-        self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
-    self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
+        self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
+                        is_pure_decoding=False)
+    self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
+                    is_pure_decoding=False)
```
This loop now explicitly captures CUDA graphs for the "mix prefill-decode (general)" usage, using `capture_attn_cudagraph_general` and `is_pure_decoding=False`. This is a clear separation of concerns in the CUDA graph capturing process, aligning with the new `separate_attention_routine` flag.
Suggested change:

```python
# Capture the mix prefill-decode (general usage) cudagraphs
for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes[start_idx:]),
                       desc="Capturing CUDA graphs (mix prefill-decode)",
                       total=len(self.cudagraph_batch_sizes)):
    for _ in range(
            self.compilation_config.cudagraph_num_of_warmups):
        self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
                        is_pure_decoding=False)
    self._dummy_run(num_tokens, capture_attn_cudagraph=capture_attn_cudagraph_general,
                    is_pure_decoding=False)
```
```python
@classmethod
def get_fullgraph_wrapper_cls(cls) -> str:
    return "vllm.compilation.cuda_piecewise_backend.FullCudagraphWrapper"  # noqa
```
This new class method `get_fullgraph_wrapper_cls` is added to the `CudaPlatform`. It returns the string path to the `FullCudagraphWrapper` class, enabling the system to dynamically load the correct full CUDA graph wrapper implementation for CUDA devices. This is essential for the new full CUDA graph strategy.
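For context, here is a hedged sketch of how such a dotted class path is typically resolved at runtime in vLLM; `resolve_obj_by_qualname` exists in `vllm.utils`, but the exact call site for the fullgraph wrapper in this PR is an assumption.

```python
# Sketch only: turning the qualified class-name string returned above into a class.
# The actual wiring inside the compilation backend may differ.
from vllm.platforms import current_platform
from vllm.utils import resolve_obj_by_qualname

wrapper_cls = resolve_obj_by_qualname(current_platform.get_fullgraph_wrapper_cls())
# On CUDA this resolves to FullCudagraphWrapper, which wraps the piecewise-compiled
# callable so the full cudagraph can be captured outside the FX graph.
```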
Purpose
1. This PR introduces a new implementation for full CUDA graph and adds support for FA2 and FlashInfer.
Previous limitations
The original design in PR #16072 sets `compilation_config.splitting_ops` to an empty list and captures the full cudagraph inside the flattened fx graph, which supports FA3 only. The later PR #18581 adds full cudagraph support for FlashMLA, but it only captures the pure decode stage and bypasses the mixed prefill-decode stages, i.e., it runs the eager code of the compiled flattened fx graph in those stages. However, the profiling results below show that this flattened graph has performance issues when called eagerly: it is about 2x slower on the CPU side than running the compiled piecewise fx graph (possibly a Python-level issue). This can lead to performance degradation when the prefill stage has a small batch size. Also, attention backends like FA2, FlashInfer, and FlashMLA have two distinct attention routines for the prefill-decode and pure decode stages, which makes it difficult to cover both in one unified graph while keeping only a single set of captured cudagraphs.
Solution of this PR.
The new approach keeps the piecewise compiled fx graph structure overall, but captures the full cudagraph outside the fx graph via a wrapper. With this in place, we can dispatch to two sets of cudagraphs. For the pure decode stage, we use full cudagraphs directly, since this stage is compatible with most attention backends. For mixed prefill-decode stages, we either fall back to piecewise cudagraphs for incompatible routines in backends like FlashMLA and FlashInfer, or use another set of full cudagraphs for compatible backends (varlen support in FA2).
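A minimal sketch of the dispatch idea (not the actual `FullCudagraphWrapper` implementation; class and argument names here are illustrative):

```python
import torch

class FullGraphDispatcher:
    """Sketch only: dispatch between full cudagraphs and the piecewise fallback."""

    def __init__(self, piecewise_runnable, capture_mixed: bool):
        self.runnable = piecewise_runnable   # piecewise cudagraphs live inside this callable
        self.capture_mixed = capture_mixed   # e.g. True for FA2 varlen, False for FlashInfer/FlashMLA
        self.graphs = {}                     # (num_tokens, pure_decode) -> (graph, static output)

    def __call__(self, inputs, num_tokens: int, pure_decode: bool):
        if not pure_decode and not self.capture_mixed:
            # Mixed prefill-decode with an incompatible backend routine:
            # fall back to the piecewise cudagraphs inside the compiled fx graph.
            return self.runnable(inputs)
        key = (num_tokens, pure_decode)
        if key not in self.graphs:
            # First time at this key: capture a full graph around the piecewise
            # fx graph (warmup runs and stream handling omitted for brevity).
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                output = self.runnable(inputs)
            self.graphs[key] = (graph, output)
            return output
        graph, output = self.graphs[key]
        # Inputs must already be written into the persistent buffers captured above.
        graph.replay()
        return output
```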
Note that keeping the piecewise compiled fx graph is better for reducing CPU overhead than a full but flattened one, even if we do not capture the mixed prefill-decode stage. It also keeps the flexibility to switch between full and piecewise cudagraphs for future extensions, for example a seamless fallback to piecewise cudagraphs when cascade attention is needed.
The limitation is increased startup time and additional GPU memory for the extra cudagraph captures. This could be mitigated by shrinking the list of batch sizes captured for the prefill-decode stage, as sketched below.
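One possible mitigation, sketched with vLLM's existing `cudagraph_capture_sizes` knob; restricting it only for the prefill-decode set is an idea rather than something this PR implements:

```python
# Sketch: fewer capture sizes means less capture time and GPU memory, at the cost
# of more padding or piecewise fallback for uncaptured sizes. full_cuda_graph and
# separate_attention_routine are the flags used/introduced by this PR.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    compilation_config={
        "full_cuda_graph": True,
        "separate_attention_routine": True,
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128],
    },
)
```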
Profile of the compiled flattened fx graph (eager execution, mixed prefill-decode stage):
It takes roughly 56 ms to fully launch the model, with an additional 5 ms of latency spent on safety checks before the first kernel is launched. It seems Python is slow at executing a flattened, large module without submodules.
Note: the only way to use the flattened fx graph in this PR is to hardcode `splitting_ops = []` in `set_splitting_ops_for_v1` (around line 4200 in vllm/config.py).
Profile of the compiled piecewise fx graph (eager execution, mixed prefill-decode stage):
It takes 28 ms to fully launch, and the latency above almost disappears; in fact, it is hidden inside each submodule.
The patterns above were verified on two different machines (the GPU difference can be ignored, since this is purely a CPU-side effect), tested on Qwen2.5-7B-Instruct-GPTQ-Int4 while profiling benchmark_serving (ShareGPT, unlimited request rate).
So, if a prefill batch size is slightly larger than the max capture size (say 512) but not too large, the lower bound of model forward time is likely set by the CPU side: around 56 ms when running the flattened graph versus 28 ms for the piecewise one.
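For reference, a minimal sketch of the kind of CPU-side profiling behind the 56 ms vs. 28 ms comparison; the toy module only stands in for vLLM's model forward so the snippet is self-contained:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for one eager forward pass of the (flattened or piecewise) graph.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# The Chrome trace shows per-op CPU launch latency, which is where the flattened
# graph's extra CPU overhead shows up compared to the piecewise graph.
prof.export_chrome_trace("eager_forward_trace.json")
```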
Details for supporting FA2:
The previous code did not distinguish the two routines inside FA2: it launches a standard varlen fwd kernel on mixed prefill-decode batches, and a separate routine on pure decode batches that includes an optimization for GQA/MQA and potential flash-decode kernels (split_kv > 1). By setting `max_query_len = 1` or `> 1` during the CUDA graph capturing phase, we activate the desired attention routine so that it is captured correctly. (Strictly speaking, the prefill-decode kernel is of course compatible with pure decode, just not fully optimized for the decode phase. The actual reason PR #16072 did not support FA2 is a bug where `seq_lens` was a zero tensor in `dummy_run` in the earlier code, which bypassed launching any attention kernel during capture and led to zero-tensor outputs.)
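A hedged sketch of this routine-selection idea at capture time; the helper below is illustrative only and not vLLM's actual attention metadata builder:

```python
# Illustrative only: how max_query_len steers FA2 between its two routines when
# capturing cudagraphs. Real capture goes through vLLM's attention metadata builders.
def dummy_capture_metadata(num_reqs: int, is_pure_decoding: bool) -> dict:
    # Pure decode: one query token per request, so max_query_len == 1 activates the
    # decode-optimized routine (GQA/MQA packing, possible flash-decode split_kv > 1).
    # Mixed prefill-decode: max_query_len > 1 keeps the general varlen kernel.
    max_query_len = 1 if is_pure_decoding else 64
    # seq_lens must be non-zero; a zero tensor here was the FA2 bug in PR #16072 that
    # skipped launching any attention kernel during capture.
    seq_lens = [max(1, max_query_len)] * num_reqs
    return {"max_query_len": max_query_len, "seq_lens": seq_lens}
```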
Details for supporting FlashInfer:
For the pure decode stage, persistent buffers and per-batch-size decode wrappers are used so the decode graphs can be captured and replayed safely; mixed prefill-decode batches fall back to piecewise cudagraphs.
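A hedged sketch of the "per-batch-size decode wrappers with persistent buffers" idea using FlashInfer's public API; buffer sizes and variable names are illustrative, not vLLM's actual implementation:

```python
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
max_num_pages = 8192

decode_wrappers = {}
for batch_size in (1, 2, 4, 8):  # illustrative capture sizes for pure decode
    decode_wrappers[batch_size] = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace,
        "NHD",
        use_cuda_graph=True,
        # Persistent buffers: the captured graph reads from these fixed addresses,
        # so each decode step only copies fresh metadata into them before replay.
        paged_kv_indptr_buffer=torch.zeros(batch_size + 1, dtype=torch.int32, device="cuda"),
        paged_kv_indices_buffer=torch.zeros(max_num_pages, dtype=torch.int32, device="cuda"),
        paged_kv_last_page_len_buffer=torch.zeros(batch_size, dtype=torch.int32, device="cuda"),
    )
# Mixed prefill-decode batches bypass these wrappers and fall back to the
# piecewise cudagraphs, as described above.
```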
Launching command examples (same as the benchmark serving commands in the Test Result section below):
For FA2: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true, "separate_attention_routine": true}'
For FlashInfer: VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true, "separate_attention_routine": true}'
Others:
- FlashMLA: the compilation-config is `'{"full_cuda_graph": true, "separate_attention_routine": true}'`
- FA3: set the env var `VLLM_FLASH_ATTN_VERSION=3` and use the compilation-config `'{"full_cuda_graph": true}'`
2. Two minor fixes include:
(a) Pydantic raises type checking errors for `list[int]` and `Optional[list[int]]`. Replacing them with `List[int]` and `Optional[List[int]]` fixes this.
(b) When compiling from source on an RTX 4090, incorrect Marlin kernels were compiled for arch 8.7, which is incompatible with the 4090 (arch 8.9). This raises "RuntimeError: CUDA error: no kernel image is available for execution on the device" when the Marlin-related kernels are used. This PR fixes this problem; see #18835 for similar issues.
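A quick way to check which compute capability your GPU actually reports when diagnosing this kind of "no kernel image" error (standard PyTorch API, unrelated to the fix itself):

```python
import torch

# An RTX 4090 reports (8, 9); if the kernels were built only for 8.7, Marlin kernels
# fail with "no kernel image is available for execution on the device".
print(torch.cuda.get_device_capability(0))
```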
Test Plan
Benchmark serving and lm_eval performance of FA2 and FlashInfer.
I have no plan to test FlashMLA and FA3 since I have no Hopper GPU at hand, but they should be fine, as the current design is compatible with them. It would be very nice if somebody could help test them.
Test Result
Summary of results
Output token throughput is improved by 5% for FA2 and 2% for FlashInfer on Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4. **TPOT is reduced by 2.9% and 3.1%, respectively.** The lm_eval results are unchanged for both.
Details
Machine: A100 40G, torch 2.6, CUDA 12.4
Benchmark serving command:
FA2 benchmark serving:
piecewise cudagraph before this PR
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9
full cudagraph + piecewise fx graph in this PR
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true,"separate_attention_routine": true}'
FA2 lm_eval
piecewise cudagraph before this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
full cudagraph + piecewise fx graph after this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9, 'compilation_config': {'full_cuda_graph': True, 'separate_attention_routine': True}}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
FlashInfer benchmark serving
piecewise cudagraph before this PR
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9
full cudagraph + piecewise fx graph after this PR
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.9 --compilation-config '{"full_cuda_graph": true,"separate_attention_routine": true}'
FlashInfer lm_eval
piecewise cudagraph before this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
full cudagraph + piecewise fx graph after this PR
vllm ({'pretrained': '/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4', 'gpu_memory_utilization': 0.9, 'compilation_config': {'full_cuda_graph': True, 'separate_attention_routine': True}}), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
One more thing: after merging recent code from the main branch, I ran into a potential deadlock while testing this PR. This appears to be caused by earlier merged code, and the draft PR #19927 seems to solve the problem.