
Conversation

@adabeyta (Contributor) commented Oct 9, 2025

Purpose

Refactors query quantization by moving it out of the Triton and FlashInfer attention backends and into the attention layer; resolves feature request #25584.
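
At a high level, the backend advertises whether it can accept an already-quantized query, and the layer performs the quantization once before dispatching. The following is a minimal sketch of that shape (hypothetical names and signatures, not the actual vLLM code):

import torch

# Minimal sketch of the refactor (hypothetical names, not the actual vLLM API).
class AttentionLayerSketch:
    def __init__(self, impl, q_scale: float, kv_cache_dtype: str):
        self.impl = impl                    # backend-specific attention impl
        self._q_scale = q_scale             # static query scale
        self.kv_cache_dtype = kv_cache_dtype

    def forward(self, query, key, value, kv_cache, attn_metadata):
        # Quantize the query at the layer level so torch.compile can fuse it
        # with preceding ops (e.g. RoPE) instead of hiding it inside each
        # backend's forward.
        if (self.kv_cache_dtype.startswith("fp8")
                and self.impl.supports_quant_query_input):
            query = (query / self._q_scale).to(torch.float8_e4m3fn)
        return self.impl.forward(query, key, value, kv_cache, attn_metadata)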

Test Plan

Spin up server:

FlashInfer:

VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --compilation-config '{"compile_sizes": [1,2,4,8], "cudagraph_capture_sizes": [1,2,4,8], "cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --no-enable-prefix-caching

Triton:

VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --compilation-config '{"compile_sizes": [1,2,4,8], "cudagraph_capture_sizes": [1,2,4,8], "cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --no-enable-prefix-caching

Benchmark:

vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct  \
    --dataset-name sonnet \
    --dataset-path vllm/benchmarks/sonnet.txt \
    --sonnet-input-len 1000 \
    --sonnet-output-len 200 \
    --port 8000 \
    --num-prompts 20 \
    --max-concurrency 1

Accuracy

To ensure there is no accidental accuracy degradation, we also run the following for FlashInfer and Triton with kv_cache_dtype in {auto, fp8}, both on this PR and on main. For the FP8 variants we additionally run without enforce_eager=True (i.e., with compilation enabled).

lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=auto,tensor_parallel_size=1,enforce_eager=True \
  --tasks gsm8k \
  --batch_size 
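
The same invocation is repeated across the backend/dtype matrix; a small driver like the following can automate the sweep (an illustrative sketch, not part of the PR; --batch_size auto is an assumed value):

import itertools
import os
import subprocess

# Illustrative sweep over the accuracy matrix; not part of the PR.
for backend, dtype in itertools.product(["FLASHINFER", "TRITON_ATTN"],
                                        ["auto", "fp8"]):
    env = {**os.environ, "VLLM_ATTENTION_BACKEND": backend}
    model_args = ("pretrained=meta-llama/Llama-3.1-8B-Instruct,"
                  f"kv_cache_dtype={dtype},tensor_parallel_size=1,"
                  "enforce_eager=True")
    subprocess.run(
        ["lm_eval", "--model", "vllm", "--model_args", model_args,
         "--tasks", "gsm8k", "--batch_size", "auto"],  # batch size assumed
        env=env, check=True)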

Test Results

PR + FlashInfer
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  16.81     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.19      
Output token throughput (tok/s):         237.97    
Peak output token throughput (tok/s):    240.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1323.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          20.13     
Median TTFT (ms):                        19.97     
P99 TTFT (ms):                           23.20     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.12      
Median TPOT (ms):                        4.13      
P99 TPOT (ms):                           4.15      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.12      
Median ITL (ms):                         4.12      
P99 ITL (ms):                            4.61      
==================================================

PR + Triton
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  17.23     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.16      
Output token throughput (tok/s):         232.19    
Peak output token throughput (tok/s):    235.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1291.45   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.86     
Median TTFT (ms):                        24.77     
P99 TTFT (ms):                           31.37     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.20      
Median TPOT (ms):                        4.20      
P99 TPOT (ms):                           4.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.20      
Median ITL (ms):                         4.20      
P99 ITL (ms):                            4.58      
==================================================

Main + FlashInfer
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  18.15     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.10      
Output token throughput (tok/s):         220.42    
Peak output token throughput (tok/s):    223.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1225.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.18     
Median TTFT (ms):                        24.08     
P99 TTFT (ms):                           29.89     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.44      
Median TPOT (ms):                        4.44      
P99 TPOT (ms):                           4.46      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.44      
Median ITL (ms):                         4.43      
P99 ITL (ms):                            5.01      
==================================================

Main + Triton
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  17.42     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.15      
Output token throughput (tok/s):         229.59    
Peak output token throughput (tok/s):    234.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1276.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          27.61     
Median TTFT (ms):                        26.97     
P99 TTFT (ms):                           37.66     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.24      
Median TPOT (ms):                        4.24      
P99 TPOT (ms):                           4.32      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.24      
Median ITL (ms):                         4.23      
P99 ITL (ms):                            4.94      
==================================================

Accuracy on GSM8k

Accuracy on this PR matches main vLLM within the reported standard error.

PR + Triton + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7817|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|

PR + FlashInfer + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7801|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7551|±  |0.0118|


Main + FlashInfer + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7854|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7635|±  |0.0118|


Main + Triton + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7817|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|


PR + Triton + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7771|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7521|±  |0.0119|

PR + FlashInfer + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.7483|±  |0.0120|

Main + FlashInfer + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7483|±  |0.0118|


Main + Triton + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7771|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7521|±  |0.0118|


PR + Triton + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7695|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.7460|±  |0.0120|


PR + FlashInfer + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7756|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|

Main + FlashInfer + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7680|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7437|±  |0.0118|


Main + Triton + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7672|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7468|±  |0.0118|



mergify bot commented Oct 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @adabeyta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the query quantization logic for the Flashinfer and Triton attention backends, moving it from the backend implementation to the higher-level attention layer. This is a positive change for code structure and enables potential compiler fusions. While the changes for the Flashinfer backend appear correct, the removal of a critical assertion for the Triton backend is concerning. This assertion enforced that the query quantization scale must be 1.0, a limitation of the Triton kernel. Its removal could lead to silent correctness issues if not handled in the new quantization logic. I have added a critical review comment to highlight this potential issue.
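
One way to preserve that guard after the refactor is to assert at the layer level for backends that cannot accept a pre-quantized query; a minimal sketch, assuming a hypothetical helper (not the code in this PR):

import torch

# Sketch of layer-level gating that preserves the Triton constraint
# (hypothetical helper; not the actual code in this PR).
def maybe_quantize_query(impl, query: torch.Tensor, q_scale: float):
    if not impl.supports_quant_query_input:
        # Kernels that assume an unquantized query with unit scale (as the
        # Triton backend did behind the removed assertion) must still reject
        # any other scale rather than silently producing wrong results.
        assert q_scale == 1.0, "A non 1.0 q_scale is not currently supported."
        return query
    return (query / q_scale).to(torch.float8_e4m3fn)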

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: adabeyta <aabeyta@redhat.com>

…input dynamic for FlashInfer

Signed-off-by: adabeyta <aabeyta@redhat.com>
@@ -157,6 +144,11 @@ def trtllm_prefill_attn_kvfp8_dequant(
class FlashInferBackend(AttentionBackend):
accept_output_buffer: bool = True

@property
def supports_quant_query_input(self) -> bool:
return supports_trtllm_attention(
Contributor:

You may need to rebase or merge main and resolve the import issue

Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
mergify bot removed the needs-rebase label Oct 12, 2025
# which causes decoding overheads
assert self.kv_cache_dtype in {"fp8", "fp8_e4m3"}
query, _ = self.query_quant(query, self._q_scale)
if not hasattr(
Collaborator:

I don't think this will work; attention metadata is not set during the profile run when we compile. Instead, we should have a more robust way of checking, likely by calling supports_quant_query_input on the AttentionImpl object
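
In other words, roughly the following (a sketch of the suggested check, with hypothetical surrounding code):

# Sketch of the suggested check: consult the impl's capability flag instead
# of attention metadata, which is not set during the compile-time profile run.
def quantize_query_if_supported(layer, query):
    if layer.impl.supports_quant_query_input:
        query, _ = layer.query_quant(query, layer._q_scale)
    return query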

query = query.reshape((num_tokens, num_heads, head_size))
"A non 1.0 q_scale is not currently supported.")

# Query quantization is now handled in the attention layer
Collaborator:

No need for this comment, just remove

@ProExpertProg added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Oct 14, 2025
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
@pavanimajety (Collaborator):

@adabeyta Any analysis on why we are seeing lower toks/sec with enhanced fusion? Even without a custom kernel, the fact that rope + Quant can be jitted to a triton kernel should give us slightly higher perf, correct?
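
For intuition, the expectation is that once the quant op is visible to the compiler, a pattern like the following can be lowered into a single fused Triton kernel (a toy sketch, not vLLM code; shapes and RoPE layout are assumptions):

import torch

@torch.compile
def rope_then_quant(q, cos, sin, scale):
    # Toy rotary embedding followed by fp8 quantization; with both ops in
    # the compiled graph, Inductor can emit one fused kernel instead of
    # materializing the rotated query before quantizing it.
    q1, q2 = q.chunk(2, dim=-1)
    q_rot = torch.cat((q1 * cos - q2 * sin, q1 * sin + q2 * cos), dim=-1)
    return (q_rot / scale).to(torch.float8_e4m3fn)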

@ProExpertProg (Collaborator):

@adabeyta test failure looks related, it's possible this change breaks the fusion test. Could you run locally to check? Also worth running a model E2E to make sure fusion happens E2E (e2e tests coming soon in #24604)

@ProExpertProg enabled auto-merge (squash) October 15, 2025 17:23
@ProExpertProg (Collaborator) left a comment:

I see now we actually lost performance with this; we should make sure we gain and not lose performance.

@ProExpertProg (Collaborator) left a comment:

Wow, those are some insane numbers... good work!

@ProExpertProg merged commit 0a9ef0c into vllm-project:main Oct 15, 2025
51 checks passed
@adabeyta (Contributor, Author):

> @adabeyta Any analysis on why we are seeing lower toks/sec with enhanced fusion? Even without a custom kernel, the fact that rope + Quant can be jitted to a triton kernel should give us slightly higher perf, correct?

@pavanimajety Updated with new perf numbers. We're seeing better performance across both Triton and FlashInfer backends (up to 8% throughput improvement). The earlier regression was from an intermediate commit before the gating logic was added.

@pavanimajety (Collaborator):

Great work, thanks for the update!
