
Conversation

@adabeyta (Contributor) commented Oct 9, 2025

Purpose

Refactors query quantization by moving it out of the Triton and FlashInfer attention backends and into the attention layer; resolves feature request #25584.
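
At a high level, the backend advertises whether it can accept an already-quantized query, and the layer performs the quantization once before dispatching. The following is a minimal sketch of that shape (hypothetical names and signatures, not the actual vLLM code):

import torch

# Minimal sketch of the refactor (hypothetical names, not the actual vLLM API).
class AttentionLayerSketch:
    def __init__(self, impl, q_scale: float, kv_cache_dtype: str):
        self.impl = impl                    # backend-specific attention impl
        self._q_scale = q_scale             # static query scale
        self.kv_cache_dtype = kv_cache_dtype

    def forward(self, query, key, value, kv_cache, attn_metadata):
        # Quantize the query at the layer level so torch.compile can fuse it
        # with preceding ops (e.g. RoPE) instead of hiding it inside each
        # backend's forward.
        if (self.kv_cache_dtype.startswith("fp8")
                and self.impl.supports_quant_query_input):
            query = (query / self._q_scale).to(torch.float8_e4m3fn)
        return self.impl.forward(query, key, value, kv_cache, attn_metadata)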

Test Plan

Spin up server:

FlashInfer:

VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --compilation-config '{"compile_sizes": [1,2,4,8], "cudagraph_capture_sizes": [1,2,4,8], "cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --no-enable-prefix-caching

Triton:

VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --compilation-config '{"compile_sizes": [1,2,4,8], "cudagraph_capture_sizes": [1,2,4,8], "cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --no-enable-prefix-caching

Benchmark:

vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct  \
    --dataset-name sonnet \
    --dataset-path vllm/benchmarks/sonnet.txt \
    --sonnet-input-len 1000 \
    --sonnet-output-len 200 \
    --port 8000 \
    --num-prompts 20 \
    --max-concurrency 1

Accuracy

To ensure there is no accidental accuracy degradation, we also run the following for FlashInfer and Triton with kv_cache_dtype in {auto, fp8}, both on this PR and on main. For the FP8 variants we additionally run without enforce_eager=True (i.e., with compilation enabled).

lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=auto,tensor_parallel_size=1,enforce_eager=True \
  --tasks gsm8k \
  --batch_size 
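
The same invocation is repeated across the backend/dtype matrix; a small driver like the following can automate the sweep (an illustrative sketch, not part of the PR; --batch_size auto is an assumed value):

import itertools
import os
import subprocess

# Illustrative sweep over the accuracy matrix; not part of the PR.
for backend, dtype in itertools.product(["FLASHINFER", "TRITON_ATTN"],
                                        ["auto", "fp8"]):
    env = {**os.environ, "VLLM_ATTENTION_BACKEND": backend}
    model_args = ("pretrained=meta-llama/Llama-3.1-8B-Instruct,"
                  f"kv_cache_dtype={dtype},tensor_parallel_size=1,"
                  "enforce_eager=True")
    subprocess.run(
        ["lm_eval", "--model", "vllm", "--model_args", model_args,
         "--tasks", "gsm8k", "--batch_size", "auto"],  # batch size assumed
        env=env, check=True)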

Test Results

PR + FlashInfer
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  16.81     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.19      
Output token throughput (tok/s):         237.97    
Peak output token throughput (tok/s):    240.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1323.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          20.13     
Median TTFT (ms):                        19.97     
P99 TTFT (ms):                           23.20     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.12      
Median TPOT (ms):                        4.13      
P99 TPOT (ms):                           4.15      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.12      
Median ITL (ms):                         4.12      
P99 ITL (ms):                            4.61      
==================================================

PR + Triton
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  17.23     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.16      
Output token throughput (tok/s):         232.19    
Peak output token throughput (tok/s):    235.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1291.45   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.86     
Median TTFT (ms):                        24.77     
P99 TTFT (ms):                           31.37     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.20      
Median TPOT (ms):                        4.20      
P99 TPOT (ms):                           4.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.20      
Median ITL (ms):                         4.20      
P99 ITL (ms):                            4.58      
==================================================

Main + FlashInfer
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  18.15     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.10      
Output token throughput (tok/s):         220.42    
Peak output token throughput (tok/s):    223.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1225.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.18     
Median TTFT (ms):                        24.08     
P99 TTFT (ms):                           29.89     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.44      
Median TPOT (ms):                        4.44      
P99 TPOT (ms):                           4.46      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.44      
Median ITL (ms):                         4.43      
P99 ITL (ms):                            5.01      
==================================================

Main + Triton
============ Serving Benchmark Result ============
Successful requests:                     20        
Maximum request concurrency:             1         
Benchmark duration (s):                  17.42     
Total input tokens:                      18248     
Total generated tokens:                  4000      
Request throughput (req/s):              1.15      
Output token throughput (tok/s):         229.59    
Peak output token throughput (tok/s):    234.00    
Peak concurrent requests:                3.00      
Total Token throughput (tok/s):          1276.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          27.61     
Median TTFT (ms):                        26.97     
P99 TTFT (ms):                           37.66     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.24      
Median TPOT (ms):                        4.24      
P99 TPOT (ms):                           4.32      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.24      
Median ITL (ms):                         4.23      
P99 ITL (ms):                            4.94      
==================================================

Accuracy on GSM8k

Accuracy on this PR matches main vLLM within the reported standard error.

PR + Triton + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7817|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|

PR + FlashInfer + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7801|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7551|±  |0.0118|


Main + FlashInfer + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7854|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7635|±  |0.0118|


Main + Triton + auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7817|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|


PR + Triton + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7771|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7521|±  |0.0119|

PR + FlashInfer + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.7483|±  |0.0120|

Main + FlashInfer + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7483|±  |0.0118|


Main + Triton + fp8 (enforce-eager)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7771|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7521|±  |0.0118|


PR + Triton + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7695|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.7460|±  |0.0120|


PR + FlashInfer + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7756|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|

Main + FlashInfer + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7680|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7437|±  |0.0118|


Main + Triton + fp8 (w/ compile)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7672|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7468|±  |0.0118|



mergify bot commented Oct 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @adabeyta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the query quantization logic for the Flashinfer and Triton attention backends, moving it from the backend implementation to the higher-level attention layer. This is a positive change for code structure and enables potential compiler fusions. While the changes for the Flashinfer backend appear correct, the removal of a critical assertion for the Triton backend is concerning. This assertion enforced that the query quantization scale must be 1.0, a limitation of the Triton kernel. Its removal could lead to silent correctness issues if not handled in the new quantization logic. I have added a critical review comment to highlight this potential issue.
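
One way to preserve that guard after the refactor is to assert at the layer level for backends that cannot accept a pre-quantized query; a minimal sketch, assuming a hypothetical helper (not the code in this PR):

import torch

# Sketch of layer-level gating that preserves the Triton constraint
# (hypothetical helper; not the actual code in this PR).
def maybe_quantize_query(impl, query: torch.Tensor, q_scale: float):
    if not impl.supports_quant_query_input:
        # Kernels that assume an unquantized query with unit scale (as the
        # Triton backend did behind the removed assertion) must still reject
        # any other scale rather than silently producing wrong results.
        assert q_scale == 1.0, "A non 1.0 q_scale is not currently supported."
        return query
    return (query / q_scale).to(torch.float8_e4m3fn)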

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: adabeyta <aabeyta@redhat.com>

…input dynamic for FlashInfer

Signed-off-by: adabeyta <aabeyta@redhat.com>
@@ -157,6 +144,11 @@ def trtllm_prefill_attn_kvfp8_dequant(
class FlashInferBackend(AttentionBackend):
accept_output_buffer: bool = True

@property
def supports_quant_query_input(self) -> bool:
return supports_trtllm_attention(
Contributor:

You may need to rebase or merge main and resolve the import issue

Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
mergify bot removed the needs-rebase label Oct 12, 2025
# which causes decoding overheads
assert self.kv_cache_dtype in {"fp8", "fp8_e4m3"}
query, _ = self.query_quant(query, self._q_scale)
if not hasattr(
Collaborator:

I don't think this will work; attention metadata is not set during the profile run when we compile. Instead, we should have a more robust way of checking, likely by calling supports_quant_query_input on the AttentionImpl object
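
In other words, roughly the following (a sketch of the suggested check, with hypothetical surrounding code):

# Sketch of the suggested check: consult the impl's capability flag instead
# of attention metadata, which is not set during the compile-time profile run.
def quantize_query_if_supported(layer, query):
    if layer.impl.supports_quant_query_input:
        query, _ = layer.query_quant(query, layer._q_scale)
    return query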

query = query.reshape((num_tokens, num_heads, head_size))
"A non 1.0 q_scale is not currently supported.")

# Query quantization is now handled in the attention layer
Collaborator:

No need for this comment, just remove

@ProExpertProg added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Oct 14, 2025
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
@pavanimajety (Collaborator):

@adabeyta Any analysis on why we are seeing lower toks/sec with enhanced fusion? Even without a custom kernel, the fact that rope + Quant can be jitted to a triton kernel should give us slightly higher perf, correct?
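
For intuition, the expectation is that once the quant op is visible to the compiler, a pattern like the following can be lowered into a single fused Triton kernel (a toy sketch, not vLLM code; shapes and RoPE layout are assumptions):

import torch

@torch.compile
def rope_then_quant(q, cos, sin, scale):
    # Toy rotary embedding followed by fp8 quantization; with both ops in
    # the compiled graph, Inductor can emit one fused kernel instead of
    # materializing the rotated query before quantizing it.
    q1, q2 = q.chunk(2, dim=-1)
    q_rot = torch.cat((q1 * cos - q2 * sin, q1 * sin + q2 * cos), dim=-1)
    return (q_rot / scale).to(torch.float8_e4m3fn)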

@ProExpertProg (Collaborator):

@adabeyta test failure looks related, it's possible this change breaks the fusion test. Could you run locally to check? Also worth running a model E2E to make sure fusion happens E2E (e2e tests coming soon in #24604)

@ProExpertProg enabled auto-merge (squash) October 15, 2025 17:23
@ProExpertProg (Collaborator) left a comment:

I see now we actually lost performance with this; we should make sure we gain and not lose performance.

@ProExpertProg (Collaborator) left a comment:

Wow, those are some insane numbers... good work!

@ProExpertProg merged commit 0a9ef0c into vllm-project:main Oct 15, 2025
51 checks passed
@adabeyta (Contributor, Author):

> @adabeyta Any analysis on why we are seeing lower toks/sec with enhanced fusion? Even without a custom kernel, the fact that rope + Quant can be jitted to a triton kernel should give us slightly higher perf, correct?

@pavanimajety Updated with new perf numbers. We're seeing better performance across both Triton and FlashInfer backends (up to 8% throughput improvement). The earlier regression was from an intermediate commit before the gating logic was added.

@pavanimajety (Collaborator):

Great work, thanks for the update!
