Conversation

@gau-nernst (Contributor) commented Sep 10, 2025

Purpose

FlashInfer supports FP8 KV cache on GPUs that use the FA2 backend, e.g. sm80, sm89, and sm120. Currently I only have access to an sm120 GPU, so I can only confirm that it works there.

The Triton attention backend also supports FP8 KV cache.

Changes

Test Plan

# for FlashInfer
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen3-4B --kv-cache-dtype fp8

# for Triton
# block size must be at least 32 for FP8
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve Qwen/Qwen3-4B --kv-cache-dtype fp8 --block-size 32
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "user", "content": "Who are you"}
    ]
  }'
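
For completeness, the same configuration can also be exercised offline through vLLM's Python LLM API; the sketch below mirrors the serve commands above (it is not part of this PR's test plan, and assumes the kv_cache_dtype argument behaves the same way there).

# Offline sketch mirroring the serve commands above (assumption: the Python
# LLM API accepts the same kv_cache_dtype flag; not part of this PR's tests).
import os

# Pick the attention backend before vLLM is imported, same as the env var above.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B", kv_cache_dtype="fp8")  # store K/V in FP8
outputs = llm.generate(["Who are you"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)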

Test Result

{"id":"chatcmpl-b48c5ec42e2d476d80ca8c94e695428d","object":"chat.completion","created":1757499722,"model":"Qwen/Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user asked, \"Who are you?\" I need to respond in a friendly and informative way. Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and assisting with various tasks. It's important to highlight that I'm designed to be helpful and can adapt to different needs. Also, I should invite the user to ask questions or share what they need help with. Let me make sure the tone is approachable and not too technical. I'll check for any key points I might have missed, like the fact that I'm multilingual and can handle multiple tasks. Alright, that should cover it.\n</think>\n\nHello! I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create content, and assist with a variety of tasks. I am designed to be helpful and can adapt to different needs. Feel free to ask me anything or let me know what you need help with!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":221,"completion_tokens":210,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables FP8 KV cache for the FlashInfer backend on non-sm100 GPUs. The changes correctly restrict FP8 query quantization to GPUs supporting TRT-LLM attention. However, the check for FP8 KV cache support for FlashInfer is too permissive and does not verify the GPU's compute capability, which could lead to runtime errors on unsupported hardware. I've provided a suggestion to add the necessary capability check.
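
For illustration only, a capability gate along the lines suggested could look like the sketch below; the helper name and the (8, 0) threshold are assumptions made for this example, not the actual suggestion attached to the review.

# Hypothetical capability check (illustrative; not the bot's actual suggestion).
# Assumes FP8 KV cache for FlashInfer needs at least sm80 on the FA2 path.
import torch

def fp8_kv_cache_supported(min_capability: tuple = (8, 0)) -> bool:
    """Return True if the current CUDA device meets the assumed minimum capability."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= min_capability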

@gau-nernst changed the title from "[Bugfix] Enable FP8 KV cache for FlashInfer backend on non-sm100 GPUs" to "[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs" on Sep 10, 2025
@mgoin (Member) left a comment

LGTM, thanks! Validated locally on sm89 (L40s) for eval and perf on gsm8k. Perhaps we should default to FlashInfer for non-sm90 if it is installed and FP8 KV cache is enabled.

lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
INFO 09-10 13:32:59 [cuda.py:417] Cannot use FlashAttention backend for FP8 KV cache.
INFO 09-10 13:32:59 [cuda.py:429] Using XFormers backend.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:36<00:00, 36.48it/s, est. speed input: 36216.29 toks/s, output: 4035.67 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4033|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4026|±  |0.0135|


VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:52<00:00, 25.06it/s, est. speed input: 24882.87 toks/s, output: 2823.17 toks/s]
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4155|±  |0.0136|
|     |       |strict-match    |     5|exact_match|↑  |0.4170|±  |0.0136|


VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:29<00:00, 44.01it/s, est. speed input: 43696.65 toks/s, output: 5009.87 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4003|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3988|±  |0.0135|

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@elvischenv (Contributor)

Ampere does not support FP8 KV cache; only sm >= 89 is supported.

@mgoin (Member) commented Sep 10, 2025

@elvischenv Here are my results on A100 with FP8 KV cache; it seems to work fine for KV cache compression. Note that the available concurrency doubles.

# FlashInfer BF16
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:13 [cuda.py:285] Using FlashInfer backend on V1 engine.
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:14 [gpu_model_runner.py:2235] Model loading took 1.1201 GiB and 0.596228 seconds
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:24 [gpu_worker.py:276] Available KV cache memory: 69.66 GiB
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:25 [kv_cache_utils.py:864] GPU KV cache size: 652,144 tokens
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:25 [kv_cache_utils.py:868] Maximum concurrency for 4,096 tokens per request: 159.21x
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:40<00:00, 32.25it/s, est. speed input: 32015.64 toks/s, output: 3652.00 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4155|±  |0.0136|
|     |       |strict-match    |     5|exact_match|↑  |0.4177|±  |0.0136|

# FlashInfer FP8 KV Cache
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:33 [cuda.py:285] Using FlashInfer backend on V1 engine.
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:34 [gpu_model_runner.py:2235] Model loading took 1.1201 GiB and 0.605931 seconds
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [gpu_worker.py:276] Available KV cache memory: 69.66 GiB
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [kv_cache_utils.py:864] GPU KV cache size: 1,304,288 tokens
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [kv_cache_utils.py:868] Maximum concurrency for 4,096 tokens per request: 318.43x
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:41<00:00, 31.57it/s, est. speed input: 31342.04 toks/s, output: 3603.65 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4018|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4102|±  |0.0135|
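
The doubling in KV cache tokens (652,144 → 1,304,288) and in maximum concurrency follows directly from the per-token KV footprint halving when each cache element goes from 2 bytes to 1 byte. A back-of-the-envelope sketch (the Qwen3-0.6B shape values below are assumptions, not taken from this PR):

# Rough capacity check: per-token KV bytes = 2 (K and V) * layers * kv_heads *
# head_dim * dtype_bytes. Shape values are assumed for Qwen3-0.6B.
num_layers, num_kv_heads, head_dim = 28, 8, 128
available_kv_gib = 69.66  # from the logs above

def kv_cache_tokens(dtype_bytes: int) -> int:
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(available_kv_gib * 2**30 / bytes_per_token)

print(kv_cache_tokens(2))  # ~652k tokens for BF16, close to the logged 652,144
print(kv_cache_tokens(1))  # ~1,304k tokens for FP8, close to the logged 1,304,288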

@elvischenv (Contributor)

@mgoin That's the first I've heard of that. I had thought Ampere does not natively support FP8 compute. There is also a thread discussing this: https://discuss.vllm.ai/t/kv-cache-quantizing/749.

@elvischenv (Contributor) commented Sep 10, 2025

@gau-nernst Thanks for reporting the issue.
For query quantization support, I have created a more general fix in #24600.
It first tries an FP8 query, and then resets the query dtype to the model dtype if it finds that TRT-LLM attention was not picked.

Using supports_trtllm_attention alone may not be enough to determine the q dtype.
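
A minimal sketch of the fallback described above, with placeholder names rather than the actual code in #24600:

# Placeholder sketch of the described fallback (not the code in #24600):
# try an FP8 query dtype first, then fall back to the model dtype once it is
# known that TRT-LLM attention was not actually selected.
import torch

def resolve_query_dtype(model_dtype: torch.dtype, trtllm_attn_selected: bool) -> torch.dtype:
    fp8_query_dtype = torch.float8_e4m3fn
    return fp8_query_dtype if trtllm_attn_selected else model_dtype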

@mgoin (Member) commented Sep 10, 2025

It has been the case for a while that attention backends that only do FP8 KV cache storage, like xformers, are compatible with hardware that lacks native FP8 support, because we use __nv_fp8_storage_t, which uses uint8 as a storage container.

On the FlashInfer side, this has actually been implemented since the announcement of the project https://flashinfer.ai/2024/02/02/introduce-flashinfer.html. See this section with A100 numbers.
[Screenshot: section of the FlashInfer announcement blog post showing A100 FP8 KV cache results]
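
As a hedged illustration of the storage-only point (PyTorch-level, not vLLM's kernel code): FP8 cache entries are single bytes, so they can live in a plain uint8 buffer on GPUs without native FP8 compute and only be reinterpreted when attention runs.

# Illustration only (not vLLM's kernel code): FP8 values are 1 byte each, so a
# uint8 buffer can hold them; they are reinterpreted/upcast at compute time.
import torch

k = torch.randn(4)
k_fp8 = k.to(torch.float8_e4m3fn)                   # quantize K to FP8 (e4m3)
storage = k_fp8.view(torch.uint8)                   # same bytes, uint8 container
k_back = storage.view(torch.float8_e4m3fn).float()  # reinterpret, then upcast
print(k, k_back)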

@mgoin (Member) commented Sep 10, 2025

Thanks @elvischenv, merging this for now as the failures are unrelated.

@simon-mo merged commit a0933c3 into vllm-project:main on Sep 10, 2025 (47 of 50 checks passed).
@gau-nernst deleted the thien/flashinfer_fp8_kv branch on September 11, 2025 at 00:01.
This pull request was later referenced by commits pushed to the following forks: skyloevil/vllm (Sep 13, 2025), FeiDaLI/vllm (Sep 25, 2025), and xuebwang-amd/vllm (Oct 10, 2025 and Oct 24, 2025).

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects: None yet

4 participants