Conversation

@gau-nernst (Contributor) commented Sep 10, 2025

Purpose

FlashInfer supports FP8 KV cache on GPUs that use the FA2 backend, e.g. sm80, sm89, and sm120. Currently I only have access to an sm120 GPU, so I can only confirm that it works there.

The Triton attention backend also supports FP8 KV cache.

Changes

Test Plan

# for FlashInfer
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen3-4B --kv-cache-dtype fp8

# for Triton
# block size must be at least 32 for FP8
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve Qwen/Qwen3-4B --kv-cache-dtype fp8 --block-size 32
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "user", "content": "Who are you"}
    ]
  }'
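
For completeness, the same configuration can also be exercised offline through vLLM's Python LLM API; the sketch below mirrors the serve commands above (it is not part of this PR's test plan, and assumes the kv_cache_dtype argument behaves the same way there).

# Offline sketch mirroring the serve commands above (assumption: the Python
# LLM API accepts the same kv_cache_dtype flag; not part of this PR's tests).
import os

# Pick the attention backend before vLLM is imported, same as the env var above.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B", kv_cache_dtype="fp8")  # store K/V in FP8
outputs = llm.generate(["Who are you"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)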

Test Result

{"id":"chatcmpl-b48c5ec42e2d476d80ca8c94e695428d","object":"chat.completion","created":1757499722,"model":"Qwen/Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user asked, \"Who are you?\" I need to respond in a friendly and informative way. Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and assisting with various tasks. It's important to highlight that I'm designed to be helpful and can adapt to different needs. Also, I should invite the user to ask questions or share what they need help with. Let me make sure the tone is approachable and not too technical. I'll check for any key points I might have missed, like the fact that I'm multilingual and can handle multiple tasks. Alright, that should cover it.\n</think>\n\nHello! I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create content, and assist with a variety of tasks. I am designed to be helpful and can adapt to different needs. Feel free to ask me anything or let me know what you need help with!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":221,"completion_tokens":210,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables FP8 KV cache for the FlashInfer backend on non-sm100 GPUs. The changes correctly restrict FP8 query quantization to GPUs supporting TRT-LLM attention. However, the check for FP8 KV cache support for FlashInfer is too permissive and does not verify the GPU's compute capability, which could lead to runtime errors on unsupported hardware. I've provided a suggestion to add the necessary capability check.
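
For illustration only, a capability gate along the lines suggested could look like the sketch below; the helper name and the (8, 0) threshold are assumptions made for this example, not the actual suggestion attached to the review.

# Hypothetical capability check (illustrative; not the bot's actual suggestion).
# Assumes FP8 KV cache for FlashInfer needs at least sm80 on the FA2 path.
import torch

def fp8_kv_cache_supported(min_capability: tuple = (8, 0)) -> bool:
    """Return True if the current CUDA device meets the assumed minimum capability."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= min_capability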

@gau-nernst changed the title from "[Bugfix] Enable FP8 KV cache for FlashInfer backend on non-sm100 GPUs" to "[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs" on Sep 10, 2025
@mgoin (Member) left a comment

LGTM, thanks! Validated locally on sm89 (L40s) for eval and perf on gsm8k. Perhaps we should default to FlashInfer for non-sm90 if it is installed and FP8 KV cache is enabled.

lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
INFO 09-10 13:32:59 [cuda.py:417] Cannot use FlashAttention backend for FP8 KV cache.
INFO 09-10 13:32:59 [cuda.py:429] Using XFormers backend.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:36<00:00, 36.48it/s, est. speed input: 36216.29 toks/s, output: 4035.67 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4033|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4026|±  |0.0135|


VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:52<00:00, 25.06it/s, est. speed input: 24882.87 toks/s, output: 2823.17 toks/s]
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4155|±  |0.0136|
|     |       |strict-match    |     5|exact_match|↑  |0.4170|±  |0.0136|


VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:29<00:00, 44.01it/s, est. speed input: 43696.65 toks/s, output: 5009.87 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4003|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3988|±  |0.0135|

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@elvischenv (Contributor)

Ampere does not support FP8 KV cache; only sm >= 89 is supported.

@mgoin (Member) commented Sep 10, 2025

@elvischenv Here are my results on A100 with FP8 KV cache; it seems to work fine for KV cache compression. Note that the available concurrency doubles.

# FlashInfer BF16
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:13 [cuda.py:285] Using FlashInfer backend on V1 engine.
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:14 [gpu_model_runner.py:2235] Model loading took 1.1201 GiB and 0.596228 seconds
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:24 [gpu_worker.py:276] Available KV cache memory: 69.66 GiB
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:25 [kv_cache_utils.py:864] GPU KV cache size: 652,144 tokens
(EngineCore_DP0 pid=643458) INFO 09-10 13:29:25 [kv_cache_utils.py:868] Maximum concurrency for 4,096 tokens per request: 159.21x
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:40<00:00, 32.25it/s, est. speed input: 32015.64 toks/s, output: 3652.00 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4155|±  |0.0136|
|     |       |strict-match    |     5|exact_match|↑  |0.4177|±  |0.0136|

# FlashInfer FP8 KV Cache
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:33 [cuda.py:285] Using FlashInfer backend on V1 engine.
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:34 [gpu_model_runner.py:2235] Model loading took 1.1201 GiB and 0.605931 seconds
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [gpu_worker.py:276] Available KV cache memory: 69.66 GiB
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [kv_cache_utils.py:864] GPU KV cache size: 1,304,288 tokens
(EngineCore_DP0 pid=640673) INFO 09-10 13:27:43 [kv_cache_utils.py:868] Maximum concurrency for 4,096 tokens per request: 318.43x
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:41<00:00, 31.57it/s, est. speed input: 31342.04 toks/s, output: 3603.65 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4018|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4102|±  |0.0135|
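
The doubling in KV cache tokens (652,144 → 1,304,288) and in maximum concurrency follows directly from the per-token KV footprint halving when each cache element goes from 2 bytes to 1 byte. A back-of-the-envelope sketch (the Qwen3-0.6B shape values below are assumptions, not taken from this PR):

# Rough capacity check: per-token KV bytes = 2 (K and V) * layers * kv_heads *
# head_dim * dtype_bytes. Shape values are assumed for Qwen3-0.6B.
num_layers, num_kv_heads, head_dim = 28, 8, 128
available_kv_gib = 69.66  # from the logs above

def kv_cache_tokens(dtype_bytes: int) -> int:
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(available_kv_gib * 2**30 / bytes_per_token)

print(kv_cache_tokens(2))  # ~652k tokens for BF16, close to the logged 652,144
print(kv_cache_tokens(1))  # ~1,304k tokens for FP8, close to the logged 1,304,288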

@elvischenv (Contributor)

@mgoin That's the first I've heard of that. I had thought Ampere does not natively support FP8 compute. There is also a thread discussing this: https://discuss.vllm.ai/t/kv-cache-quantizing/749.

@elvischenv (Contributor) commented Sep 10, 2025

@gau-nernst Thanks for reporting the issue.
For query quantization support, I have created a more general fix in #24600.
It first tries an FP8 query, and then resets the query dtype to the model dtype if it finds that TRT-LLM attention was not picked.

Using supports_trtllm_attention alone may not be enough to determine the q dtype.
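
A minimal sketch of the fallback described above, with placeholder names rather than the actual code in #24600:

# Placeholder sketch of the described fallback (not the code in #24600):
# try an FP8 query dtype first, then fall back to the model dtype once it is
# known that TRT-LLM attention was not actually selected.
import torch

def resolve_query_dtype(model_dtype: torch.dtype, trtllm_attn_selected: bool) -> torch.dtype:
    fp8_query_dtype = torch.float8_e4m3fn
    return fp8_query_dtype if trtllm_attn_selected else model_dtype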

@mgoin (Member) commented Sep 10, 2025

It has been the case for a while that attention backends that only do FP8 KV cache storage, like xformers, are compatible with hardware that lacks native FP8 support, because we use __nv_fp8_storage_t, which uses uint8 as a storage container.

On the FlashInfer side, this has actually been implemented since the announcement of the project https://flashinfer.ai/2024/02/02/introduce-flashinfer.html. See this section with A100 numbers.
[Screenshot: section of the FlashInfer announcement blog post showing A100 FP8 KV cache results]
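
As a hedged illustration of the storage-only point (PyTorch-level, not vLLM's kernel code): FP8 cache entries are single bytes, so they can live in a plain uint8 buffer on GPUs without native FP8 compute and only be reinterpreted when attention runs.

# Illustration only (not vLLM's kernel code): FP8 values are 1 byte each, so a
# uint8 buffer can hold them; they are reinterpreted/upcast at compute time.
import torch

k = torch.randn(4)
k_fp8 = k.to(torch.float8_e4m3fn)                   # quantize K to FP8 (e4m3)
storage = k_fp8.view(torch.uint8)                   # same bytes, uint8 container
k_back = storage.view(torch.float8_e4m3fn).float()  # reinterpret, then upcast
print(k, k_back)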

@mgoin (Member) commented Sep 10, 2025

Thanks @elvischenv, merging this for now as the failures are unrelated.

@simon-mo merged commit a0933c3 into vllm-project:main on Sep 10, 2025 (47 of 50 checks passed).
@gau-nernst deleted the thien/flashinfer_fp8_kv branch on September 11, 2025 at 00:01.
This pull request was later referenced by commits pushed to the following forks: skyloevil/vllm (Sep 13, 2025), FeiDaLI/vllm (Sep 25, 2025), and xuebwang-amd/vllm (Oct 10, 2025 and Oct 24, 2025).

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects: None yet

4 participants