[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs #24577
Conversation
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Code Review
This pull request enables FP8 KV cache for the FlashInfer backend on non-sm100 GPUs. The changes correctly restrict FP8 query quantization to GPUs supporting TRT-LLM attention. However, the check for FP8 KV cache support for FlashInfer is too permissive and does not verify the GPU's compute capability, which could lead to runtime errors on unsupported hardware. I've provided a suggestion to add the necessary capability check.
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
LGTM, thanks! Validated locally on sm89 (L40s) for eval and perf on gsm8k. Perhaps we should default to flashinfer for non-sm90 if it is installed and fp8 kv cache is enabled.
lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
INFO 09-10 13:32:59 [cuda.py:417] Cannot use FlashAttention backend for FP8 KV cache.
INFO 09-10 13:32:59 [cuda.py:429] Using XFormers backend.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:36<00:00, 36.48it/s, est. speed input: 36216.29 toks/s, output: 4035.67 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.4033|± |0.0135|
| | |strict-match | 5|exact_match|↑ |0.4026|± |0.0135|
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:52<00:00, 25.06it/s, est. speed input: 24882.87 toks/s, output: 2823.17 toks/s]
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.4155|± |0.0136|
| | |strict-match | 5|exact_match|↑ |0.4170|± |0.0136|
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
(EngineCore_DP0 pid=1122864) INFO 09-10 13:34:44 [cuda.py:285] Using FlashInfer backend on V1 engine.
...
Processed prompts: 100%|███████████████████████████████| 1319/1319 [00:29<00:00, 44.01it/s, est. speed input: 43696.65 toks/s, output: 5009.87 toks/s]
vllm (pretrained=Qwen/Qwen3-0.6B,max_model_len=4096,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.4003|± |0.0135|
| | |strict-match | 5|exact_match|↑ |0.3988|± |0.0135|
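As a rough illustration of the suggestion above (default to FlashInfer on pre-sm90 GPUs when it is installed and an FP8 KV cache is requested), the heuristic could be expressed as follows. The function names are invented for this sketch; this is not vLLM's backend-selection code.

```python
# Sketch of the proposed default: prefer FlashInfer on pre-sm90 GPUs when an
# FP8 KV cache is requested and the flashinfer package is importable.
import importlib.util
import torch

def has_flashinfer() -> bool:
    return importlib.util.find_spec("flashinfer") is not None

def pick_attention_backend(kv_cache_dtype: str) -> str:
    major, _ = torch.cuda.get_device_capability()
    if kv_cache_dtype.startswith("fp8") and major < 9 and has_flashinfer():
        return "FLASHINFER"
    return "FLASH_ATTN"  # otherwise fall through to the usual selection order
```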
Ampere does not support FP8 KV cache; only sm>=89 is supported.
@elvischenv Here are my results on A100 with FP8 KV cache; it seems to work fine for KV cache compression. Note that the available concurrency doubles.
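To make the "available concurrency doubles" point concrete, here is a back-of-the-envelope sketch; the layer/head/dim numbers are placeholders, not the exact Qwen3-0.6B configuration.

```python
# FP8 stores 1 byte per cached element versus 2 for FP16/BF16, so the same
# KV-cache budget holds roughly twice as many tokens (~2x concurrency).
num_layers, num_kv_heads, head_dim = 28, 8, 128  # placeholder model shape

def kv_bytes_per_token(elem_bytes: int) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * elem_bytes  # key + value

print(kv_bytes_per_token(2))  # fp16/bf16 KV cache
print(kv_bytes_per_token(1))  # fp8 KV cache: half the bytes per token
```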
@mgoin First time hearing that. I just thought Ampere does not support FP8 compute natively. There is also a thread discussing this: https://discuss.vllm.ai/t/kv-cache-quantizing/749
@gau-nernst Thanks for reporting the issue. Only using
It has been the case for a while that attention backends that only do FP8 KV cache storage, like xformers, are compatible with hardware that lacks native FP8 support, because we use
On the FlashInfer side, this has actually been implemented since the project's announcement: https://flashinfer.ai/2024/02/02/introduce-flashinfer.html. See the section there with A100 numbers.
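As a toy PyTorch illustration of the "store in FP8, compute in higher precision" behavior described above (this is not vLLM's or FlashInfer's kernel code, just a sketch of the idea, and it assumes a PyTorch build with float8 dtypes):

```python
# The cache holds float8_e4m3fn values; they are upcast to the attention dtype
# before the matmul, so only FP8 *storage* is needed, not FP8 tensor cores.
import torch

attn_dtype = torch.float16
k = torch.randn(16, 128, dtype=attn_dtype)

scale = k.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
k_cache = (k / scale).to(torch.float8_e4m3fn)   # what the KV cache stores

k_dequant = k_cache.to(attn_dtype) * scale      # dequantize before attention math
print((k - k_dequant).abs().max())              # small quantization error
```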
Thanks @elvischenv, merging this for now as failures are unrelated.

Purpose
FlashInfer supports FP8 KV cache on GPUs that use the FA2 backend, e.g. sm80, sm89, and sm120. Currently I only have access to an sm120 GPU, so that is the only configuration I can confirm works.
The Triton attention backend also supports FP8 KV cache.
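Since only sm120 could be verified locally, a quick smoke test on other GPUs might look like the sketch below, mirroring the eval commands in the discussion. It uses vLLM's offline API and assumes VLLM_ATTENTION_BACKEND=FLASHINFER is set in the environment to force the FlashInfer backend.

```python
# Minimal FP8 KV cache smoke test (sketch). Run with
# VLLM_ATTENTION_BACKEND=FLASHINFER to force the FlashInfer backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    max_model_len=4096,
    kv_cache_dtype="fp8",
    trust_remote_code=True,
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```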
Changes
is_kv_cache_dtype_supported() returns True
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.