[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic #24600
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
cc @ProExpertProg @mgoin @pavanimajety for review, thanks!
Looks good overall, just one nit
Sorry, two more notes. Does this mean that when has_sinks is enabled, attention+quant fusion won't work?
```python
@functools.cache
def force_use_trtllm_attention() -> Optional[bool]:
```
You shouldn't read envs in cached functions; please separate this into `force_use_trtllm_attention` (uncached) and `_force_use_trtllm_attention` (cached, takes the env var value as input).
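The reviewer's suggestion, as a minimal sketch: the uncached wrapper reads the environment on every call and passes the value into the cached helper, so the cache key includes the env state (reading the env via `os.environ` here is illustrative; vLLM's own `envs` module may differ):

```python
import functools
import os
from typing import Optional


def force_use_trtllm_attention() -> Optional[bool]:
    """Uncached: read the env var at call time and delegate."""
    return _force_use_trtllm_attention(
        os.environ.get("VLLM_USE_TRTLLM_ATTENTION"))


@functools.cache
def _force_use_trtllm_attention(env_value: Optional[str]) -> Optional[bool]:
    """Cached on the env value itself, so nothing is read inside the
    cached body.

    None  -> no user override (fall back to auto-detection)
    "1"   -> force TRTLLM attention
    "0"   -> forbid TRTLLM attention
    """
    if env_value is None:
        return None
    return env_value == "1"
```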
Addressed the comments above.
> Does this mean that when has_sinks is enabled, attention+quant fusion won't work?
- If kv=auto, everything works fine; it will use the TRTLLM BF16-qkv BF16-out kernel.
- If kv=fp8, by default it will always quantize the query and use the TRTLLM FP8-qkv kernel.
  - In this case, we found some accuracy issues in the FP8-qkv BF16-out sinks kernel. That has been fixed in TRTLLM upstream and needs to be propagated to Flashinfer and vLLM. For now we just raise an error and suggest the user use kv=auto.
  - There is a WAR for using the BF16-q FP8-kv BF16-out kernel by setting VLLM_FLASHINFER_DISABLE_Q_QUANTIZATION, introduced by [flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention #24197.
Back to attention+quant fusion: AFAIR, only gpt-oss needs attn sinks, and we haven't started quantizing the attn output for it. If we want to enable attention+quant fusion for gpt-oss, we have to at least:
- Quantize the gpt-oss model.
- Ensure the FP8-qkv FP8/NVFP4-out attn sinks kernels work and have good accuracy on gpt-oss.

So for now, we need to fix the FP8-qkv BF16-out sinks kernel first, and then verify the FP8-qkv FP8/NVFP4-out kernels if we need them.
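A hedged sketch of the kv-cache cases described above, written as a decision table; the helper name, return values, and error message are illustrative, and only the env var names come from this thread:

```python
def select_trtllm_sinks_kernel(kv_cache_dtype: str,
                               disable_q_quant: bool) -> str:
    """Illustrative decision table for attention with sinks (e.g. gpt-oss).

    kv_cache_dtype:  "auto" or "fp8" (vLLM's kv cache dtype setting)
    disable_q_quant: mirrors VLLM_FLASHINFER_DISABLE_Q_QUANTIZATION (#24197)
    """
    if kv_cache_dtype == "auto":
        # Everything works: BF16-qkv, BF16-out TRTLLM kernel.
        return "trtllm_bf16_qkv_bf16_out"
    if kv_cache_dtype == "fp8":
        if disable_q_quant:
            # WAR: keep the query in BF16; only the KV cache is FP8.
            return "trtllm_bf16_q_fp8_kv_bf16_out"
        # The default fp8 path would also quantize the query, but the
        # FP8-qkv BF16-out sinks kernel currently has accuracy issues
        # (fixed upstream in TRTLLM, not yet propagated), so raise.
        raise ValueError(
            "FP8 query quantization with attention sinks is not supported "
            "yet; use kv_cache_dtype=auto or set "
            "VLLM_FLASHINFER_DISABLE_Q_QUANTIZATION=1")
    raise ValueError(f"Unsupported kv_cache_dtype: {kv_cache_dtype}")
```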
Purpose
#23647 always quantizes the query if the kv cache dtype is set to FP8 and always uses the TRTLLM attention kernel. However, there are many cases that do not support TRTLLM attention.
- `VLLM_USE_TRTLLM_ATTENTION` should determine the "use" path instead of the "support" path (see the sketch below).
- `--max-model-len` is not specified.
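A minimal sketch of the intended split between the "support" path (pure capability checks) and the "use" path (where the `VLLM_USE_TRTLLM_ATTENTION` override is applied); the function names other than `force_use_trtllm_attention` discussed above, and the concrete conditions, are illustrative assumptions rather than vLLM's exact logic:

```python
from typing import Optional


def supports_trtllm_attention(is_sm100: bool, head_dim: int,
                              num_heads: int, num_kv_heads: int) -> bool:
    # "Support" path: hardware/shape capability checks only,
    # never consults environment variables. Conditions are illustrative.
    return (is_sm100
            and head_dim in (64, 128)
            and num_heads % num_kv_heads == 0)


def use_trtllm_attention(is_sm100: bool, head_dim: int, num_heads: int,
                         num_kv_heads: int, force: Optional[bool]) -> bool:
    # "Use" path: the explicit override (VLLM_USE_TRTLLM_ATTENTION, passed
    # in here as `force`) can only choose among supported options; it
    # cannot make an unsupported configuration use the TRTLLM kernel.
    supported = supports_trtllm_attention(is_sm100, head_dim,
                                          num_heads, num_kv_heads)
    if force is False:   # user explicitly disabled TRTLLM attention
        return False
    if force is True:    # user requested it: honor only when supported
        return supported
    return supported     # auto mode: default/heuristic decision
```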
Test Plan && Test Result
- SM100, kv_cache_dtype=fp8
- SM100, kv_cache_dtype=fp8, VLLM_USE_TRTLLM_ATTENTION=0
- SM120, kv_cache_dtype=fp8, VLLM_ATTENTION_BACKEND=FLASHINFER
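For reference, a hedged sketch of how the second configuration above could be reproduced offline; the model id is a placeholder, and the env var is set before importing vLLM so it is picked up at startup:

```python
import os

# Mirror "kv_cache_dtype=fp8, VLLM_USE_TRTLLM_ATTENTION=0":
# disable the TRTLLM attention override before vLLM reads its env vars.
os.environ["VLLM_USE_TRTLLM_ATTENTION"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="your-model-here",   # placeholder model id
    kv_cache_dtype="fp8",      # matches the fp8 kv cache test cases
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```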