[Bugfix][Attention] Fix FlashInfer MLA block size logic #24692
Conversation
Code Review
This pull request correctly fixes a bug where using the FLASHINFER_MLA backend without specifying a block size would cause an error. The changes ensure that a supported block size (64) is automatically selected, similar to how other MLA backends are handled. The changes in check_and_update_config are correct and directly address the issue. The logic for auto-selecting the FLASHINFER_MLA backend in get_attn_backend_cls is also a good addition. I have one suggestion to improve the future-proofing for the auto-selection logic. Overall, this is a good fix.
use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
    selected_backend is None and cls.is_device_capability(100)
    and block_size in [32, 64])
The use of cls.is_device_capability(100) for auto-selecting the FlashInfer MLA backend is too restrictive. It will only match for devices with exactly compute capability 10.0 (Blackwell), and will not automatically select this backend for future architectures with higher compute capabilities (e.g., > 10.0).
The corresponding test for this kernel (tests/kernels/attention/test_flashinfer_mla_decode.py) uses current_platform.has_device_capability(100), which suggests the kernel is expected to work on compute capabilities 10.0 and above.
To ensure future compatibility and correct auto-selection on upcoming hardware, cls.has_device_capability(100) should be used instead. This will match devices with compute capability 10.0 or greater.
A similar issue exists for the cutlass_mla backend logic, which you may want to address in a separate change for consistency.
Suggested change:

-    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
-        selected_backend is None and cls.is_device_capability(100)
-        and block_size in [32, 64])
+    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
+        selected_backend is None and cls.has_device_capability(100)
+        and block_size in [32, 64])
LucasWilkinson left a comment
LGTM; thanks!
Purpose
Before, specifying the FLASHINFER_MLA backend without a block size would lead to an error: the block size would default to 16, but the backend only supports 32 or 64. This PR fixes it by overriding the block size in a manner similar to the CUTLASS_MLA backend.
Test Plan
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=128 --output-len=128 --num-prompts=100 --kv-cache-dtype=auto
Test Result
(no error, block size set automatically to 64)
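For completeness, before this fix the error could presumably be worked around by passing a supported block size explicitly. A minimal offline sketch, assuming block_size is forwarded to the engine arguments as usual and that a GPU supported by FlashInfer MLA is available:

```python
import os

from vllm import LLM, SamplingParams

# Select the FlashInfer MLA attention backend, as in the test plan above.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER_MLA"

# Explicitly request a block size the backend supports (32 or 64); with this
# PR, 64 is chosen automatically when the block size is left unset.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", block_size=64)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```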