[Bugfix][Attention] Fix FlashInfer MLA block size logic #24692
Conversation
Code Review
This pull request correctly fixes a bug where using the FLASHINFER_MLA backend without specifying a block size would cause an error. The changes ensure that a supported block size (64) is automatically selected, similar to how other MLA backends are handled. The changes in check_and_update_config are correct and directly address the issue. The logic for auto-selecting the FLASHINFER_MLA backend in get_attn_backend_cls is also a good addition. I have one suggestion to improve the future-proofing for the auto-selection logic. Overall, this is a good fix.
use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
    selected_backend is None and cls.is_device_capability(100)
    and block_size in [32, 64])
The use of cls.is_device_capability(100) for auto-selecting the FlashInfer MLA backend is too restrictive. It will only match for devices with exactly compute capability 10.0 (Blackwell), and will not automatically select this backend for future architectures with higher compute capabilities (e.g., > 10.0).
The corresponding test for this kernel (tests/kernels/attention/test_flashinfer_mla_decode.py) uses current_platform.has_device_capability(100), which suggests the kernel is expected to work on compute capabilities 10.0 and above.
To ensure future compatibility and correct auto-selection on upcoming hardware, cls.has_device_capability(100) should be used instead. This will match devices with compute capability 10.0 or greater.
A similar issue exists for the cutlass_mla backend logic, which you may want to address in a separate change for consistency.
Suggested change:

-    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
-        selected_backend is None and cls.is_device_capability(100)
-        and block_size in [32, 64])
+    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
+        selected_backend is None and cls.has_device_capability(100)
+        and block_size in [32, 64])
LucasWilkinson left a comment
LGTM; thanks!
Purpose
Before, specifying the FLASHINFER_MLA backend without a block size would lead to an error: the block size would default to 16, but the backend only supports 32 or 64. This PR fixes it by overriding the block size in a manner similar to the CUTLASS_MLA backend.
Test Plan
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=128 --output-len=128 --num-prompts=100 --kv-cache-dtype=auto
Test Result
(no error, block size set automatically to 64)
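For completeness, before this fix the error could presumably be worked around by passing a supported block size explicitly. A minimal offline sketch, assuming block_size is forwarded to the engine arguments as usual and that a GPU supported by FlashInfer MLA is available:

```python
import os

from vllm import LLM, SamplingParams

# Select the FlashInfer MLA attention backend, as in the test plan above.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER_MLA"

# Explicitly request a block size the backend supports (32 or 64); with this
# PR, 64 is chosen automatically when the block size is left unset.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", block_size=64)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```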