
Conversation

@MatthewBonanni MatthewBonanni commented Sep 11, 2025

Purpose

Before this change, specifying the FLASHINFER_MLA backend without an explicit block size would lead to an error: the block size defaulted to 16, while the backend only supports 32 or 64. This PR fixes the issue by overriding the block size to a supported value, in the same manner as the CUTLASS_MLA backend.
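
For context, the fix follows the pattern vLLM already uses for CUTLASS_MLA: the platform's config hook bumps an unsupported default block size before the attention backend is constructed. Below is a minimal, hedged sketch of that idea, not the PR's exact code; the CacheConfig stand-in, the function signature, and the hard-coded fallback of 64 are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class CacheConfig:          # stand-in for vLLM's cache config object
    block_size: int = 16    # the historical default that FlashInfer MLA rejects

def check_and_update_config(cache_config: CacheConfig, selected_backend: str) -> None:
    # FlashInfer MLA only supports block sizes of 32 or 64; if the user did not
    # pick one of those, override the default instead of failing later.
    if selected_backend == "FLASHINFER_MLA" and cache_config.block_size not in (32, 64):
        cache_config.block_size = 64

cfg = CacheConfig()
check_and_update_config(cfg, "FLASHINFER_MLA")
assert cfg.block_size == 64   # matches the behavior reported under Test Result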

Test Plan

VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=128 --output-len=128 --num-prompts=100 --kv-cache-dtype=auto

Test Result

(no error, block size set automatically to 64)

Throughput: 28.96 requests/s, 7411.13 total tokens/s, 3706.87 output tokens/s
Total num prompt tokens:  12791
Total num output tokens:  12800

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug where using the FLASHINFER_MLA backend without specifying a block size would cause an error. The changes ensure that a supported block size (64) is automatically selected, similar to how other MLA backends are handled. The changes in check_and_update_config are correct and directly address the issue. The logic for auto-selecting the FLASHINFER_MLA backend in get_attn_backend_cls is also a good addition. I have one suggestion to improve the future-proofing of the auto-selection logic. Overall, this is a good fix.

Comment on lines +239 to +241
use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
    selected_backend is None and cls.is_device_capability(100)
    and block_size in [32, 64])

Severity: high

The use of cls.is_device_capability(100) for auto-selecting the FlashInfer MLA backend is too restrictive. It will only match for devices with exactly compute capability 10.0 (Blackwell), and will not automatically select this backend for future architectures with higher compute capabilities (e.g., > 10.0).

The corresponding test for this kernel (tests/kernels/attention/test_flashinfer_mla_decode.py) uses current_platform.has_device_capability(100), which suggests the kernel is expected to work on compute capabilities 10.0 and above.

To ensure future compatibility and correct auto-selection on upcoming hardware, cls.has_device_capability(100) should be used instead. This will match devices with compute capability 10.0 or greater.

A similar issue exists for the cutlass_mla backend logic, which you may want to address in a separate change for consistency.

Suggested change
-    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
-        selected_backend is None and cls.is_device_capability(100)
-        and block_size in [32, 64])
+    use_flashinfermla = selected_backend == _Backend.FLASHINFER_MLA or (
+        selected_backend is None and cls.has_device_capability(100)
+        and block_size in [32, 64])
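
To make the distinction concrete, here is a small, self-contained illustration of the two checks; the real vLLM Platform methods query the detected GPU rather than taking the capability as an argument, so the function signatures below are assumptions for demonstration only.

def is_device_capability(detected: int, target: int) -> bool:
    # Exact match: only compute capability 10.0 (encoded as 100) passes.
    return detected == target

def has_device_capability(detected: int, minimum: int) -> bool:
    # At-least match: 10.0 and any future, higher capability passes.
    return detected >= minimum

for cap in (90, 100, 110):   # 9.0, 10.0, and a hypothetical 11.0
    print(cap, is_device_capability(cap, 100), has_device_capability(cap, 100))
# 90  -> False, False
# 100 -> True,  True
# 110 -> False, True   (only has_device_capability keeps future hardware eligible)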

@LucasWilkinson LucasWilkinson left a comment

LGTM; thanks!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) September 11, 2025 20:23
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) Sep 11, 2025
@LucasWilkinson LucasWilkinson merged commit d4fd276 into vllm-project:main Sep 11, 2025
51 checks passed
@MatthewBonanni MatthewBonanni deleted the fix_fi_block_size branch September 11, 2025 23:37
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
…#24692)

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
…#24692)

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…#24692)

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…#24692)

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…#24692)

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>