[Hardware][AMD] Enable FlexAttention backend on ROCm #26439

mawong-amd · 2025-10-08T19:34:00Z

Purpose

Enable FlexAttention on ROCm if it is specified e.g. via VLLM_ATTENTION_BACKEND=FLEX_ATTENTION. This makes progress towards batch invariant inference on AMD.

Test Plan

Comparing correctness of default attention backend vs FlexAttention
lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,max_retries=5 --tasks gsm8k

Test Result

Default:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5011|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.5011|±  |0.0138|

FlexAttention:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4882|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.4875|±  |0.0138|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

…to loader * 'loader' of https://github.com/dsxsteven/vllm_splitPR: (778 commits) [torchao] Add support for ModuleFqnToConfig using regex (vllm-project#26001) Add: Support for multiple hidden layers in Eagle3 (vllm-project#26164) Enable `RMSNorm` substitution for Transformers backend (vllm-project#26353) [Model] Gemma3: Fix GGUF loading and quantization (vllm-project#26189) Bump Flashinfer to v0.4.0 (vllm-project#26326) Update Dockerfile and install runai-model-streamer[gcs] package (vllm-project#26464) [Core] Relax the LoRA max rank (vllm-project#26461) [CI/Build] Fix model nightly tests (vllm-project#26466) [Hybrid]: Decouple Kernel Block Size from KV Page Size (vllm-project#24486) [Core][KVConnector] Propagate all tokens on resumed preemptions (vllm-project#24926) [MM][Doc] Add documentation for configurable mm profiling (vllm-project#26200) [Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439) [Bugfix] Incorrect another MM data format in vllm bench throughput (vllm-project#26462) [Bugfix] Catch and log invalid token ids in detokenizer #2 (vllm-project#26445) [Minor] Change warning->warning_once in preprocess (vllm-project#26455) [Bugfix] Set the minimum python version for gpt-oss (vllm-project#26392) [Misc] Redact ray runtime env before logging (vllm-project#26302) Separate MLAAttention class from Attention (vllm-project#25103) [Attention] Register FLASHMLA_SPARSE (vllm-project#26441) [Kernels] Modular kernel refactor (vllm-project#24812) ...

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

mergify bot added the rocm Related to AMD ROCm label Oct 8, 2025

Add FlexAttention backend on ROCm

f643ec2

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

mawong-amd force-pushed the mawong/enable_flex_attn branch from 3cd0c5d to f643ec2 Compare October 8, 2025 19:43

mawong-amd marked this pull request as ready for review October 8, 2025 19:43

DarkLight1337 approved these changes Oct 9, 2025

View reviewed changes

DarkLight1337 enabled auto-merge (squash) October 9, 2025 04:27

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025

DarkLight1337 merged commit de253d6 into vllm-project:main Oct 9, 2025
46 checks passed

mawong-amd deleted the mawong/enable_flex_attn branch October 9, 2025 15:48

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

7fa668b

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

c47e1d6

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>

vllmellm mentioned this pull request Oct 16, 2025

[Bug] [ROCm] [Flex Attention]: Unable to run #27007

Open

1 task

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

d2e0988

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

2236911

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

3a09d55

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

75bca44

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025

[Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439

3f455e2

) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Hardware][AMD] Enable FlexAttention backend on ROCm #26439

[Hardware][AMD] Enable FlexAttention backend on ROCm #26439

Uh oh!

mawong-amd commented Oct 8, 2025 •

edited by github-actions bot

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

[Hardware][AMD] Enable FlexAttention backend on ROCm #26439

[Hardware][AMD] Enable FlexAttention backend on ROCm #26439

Uh oh!

Conversation

mawong-amd commented Oct 8, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mawong-amd commented Oct 8, 2025 •

edited by github-actions bot

Loading