
Conversation

@AlpinDale (Contributor) commented on Nov 6, 2025

Purpose

FA3 is only used on Hopper, yet it is built for the Ampere (sm_80, sm_87) and Ada (sm_89) arches as well. This needlessly increases build times and inflates the wheel size. FA3 technically supports A100 (sm_80), but as discussed offline with @mgoin, it is not stable there and vLLM does not actually enable it.

vLLM FA PR: vllm-project/flash-attention#105


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@gemini-code-assist (bot) left a comment


Code Review

This pull request aims to restrict the build of FlashAttention 3 (FA3) to only Hopper architectures to reduce build times and wheel size. It does this by updating the flash-attention dependency to a fork that contains the necessary CMake changes and by modifying the build conditions in setup.py.

My review identifies two issues in setup.py:

  1. A new helper function _has_sm90_or_higher uses a broad except Exception: pass, which can hide errors. I've suggested adding logging.
  2. The conditional logic for building the FA3 extension is flawed. It uses or instead of and, which makes the condition more permissive rather than more restrictive and could lead to build failures. I've provided a corrected version of the logic.

These changes are important for the stability and correctness of the build process.
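
For illustration only, here is a rough sketch of how the two suggestions could fit together in setup.py. The helper body, the logger setup, and the _is_cuda_build flag are assumptions made for this sketch, not the actual vLLM code:

    import logging

    import torch

    logger = logging.getLogger(__name__)


    def _has_sm90_or_higher() -> bool:
        # Best-effort check for a Hopper-or-newer (sm90+) target. The real
        # setup.py may consult TORCH_CUDA_ARCH_LIST rather than local devices.
        try:
            return any(
                torch.cuda.get_device_capability(i)[0] >= 9
                for i in range(torch.cuda.device_count())
            )
        except Exception:
            # Log the failure instead of silently swallowing it.
            logger.warning(
                "Could not query CUDA device capability; assuming no sm90+ target",
                exc_info=True,
            )
            return False


    # Restrictive rather than permissive: build FA3 only when this is a CUDA
    # build AND an sm90+ target is present (`and`, not `or`).
    _is_cuda_build = True  # placeholder for the real CUDA-build check
    build_fa3 = _is_cuda_build and _has_sm90_or_higher()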

Review thread on the CMake dependency change:

    GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
    GIT_TAG a893712401d70362fbb299cd9c4b3476e8e9ed54
    # temporary until vllm-project/flash-attention#105 is merged
    GIT_REPOSITORY https://github.com/AlpinDale/flash-attention.git
Collaborator commented:
We should wait until the above PR is merged and then use the flash-attention repo under vllm-project.

@mergify (bot) commented on Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AlpinDale.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@LucasWilkinson (Collaborator) commented:

Apologies for the delay; there were offline discussions about deprecating FA2 and using FA3 on Ampere, but that seems far enough out that it makes sense to deprecate the non-Hopper FA3 builds for now.

The FA-side PR has been merged.

Review thread on the setup.py change:

        return version


    def _has_sm90_or_higher() -> bool:
@AlpinDale (Contributor, PR author) commented:

@LucasWilkinson thoughts on excluding Blackwell from FA3 builds as well? From what I understand, Blackwell still uses FA2 (at least until FA4 support lands).

@LucasWilkinson (Collaborator) commented on Nov 17, 2025:

Hmm, this _has_sm90_or_higher is a bit messy; sorry, I should have reviewed this more closely.

We should do one of the following:

  1. If arch_list is not set, fall back to torch.cuda.get_arch_list() (see the sketch below).
  2. Handle this at the CMake level like FlashMLA and just create empty targets when no kernels would get built:

         else()
             # Create empty targets for setup.py when not targeting sm90a systems
             add_custom_target(_flashmla_C)
             add_custom_target(_flashmla_extension_C)
         endif()

My preference is #2, in which case we wouldn't need to filter for Blackwell here.
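
As a rough illustration of option 1 only, the fallback could look something like the following; the helper names and the TORCH_CUDA_ARCH_LIST handling are hypothetical, not the actual setup.py code:

    import os
    import re

    import torch


    def _target_arches() -> list[str]:
        # Prefer an explicit TORCH_CUDA_ARCH_LIST; otherwise fall back to the
        # arch list the installed torch build reports.
        env_arches = os.environ.get("TORCH_CUDA_ARCH_LIST", "").strip()
        if env_arches:
            return [a for a in re.split(r"[;,\s]+", env_arches) if a]
        return torch.cuda.get_arch_list()


    def _targets_hopper(arches: list[str]) -> bool:
        # True if any target arch is Hopper: "9.0"/"9.0a" (env style) or
        # "sm_90"/"sm_90a" (torch style).
        return any(a.startswith(("9.0", "sm_90")) for a in arches)

Option 2 would instead leave setup.py alone and let CMake define empty FA3 targets whenever no sm90 arch is requested.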

@AlpinDale (Contributor, PR author) replied:

Understood. I'm not a fan of the current approach either; I'll change it to follow FlashMLA's example later today.

Collaborator replied:

thank you! 🙏
