[build] fix: only build FA3 for Hopper #28205
Conversation
Signed-off-by: AlpinDale <alpindale@gmail.com>
Code Review
This pull request aims to restrict the build of FlashAttention 3 (FA3) to only Hopper architectures to reduce build times and wheel size. It does this by updating the flash-attention dependency to a fork that contains the necessary CMake changes and by modifying the build conditions in setup.py.
My review identifies two issues in setup.py:
- A new helper function `_has_sm90_or_higher` uses a broad `except Exception: pass`, which can hide errors. I've suggested adding logging.
- The conditional logic for building the FA3 extension is flawed. It uses `or` instead of `and`, which makes the condition more permissive instead of restrictive, and could lead to build failures. I've provided a corrected version of the logic.
These changes are important for the stability and correctness of the build process.
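For illustration, here is a minimal sketch of what the helper could look like with both suggestions applied (logging instead of a silent `except`, and a check that only passes for SM 9.0+ targets); the parsing details are assumptions, not the PR's actual code:

```python
import logging
import os

logger = logging.getLogger(__name__)


def _has_sm90_or_higher() -> bool:
    """Return True if any requested CUDA arch is SM 9.0 (Hopper) or newer."""
    try:
        # TORCH_CUDA_ARCH_LIST entries look like "8.0", "9.0a", or "9.0+PTX".
        arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
        for arch in arch_list.replace(";", " ").split():
            major = arch.split(".", 1)[0]
            if major.isdigit() and int(major) >= 9:
                return True
    except Exception:
        # Per the review suggestion: log the failure rather than silently passing.
        logger.exception("Could not parse TORCH_CUDA_ARCH_LIST")
    return False
```

The FA3 build condition would then combine this helper with the existing enablement flag using `and` rather than `or`, so the extension is built only when a Hopper target is actually requested.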
```diff
-    GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
     GIT_TAG a893712401d70362fbb299cd9c4b3476e8e9ed54
+    # temporary until vllm-project/flash-attention#105 is merged
+    GIT_REPOSITORY https://github.com/AlpinDale/flash-attention.git
```
We should wait until the above PR is merged and then use the flash-attention repo in vllm-project.
This pull request has merge conflicts that must be resolved before it can be merged.
Apologies for the delay; there were offline discussions about deprecating FA2 and using Ampere for FA3, but that seems far enough out that it makes sense to go ahead with this for now. The FA-side PR has been merged.
```diff
     return version
 
 
+def _has_sm90_or_higher() -> bool:
```
@LucasWilkinson thoughts on excluding Blackwell from FA3 builds as well? From what I understand, Blackwell still uses FA2 (at least until FA4 support lands).
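(As a side note on the Blackwell question: since Blackwell reports SM 10.x/12.x, a check named "sm90 or higher" would include it, while excluding it means matching major version 9 exactly. A hypothetical sketch, not code from this PR:)

```python
def _is_hopper_arch(arch: str) -> bool:
    # Hopper is SM 9.x ("9.0", "9.0a"); Blackwell reports SM 10.x / 12.x,
    # so a ">= 9" comparison would wrongly include it. Hypothetical helper.
    major = arch.split(".", 1)[0]
    return major.isdigit() and int(major) == 9
```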
hmm, this `_has_sm90_or_higher` is a bit messy; sorry, I should have reviewed this more closely.
We should either:
1. if not `arch_list`, use `torch.cuda.get_arch_list()`, or
2. handle this at the CMake level like FlashMLA and just create empty targets when no kernels would get built (vllm/cmake/external_projects/flashmla.cmake, lines 128 to 132 in 60e089f):

```cmake
else()
  # Create empty targets for setup.py when not targeting sm90a systems
  add_custom_target(_flashmla_C)
  add_custom_target(_flashmla_extension_C)
endif()
```

Preference for option 2, in which case we wouldn't need to filter for Blackwell here.
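For reference, option 1 might look roughly like this in setup.py; the helper name and exact fallback behavior here are assumptions:

```python
import os

import torch


def _cuda_arch_list() -> list[str]:
    # Hypothetical sketch of option 1: honor TORCH_CUDA_ARCH_LIST when set,
    # otherwise fall back to the arch list the installed torch build supports.
    env = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
    if env:
        return env.replace(";", " ").split()
    return torch.cuda.get_arch_list()  # e.g. ["sm_80", "sm_90"]
```

The reviewer's stated preference, though, is option 2, which keeps the arch filtering in CMake so setup.py never has to special-case architectures.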
Understood. I'm not a fan of the current approach either; I'll change it to follow FlashMLA's example later today.
thank you! 🙏
Purpose
FA3 is only used for Hopper, yet it's built for every Ampere (sm_80, sm_86, sm_87) and Ada (sm_89) arch. This needlessly increases build times and inflates the wheel size. FA3 technically supports A100 (sm_80), but as discussed offline with @mgoin, it's not stable and vLLM does not actually enable it.
vLLM FA PR: vllm-project/flash-attention#105