Fix cuda_archs_loose_intersection when handling sm_*a #20207
base: main
Conversation
Signed-off-by: Huy Do <huydhn@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: add 🚀 …
Summary of Changes
Hello @huydhn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical bug in the CMake build configuration that prevented vLLM from correctly compiling kernels for multiple sm_*a CUDA architectures simultaneously. The fix ensures that all specified sm_*a architectures are properly recognized and included in the build process, thereby resolving runtime errors related to missing kernel images on devices like the H100. Additionally, it improves the verbosity and helpfulness of build-time messages for the moe_data component.
Highlights
- Core Bug Fix: cuda_archs_loose_intersection for sm_*a Architectures: The primary issue where sm_*a CUDA architectures (e.g., 9.0a for Hopper, 10.0a for Blackwell) were being incorrectly handled in the cuda_archs_loose_intersection CMake function has been resolved. Previously, the logic would overwrite the list of target architectures, so only one sm_*a variant was processed. The updated code now correctly iterates through and appends all relevant sm_*a architectures, ensuring proper kernel compilation for multi-architecture setups and resolving "CUDA error: no kernel image is available for execution on the device" errors; a simplified sketch of this loop follows after this list.
- Enhanced Build Feedback for moe_data: The build system now provides clearer status messages about the compilation of moe_data. A new message explicitly states which CUDA architectures moe_data is being built for, and improved messages are displayed when moe_data is not built, including a specific recommendation to upgrade the CUDA compiler to version 12.3 or later for optimal support of FP8-quantized MoE models on Hopper or Blackwell.
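For readers skimming the change, here is a minimal sketch of the loop-based handling described above. It is simplified relative to the actual vLLM cmake/utils.cmake and only reuses the variable names visible in the diff excerpt further down (_SRC_CUDA_ARCHS, _TGT_CUDA_ARCHS, _CUDA_ARCHS); the regex and list bookkeeping are illustrative, not the exact patch.

# Sketch only: for every "a" variant in the source list (e.g. 9.0a, 10.0a, 12.0a),
# keep it if its base arch is in the target list, appending rather than overwriting.
foreach(_arch ${_SRC_CUDA_ARCHS})
  if(_arch MATCHES "a$")
    string(REGEX REPLACE "a$" "" _base "${_arch}")    # "9.0a" -> "9.0"
    if("${_base}" IN_LIST _TGT_CUDA_ARCHS)
      list(REMOVE_ITEM _TGT_CUDA_ARCHS "${_base}")    # drop the plain base arch
      list(APPEND _CUDA_ARCHS "${_arch}")             # accumulate, do not set()
    endif()
  endif()
endforeach()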
Code Review
The code changes address a bug in the cuda_archs_loose_intersection function that caused incorrect CUDA architecture selection. The fix generalizes the architecture-matching logic and adds more informative build messages.
Signed-off-by: Huy Do <huydhn@gmail.com>
list(REMOVE_ITEM _TGT_CUDA_ARCHS "10.0")
set(_CUDA_ARCHS "10.0a")
foreach(_arch ${_SRC_CUDA_ARCHS})
  if(_arch MATCHES "\\a$")
shall we explicitly handle 9.0a or 10.0a? If we have something new like 11.0a, do we expect this logic here to handle it?
The cmake output looks correct when I throw in 12.0a, printing -- Building moe_data for archs: 9.0a;10.0a;12.0a. I couldn't find a reason why 9.0a and 10.0a need to be handled explicitly here, and similar cmake loop logic has already been implemented to handle the +PTX suffix. So, having a loop to handle all *a cases looks like the right choice. Let's wait for more context from folks, I guess.
I think the loop is the right solution 👍 This function was originally written when the only "a" variant was 9.0a, and we were naively hoping that the lack of forward compatibility was a temporary, one-time Hopper thing; looks like this is the direction Nvidia is going though 😞 so we should just embrace it haha. I think the 10.0a handling was just kinda tacked on, so good catch!
Looks good. Okay, after reading the code again, I think it makes sense to use a loop with a regex. Is it because the old code called set(_CUDA_ARCHS "10.0a") instead of list(APPEND _CUDA_ARCHS "10.0a")?
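(To make the distinction concrete, here is a tiny standalone snippet, runnable with cmake -P, that shows why consecutive set() calls lose the earlier arch while list(APPEND) keeps both; the variable name just mirrors the one in the function.)

set(_CUDA_ARCHS "9.0a")
set(_CUDA_ARCHS "10.0a")              # overwrites: _CUDA_ARCHS is now only "10.0a"
message(STATUS "with set():  ${_CUDA_ARCHS}")

unset(_CUDA_ARCHS)
list(APPEND _CUDA_ARCHS "9.0a")
list(APPEND _CUDA_ARCHS "10.0a")      # accumulates: _CUDA_ARCHS is "9.0a;10.0a"
message(STATUS "with APPEND: ${_CUDA_ARCHS}")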
LGTM; good catch! Thanks for the fix!
Side note (FYI @mgoin): we should probably restrict the use of … I'll open a PR.
After #20086 landed, I started to see the CUDA error: no kernel image is available for execution on the device error again for meta-llama/llama-4-maverick-17b-128e-instruct-fp8. The error surfaces on the periodic H100 benchmark run I set up at https://github.com/pytorch/pytorch-integration-testing/actions/runs/15941259223/job/44969615173#step:14:1508
It turns out that there is another bug in the cuda_archs_loose_intersection function, in which it wrongly calls set(_CUDA_ARCHS "*a") consecutively for 9.0a and 10.0a, effectively overwriting the former with the latter. This issue only happens for sm_*a when there is more than one of them in TORCH_CUDA_ARCH_LIST, e.g. TORCH_CUDA_ARCH_LIST='7.5;8.0;8.6;9.0a;10.0a'. I think this finally explains why this kernel works for either Hopper or Blackwell, but not both.
Testing
Building vLLM locally finally shows the correct CUDA_ARCH being selected. Previously, it was either 10.0a, or 9.0a if I excluded Blackwell from the list.
The CI build log also looks correct: https://buildkite.com/vllm/fastcheck/builds/28660#0197b5d8-a37f-4e28-9953-80f63a89a0de/127-5344
cc @mgoin
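(Not part of the PR, but a rough idea of how the fix could be smoke-tested from a throwaway cmake -P script. It assumes cuda_archs_loose_intersection takes an output variable followed by the source and target arch lists, as in cmake/utils.cmake; the include path and argument order are assumptions, so adjust them to the actual checkout.)

# Hypothetical smoke test, not from the PR itself.
include(cmake/utils.cmake)                       # assumed location of the function

set(_src "9.0a;10.0a")                           # arches the kernel is written for
set(_tgt "7.5;8.0;8.6;9.0;10.0")                 # build list derived from TORCH_CUDA_ARCH_LIST
cuda_archs_loose_intersection(_out "${_src}" "${_tgt}")

# Before the fix this reportedly kept only one "a" arch; with it, both should survive.
message(STATUS "loose intersection: ${_out}")    # expected: 9.0a;10.0a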