Fix cuda_archs_loose_intersection when handling sm_*a #20207


Open · wants to merge 2 commits into main
Conversation

@huydhn (Contributor) commented Jun 28, 2025

After #20086 landed, I started to see the CUDA error: no kernel image is available for execution on the device error again for meta-llama/llama-4-maverick-17b-128e-instruct-fp8. The error surfaces on the periodic H100 benchmark run I set up at https://github.com/pytorch/pytorch-integration-testing/actions/runs/15941259223/job/44969615173#step:14:1508

It turns out that there is another bug in the cuda_archs_loose_intersection function: it wrongly calls set(_CUDA_ARCHS "*a") consecutively for 9.0a and 10.0a, effectively overwriting the former with the latter. The issue only occurs for sm_*a architectures when more than one of them appears in TORCH_CUDA_ARCH_LIST, e.g. TORCH_CUDA_ARCH_LIST='7.5;8.0;8.6;9.0a;10.0a'. I think this finally explains why this kernel works for either Hopper or Blackwell, but not both.
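For illustration, a minimal sketch of the pattern (not the exact vLLM code; variable names follow the function's conventions, and the surrounding bookkeeping is simplified):

```cmake
# Before (simplified): each sm_*a arch was special-cased, and set() overwrites,
# so whichever "a" arch was checked last won.
#   if ("9.0a" IN_LIST _SRC_CUDA_ARCHS AND "9.0" IN_LIST _TGT_CUDA_ARCHS)
#     list(REMOVE_ITEM _TGT_CUDA_ARCHS "9.0")
#     set(_CUDA_ARCHS "9.0a")
#   endif()
#   if ("10.0a" IN_LIST _SRC_CUDA_ARCHS AND "10.0" IN_LIST _TGT_CUDA_ARCHS)
#     list(REMOVE_ITEM _TGT_CUDA_ARCHS "10.0")
#     set(_CUDA_ARCHS "10.0a")   # clobbers the "9.0a" selected above
#   endif()

# After (simplified): loop over every arch ending in "a" and append instead.
set(_CUDA_ARCHS)
foreach(_arch ${_SRC_CUDA_ARCHS})
  if(_arch MATCHES "\\a$")
    string(REPLACE "a" "" _base "${_arch}")
    if ("${_base}" IN_LIST _TGT_CUDA_ARCHS)
      list(REMOVE_ITEM _TGT_CUDA_ARCHS "${_base}")
      list(APPEND _CUDA_ARCHS "${_arch}")
    endif()
  endif()
endforeach()
```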

Testing

Building vLLM locally now shows the correct CUDA archs being selected. Previously, it was only 10.0a, or only 9.0a if I excluded Blackwell from the list.

```
TORCH_CUDA_ARCH_LIST='7.5;8.0;8.6;9.0a;10.0a' VLLM_FA_CMAKE_GPU_ARCHES='80-real;90-real' python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38

...
-- Building moe_data for archs: 9.0a;10.0a  <-- shows both
...
```

The CI build log also looks correct: https://buildkite.com/vllm/fastcheck/builds/28660#0197b5d8-a37f-4e28-9953-80f63a89a0de/127-5344

cc @mgoin

Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn requested a review from tlrmchlsmth as a code owner, June 28, 2025 09:18

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run further CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @huydhn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the CMake build configuration that prevented vLLM from correctly compiling kernels for multiple sm_*a CUDA architectures simultaneously. The fix ensures that all specified sm_*a architectures are properly recognized and included in the build process, thereby resolving runtime errors related to missing kernel images on devices like H100. Additionally, it improves the verbosity and helpfulness of build-time messages for the moe_data component.

Highlights

  • Core Bug Fix: cuda_archs_loose_intersection for sm_*a Architectures: The primary issue where sm_*a CUDA architectures (e.g., 9.0a for Hopper, 10.0a for Blackwell) were being incorrectly handled in the cuda_archs_loose_intersection CMake function has been resolved. Previously, the logic would overwrite the list of target architectures, leading to only one sm_*a variant being processed. The updated code now correctly iterates through and appends all relevant sm_*a architectures, ensuring proper kernel compilation for multi-architecture setups and resolving CUDA error: no kernel image is available for execution on the device errors.
  • Enhanced Build Feedback for moe_data: The build system now provides clearer status messages regarding the compilation of moe_data. A new message explicitly states which CUDA architectures moe_data is being built for. Additionally, improved messages are displayed when moe_data is not built, including a specific recommendation to upgrade the CUDA compiler to version 12.3 or later for optimal support of FP8 quantized MoE models on Hopper or Blackwell.
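A rough sketch of what that feedback amounts to (the variable name MOE_DATA_ARCHS and the exact wording are hypothetical; the "Building moe_data for archs" text matches the build output quoted in the PR description):

```cmake
if(MOE_DATA_ARCHS)
  message(STATUS "Building moe_data for archs: ${MOE_DATA_ARCHS}")
else()
  # Hypothetical wording; the PR's message recommends CUDA 12.3+ for
  # FP8 quantized MoE models on Hopper or Blackwell.
  message(STATUS "Not building moe_data: upgrade to CUDA 12.3 or later "
                 "for FP8 quantized MoE support on Hopper or Blackwell.")
endif()
```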

@mergify mergify bot added the ci/build label Jun 28, 2025
@gemini-code-assist (bot) left a comment:

Code Review

The code changes address a bug in the cuda_archs_loose_intersection function that caused incorrect CUDA architecture selection. The fix involves generalizing the architecture matching logic and adding more informative build messages.

Review thread on the changed CMake code (diff excerpt):

```cmake
list(REMOVE_ITEM _TGT_CUDA_ARCHS "10.0")
set(_CUDA_ARCHS "10.0a")
foreach(_arch ${_SRC_CUDA_ARCHS})
  if(_arch MATCHES "\\a$")
```
A collaborator commented:

Shall we explicitly handle 9.0a and 10.0a? If we get something new like 11.0a, do we expect this logic to handle it here?

@huydhn (Contributor, author) replied Jun 28, 2025:

The CMake output looks correct when I throw in 12.0a, printing -- Building moe_data for archs: 9.0a;10.0a;12.0a. I couldn't find a reason why 9.0a and 10.0a need to be handled explicitly here, and similar CMake loop logic is already in place to handle the +PTX suffix. So a loop that handles all *a cases looks like the right choice. Let's wait for more context from folks, I guess.

@LucasWilkinson (Collaborator) replied Jun 28, 2025:

I think the loop is the right solution 👍 This function was originally written when the only "a" variant was 9.0a, and we were naively hopeful that the lack of forward compatibility was a temporary, one-time Hopper thing; it looks like this is the direction NVIDIA is going, though 😞 so we should just embrace it haha. I think the 10.0a was just kinda tacked on, so good catch!

@mgoin (Member) commented Jun 28, 2025

Cc @LucasWilkinson @tlrmchlsmth

@houseroad (Collaborator) left a comment:

Looks good. Okay, after reading the code again, I think it makes sense to use a loop with a regex.

Is it because the old code called set(_CUDA_ARCHS "10.0a") instead of list(APPEND _CUDA_ARCHS "10.0a")?
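For reference, the difference the question points at, in isolation (hypothetical variable name):

```cmake
set(_ARCHS "9.0a")
set(_ARCHS "10.0a")          # set() overwrites: _ARCHS is now "10.0a"

set(_ARCHS "9.0a")
list(APPEND _ARCHS "10.0a")  # append keeps both: _ARCHS is "9.0a;10.0a"
```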

@houseroad added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 28, 2025
@LucasWilkinson (Collaborator) left a comment:

LGTM; good catch! Thanks for the fix!

@LucasWilkinson (Collaborator) commented Jun 28, 2025

Side note (FYI @mgoin): we should probably restrict the use of a variants to only when necessary (i.e. when we use one of the a-specific features/instructions). E.g. I don't see anything in moe_data that would need a (which adds wgmma); we should probably just build it for 9.0 and 10.0 (NVIDIA has made their compute capability stuff so confusing now 😞).

I'll open a PR.
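For context on the side note above, a hedged sketch of the two options in CMake terms (target name hypothetical):

```cmake
# 90a/100a unlock arch-specific instructions (e.g. wgmma) but produce cubins
# that are not forward compatible; plain 90/100 binaries stay portable.
set_property(TARGET _moe_data PROPERTY CUDA_ARCHITECTURES 90a 100a)  # arch-specific
set_property(TARGET _moe_data PROPERTY CUDA_ARCHITECTURES 90 100)    # portable
```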
