
buildkite release pipeline: add torch_cuda_arch_list including 12.0 to the Docker "Build release image" build args in order to enable Blackwell SM120 support #19747


Closed
cyril23 wants to merge 2 commits into vllm-project:main from cyril23:sm120-docker-only

Conversation

@cyril23 commented on Jun 17, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Purpose

We need Nvidia Blackwell SM120 support. There are a lot of issues regarding it.

  1. Some tried to use the vLLM PyPI wheels - which we unfortunately can't update to SM120 at the moment, since we're restricted by PyPI's 400 MB limit: even building for SM120 only results in a 548 MB wheel, and building for all arches results in a 922 MB wheel, which even exceeds the proposed new 800 MB limit (see here)
  2. Some want to start vLLM via Docker - this is what this pull request solves:

There are two alternative solutions for number 2 (providing an SM120-capable Docker image), see #13306 (comment):

So in order to get SM120 Blackwell support at least in the Docker AWS image, the workaround would be to:

  1. either add --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0+PTX' to "Build wheel - CUDA 12.8" (to keep the wheel size small) and add 12.0 to the Dockerfile to make it the new default
  2. or add --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX' to the Docker "Build release image" args, which only affects the Docker image being pushed to AWS and not the wheels

This is the PR for alternative 2 since I didn't want to touch the defaults.

Besides adding --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX', I of course also needed to add --build-arg RUN_WHEEL_CHECK=false. This shouldn't be a problem since AWS does not have the same size restrictions as PyPI's wheel size limits. A sketch of the resulting build invocation is shown below.
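
For illustration only, the docker build command in the "Build release image" step of .buildkite/release-pipeline.yaml would gain roughly these two arguments (a sketch, not the exact pipeline line; all other arguments of that step are unchanged and omitted here):

# Sketch: the two build args added to the "Build release image" docker build
# command; everything else in that Buildkite step stays as it is today.
docker build \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX' \
  --build-arg RUN_WHEEL_CHECK=false \
  --target vllm-openai \
  -f docker/Dockerfile .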

Test Plan

  1. Make a test build, very similar to the line I've touched in this pull request:
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=96 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:dev \
  --target vllm-openai \
  --progress plain \
  -f docker/Dockerfile .
  2. Push this image to Docker Hub under the following image name and tag: wurstdeploy/vllm:dev
  3. Run this image using 1x RTX 6000 Pro, and in another test using 1x RTX 4090 (a sketch of the run command follows below)
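
For reference, the run step for both test GPUs looked roughly like this (a sketch; the port mapping and Hugging Face cache mount are assumptions based on the usual vLLM Docker invocation, the model is the one named in the test result):

# Sketch of the test run; port mapping and cache mount are assumptions
sudo docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 wurstdeploy/vllm:dev \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0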

Test Result

  1. I've built it successfully using an Azure Standard E96s v6 (96 vCPUs, 768 GiB memory) machine. The build took ~1 hour; peak system memory usage, including the OS, was 181125 MiB RAM. This build is based on the following commit:
# After cloning the repo via git clone https://github.com/vllm-project/vllm
azureuser@buildtest:~/vllm$ git log -1
commit ccd7c050898cfea4b6b0de16446dcc47fa02a399 (HEAD -> main, origin/main, origin/HEAD)
Author: jvlunteren <161835099+jvlunteren@users.noreply.github.com>
Date:   Tue Jun 17 12:45:07 2025 +0200

    [Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19152)

    Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>

  2. The image is now available on Docker Hub as wurstdeploy/vllm:dev. I pushed it as follows:
sudo docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:dev wurstdeploy/vllm:dev
sudo docker login
sudo docker push wurstdeploy/vllm:dev
  3. Both tests ran successfully, using TinyLlama/TinyLlama-1.1B-Chat-v1.0 and a test prompt. Screenshots: [1], [2], [3]. Edit: it only works with the Docker environment variable -e VLLM_USE_FLASHINFER_SAMPLER=0 being set, otherwise I get a kernel error. Investigation follows in the comments; a sketch of the working invocation is shown below.
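
A sketch of the run command with the sampler workaround applied (same invocation as in the test plan, with the environment variable added; port and cache mount remain assumptions):

# Sketch: running the test image with the FlashInfer sampler disabled
sudo docker run --runtime nvidia --gpus all \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 wurstdeploy/vllm:dev \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0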


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the ci/build label on Jun 17, 2025
@gemini-code-assist bot left a comment


Summary of Changes

Hello @cyril23, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the project's Docker image build process to support newer Nvidia GPU architectures, specifically Blackwell SM120. By modifying the Buildkite release pipeline, I've ensured that the Docker images are compiled with the necessary CUDA architecture flags, addressing a critical compatibility gap for users leveraging the project's Docker distributions.

Highlights

  • Buildkite Pipeline Update: I've updated the .buildkite/release-pipeline.yaml to modify the docker build command for the vllm-openai target. This change specifically adds the --build-arg torch_cuda_arch_list with an extended list of CUDA architectures, including 12.0+PTX.
  • Nvidia Blackwell SM120 Support: The primary purpose of adding 12.0+PTX to the torch_cuda_arch_list is to enable support for Nvidia Blackwell SM120 GPUs within the Docker images built by this pipeline. This addresses the need for newer hardware compatibility, particularly for users deploying via Docker.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize Gemini Code Assist for your GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@cyril23 changed the title on Jun 17, 2025, from "buildkite release pipeline: add torch_cuda_arch_list including 12.0 t…" to "buildkite release pipeline: add torch_cuda_arch_list including 12.0 to the Docker "Build release image" build args in order to enable Blackwell SM120 support"
@gemini-code-assist bot left a comment


Code Review

The pull request adds support for Nvidia Blackwell SM120 by including the 12.0 architecture in the torch_cuda_arch_list build argument for the Docker image build process. This change addresses issues related to using vLLM with Docker and provides a workaround for the PyPI wheel size limit. A suggestion was made to also include RUN_WHEEL_CHECK=false.

@cyril23 (Author) commented on Jun 17, 2025

/gemini review

@gemini-code-assist bot left a comment


Code Review

This PR updates the Buildkite release pipeline to include support for Nvidia Blackwell GPUs by modifying the torch_cuda_arch_list build argument and disabling the wheel size check. The primary feedback concerns the specific values used in torch_cuda_arch_list, suggesting a more standard list for specifying CUDA compute capabilities.

@cyril23 (Author) commented on Jun 17, 2025

/gemini summary

@gemini-code-assist bot commented

Summary of Changes

This pull request focuses on enabling NVIDIA Blackwell SM120 GPU support for vLLM when deployed via Docker images. It achieves this by updating the Buildkite release pipeline to pass specific build arguments to the Docker image creation process, including an expanded list of supported CUDA architectures and disabling wheel size checks. This strategy provides Blackwell compatibility for Docker users, circumventing the size constraints faced by PyPI wheel distributions.

Highlights

  • Blackwell SM120 Support for Docker Images: This pull request introduces support for NVIDIA Blackwell SM120 GPUs specifically for vLLM Docker release images. This addresses the challenge of providing Blackwell support without exceeding PyPI wheel size limits.
  • Buildkite Pipeline Configuration Update: The .buildkite/release-pipeline.yaml file has been modified to update the docker build command for the 'Build release image' step. This update includes new build arguments to enable broader GPU architecture support.
  • Expanded CUDA Architecture List: The torch_cuda_arch_list build argument has been extended to include a comprehensive set of compute capabilities: '7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX'. This ensures compatibility with a wider range of NVIDIA GPUs, including the latest Blackwell architecture (CC 12.0).
  • Disabled Wheel Size Check for Docker Builds: The --build-arg RUN_WHEEL_CHECK=false argument has been added to the Docker build command. This bypasses the wheel size validation, which is necessary because AWS ECR (where the Docker images are pushed) does not have the same size restrictions as PyPI.
Changelog
  • .buildkite/release-pipeline.yaml
    • Modified the docker build command on line 55 to include --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 10.1 12.0+PTX' to enable support for a wider range of NVIDIA GPU architectures, including Blackwell SM120.
    • Added --build-arg RUN_WHEEL_CHECK=false to the docker build command on line 55 to skip the wheel size validation during the Docker image build process.
Activity
  • github-actions[bot] provided an initial welcome message and guidance on triggering full CI runs.
  • cyril23 initiated /gemini review and /gemini summary commands.
  • gemini-code-assist[bot] provided a review comment suggesting the addition of RUN_WHEEL_CHECK=false for consistency, which cyril23 confirmed was already addressed or would be.
  • gemini-code-assist[bot] raised a high-priority concern regarding the torch_cuda_arch_list values, specifically questioning the inclusion of 10.1 and 12.0.
  • cyril23 responded to the torch_cuda_arch_list concern, providing detailed justification and references to NVIDIA documentation and the project's CMakeLists.txt to validate the specified compute capabilities.

@cyril23 (Author) commented on Jun 17, 2025

I've tried running my Docker image wurstdeploy/vllm:dev on my desktop PC (RTX 5090) via

sudo docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000  wurstdeploy/vllm:dev \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

but got serious problems:

/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
...
RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device

full output: problem.txt

I'll try building it again on my PC now, using

DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0+PTX' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:solo120 --target vllm-openai \
  --progress plain -f docker/Dockerfile .

and see how it works.
Edit: it did not work either, same errors. I'll build with torch_cuda_arch_list='12.0' now (without PTX) to see if that makes a difference. If that still doesn't work, I'll add 12.1.

@cyril23 (Author) commented on Jun 18, 2025

I've tried running my Docker image wurstdeploy/vllm:dev on my desktop PC (RTX 5090) via
but got serious problems

  • this is just because some recent commit newer than v0.9.1 broke some SM120 things; it has nothing to do with my changes in this pull request.
  • doing a "12.0" build of the current vLLM results in that error. That's why wurstdeploy/vllm:dev (which includes all arches proposed in this pull request) and wurstdeploy/vllm:azurecurrentsolo120 (which only contains "12.0") only work on SM120 with -e VLLM_USE_FLASHINFER_SAMPLER=0 being set.
git clone https://github.com/vllm-project/vllm/
cd vllm/
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=64 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:azurecurrentsolo120 --target vllm-openai \
  --progress plain -f docker/Dockerfile .
  • building "12.0" build for either vLLM v0.9.1 or v0.9.0 work however. That's why wurstdeploy/vllm:azurev090solo120 and wurstdeploy/vllm:azurev091solo120 work on RTX 5090 without a problem i.e. -e VLLM_USE_FLASHINFER_SAMPLER=0 and -e VLLM_USE_FLASHINFER_SAMPLER=1 both work.
# how to build wurstdeploy/vllm:azurev090solo120
git clone https://github.com/vllm-project/vllm vllm-v090
cd vllm-v090/
git checkout -b mybranch090 tags/v0.9.0
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=64 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:azurev090solo120 --target vllm-openai \
  --progress plain -f docker/Dockerfile .

# how to build wurstdeploy/vllm:azurev091solo120
git clone https://github.com/vllm-project/vllm vllm-v091
cd vllm-v091/
git checkout -b mybranch091 tags/v0.9.1
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=64 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:azurev091solo120 --target vllm-openai \
  --progress plain -f docker/Dockerfile .
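
For completeness, a sketch of the verification run on the RTX 5090 with the FlashInfer sampler left enabled (the image tag is from above; the model, port mapping and cache mount are assumptions carried over from the earlier tests):

# Sketch: verifying the v0.9.1-based SM120-only image with the sampler enabled
sudo docker run --runtime nvidia --gpus all \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 wurstdeploy/vllm:azurev091solo120 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0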

I'm now trying to identify which of those recent commits broke SM120 compatibility.

@cyril23 (Author) commented on Jun 18, 2025

I'm now trying to identify which of those recent commits broke SM120 compatibility.

  • the last commit of 10th June (da9b523) still works with -e VLLM_USE_FLASHINFER_SAMPLER=1
  • the last commit of 11th June (42f52cc) does NOT work anymore with -e VLLM_USE_FLASHINFER_SAMPLER=1, since there are no kernels available for SM120 (each candidate commit was tested with a build like the sketch below)
  • I guess the problematic commit is 497a91e, since it is the only FlashInfer-related commit that day, and updating its version may very well lead to such problems. Related pull request: [CI] Update FlashInfer to 0.2.6.post1 #19297
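
For reference, a sketch of how each candidate commit was tested (same build setup as above; the tag name wurstdeploy/vllm:bisect is purely illustrative):

# Sketch: per-commit test build used to narrow down the breaking change
git clone https://github.com/vllm-project/vllm && cd vllm
git checkout da9b523   # or 42f52cc, etc.
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=64 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:bisect --target vllm-openai \
  --progress plain -f docker/Dockerfile .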

Edit: I've opened an issue about it here: #19810

@cyril23 (Author) commented on Jun 18, 2025

Edit: I've opened an issue about it here: #19810

Alternative solutions, proposed in #19810 (comment):

  • either revert 497a91e
  • or just add 12.0 to the FlashInfer torch_cuda_arch_list - the preferred solution, since it keeps the newer FlashInfer and just adds SM120 compatibility. I can close this PR here if [CI] Add SM120 to the Dockerfile #19794 gets merged, because it contains this solution and it finally updates the Dockerfile's default torch_cuda_arch_list parameter, and thus adds SM120 support to both the Docker release image and the PyPI CUDA 12.8 wheels. Currently confirming the solution in [CI] Add SM120 to the Dockerfile #19794 (comment).

@mgoin (Member) commented on Jun 19, 2025

Thank you for aggregating all of these issues and information, @cyril23, it is really great. I think we would like to add SM 12.0 in the base Dockerfile and wheel (in my mentioned PR) to solve most issues now, so we can hopefully close them!

@cyril23 (Author) commented on Jun 19, 2025

I think we would like to add SM 12.0 in the base dockerfile and wheel (in my mentioned PR) to solve most issues now, so we can hopefully close them!

Absolutely right, I'll close this PR here.

@cyril23 closed this on Jun 19, 2025
@cyril23 deleted the sm120-docker-only branch on June 19, 2025 at 09:32