
[CI] Add SM120 to the Dockerfile #19794


Merged
merged 7 commits into vllm-project:main from neuralmagic:add-sm120-dockerfile
Jun 25, 2025

Conversation

mgoin
Member

@mgoin mgoin commented Jun 18, 2025

Now that #19336 has landed, maybe we can add SM 12.0 without going over the 400MB wheel limit

EDIT: The wheel is 365MB!

Signed-off-by: mgoin <mgoin64@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @mgoin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to expand the Docker image's compatibility with newer NVIDIA GPU architectures by adding support for SM 12.0. This change is crucial for leveraging the latest hardware capabilities within the Dockerized environment, building upon a previous effort to manage wheel size limits.

Highlights

  • CUDA Architecture Support: I've updated the torch_cuda_arch_list build argument in the Dockerfile to include 12.0. This enables PyTorch builds within the Docker image to support NVIDIA GPUs with SM (Streaming Multiprocessor) capability 12.0, which corresponds to newer GPU architectures.
  • FlashInfer Compilation: I've also extended the TORCH_CUDA_ARCH_LIST environment variable to include 12.0 specifically for the flashinfer library's compilation path. This ensures that flashinfer is built with support for SM 12.0 when not using pre-built wheels for CUDA 12.8.
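
For orientation, the change summarized above amounts to something like the following (a sketch only; the exact Dockerfile lines and surrounding context are not reproduced here):

# Architectures the vLLM wheel is compiled for; adding 12.0 covers SM 12.0
# (consumer Blackwell, e.g. RTX 50xx) alongside the existing targets.
ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}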

@mergify mergify bot added the ci/build label Jun 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR updates the Dockerfile to include CUDA Compute Capability SM 12.0 in the TORCH_CUDA_ARCH_LIST. This affects the main vLLM wheel build process and the compilation of flashinfer from source for non-default CUDA versions (i.e., when CUDA_VERSION is not 12.8.*).

The changes themselves appear correct for enabling SM 12.0 in these compilation steps.

My main feedback point, detailed in a specific comment, is a request for clarification regarding flashinfer's SM 12.0 support when using the default CUDA_VERSION=12.8.1. For this version, a pre-built flashinfer wheel is used, and it's important to understand if this pre-built wheel also includes SM 12.0 capabilities.

Additionally, for documentation consistency within the Dockerfile, the example TORCH_CUDA_ARCH_LIST for flashinfer in the comment block on lines 246-255 (specifically line 247, which currently shows '7.5 8.0 8.9 9.0a 10.0a') could be updated to include 12.0. This would help future maintainers by reflecting the architectures now typically compiled for flashinfer due to this PR's changes. Since this comment block is outside the diff, this is a suggestion for general consideration.

@houseroad
Collaborator

What's the new wheel size? :-)

@mgoin
Member Author

mgoin commented Jun 18, 2025

The wheel is 365MB!

@cyril23

cyril23 commented Jun 18, 2025

The wheel is 365MB!

Sounds awesome! I'll try to confirm. Currently building the whole thing on my desktop, it'll take a while:

~/vllm$ git status
On branch neuralmagic-add-sm120-dockerfile
Your branch is up to date with 'neuralmagic/add-sm120-dockerfile'.
nothing to commit, working tree clean

~/vllm$ git log -1
commit f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 (HEAD -> neuralmagic-add-sm120-dockerfile, neuralmagic/add-sm120-dockerfile)
Author: Michael Goin <mgoin64@gmail.com>
Date:   Thu Jun 19 01:21:08 2025 +0900

    Update Dockerfile

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

# edit:   --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' is not needed anymore of course
# then extract the wheel from the build stage, check size, and build image via target vllm-openai
  • Confirm the new wheel size of 365MB. edit: ❌ nope, the new wheel size is 832.61 MiB when building for the new default arch list (same as --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'), see this comment. edit:
    ✅ confirmed after all, see [CI] Add SM120 to the Dockerfile #19794 (comment)

  • ✅ Confirm SM 120 compatibility (for FlashInfer, too)
    edit: probably needs huydhn's rebuilt wheel for the new arch list. edit: yes, otherwise I get the error

    RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device

    edit: tested on RTX 5090, it works now with the new flashinfer wheel

@cyril23

cyril23 commented Jun 19, 2025

The wheel is 365MB!

Do you mean for SM 120 (torch_cuda_arch_list='12.0') only? What have you tested exactly?

  • I am sorry but the wheel size for ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' (your new default) is 832.61 MiB which is still too big. I've built it based on your branch, with default settings, see [CI] Add SM120 to the Dockerfile #19794 (comment)
  • Output
#23 DONE 23847.4s

#24 [build 7/8] COPY .buildkite/check-wheel-size.py check-wheel-size.py
#24 DONE 0.0s

#25 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#25 0.274 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (832.61 MB) than the limit (400 MB).
#25 0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
#25 0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
#25 0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#25 0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
#25 0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
#25 0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#25 0.274 vllm/config.py: 0.20 MBs uncompressed.
#25 0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#25 0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#25 0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#25 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
------
 > [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi:
0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
0.274 vllm/config.py: 0.20 MBs uncompressed.
0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
------
Dockerfile:155
--------------------
 154 |     ARG RUN_WHEEL_CHECK=true
 155 | >>> RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
 156 | >>>         python3 check-wheel-size.py dist; \
 157 | >>>     else \
 158 | >>>         echo "Skipping wheel size check."; \
 159 | >>>     fi
 160 |     #################### EXTENSION Build IMAGE ####################
--------------------
ERROR: failed to solve: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
  • Running the command again, but this time with the wheel-size check disabled so I can extract the wheel and finish the build image:
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .
sudo docker create --name temp-wheel-container wurstdeploy/vllm:wheel-stage
sudo docker cp temp-wheel-container:/workspace/dist ./extracted-wheels
sudo docker rm temp-wheel-container
ls -la extracted-wheels/
# output:
total 852604
drwxr-xr-x  2 root     root          4096 Jun 19 08:35 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 19 08:50 ..
-rw-r--r--  1 root     root     873053002 Jun 19 08:36 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# That's 873 MB or 832.61 MiB

Unfortunately we still can't update the Dockerfile defaults to include SM120 without touching anything else, because the new arch list would also apply to the CUDA 12.8 wheel built here, and PyPI's current limit of 400 MB is too low (even increasing it to 800 MB would not be enough).
How could we solve this problem?

  1. Either we keep your changes to the main Dockerfile as in this PR but build for specific architectures within the Build wheel - CUDA 12.8 step here:
    1.1 Either by adding --build-arg torch_cuda_arch_list='12.0' (I haven't confirmed your 365MB yet when building 12.0 only) to make an SM120-only build, incompatible with all older architectures like SM 100 Blackwell and older.
    1.2 Or by adding the old default --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' (with or without PTX, it does not matter): the CUDA 12.8 wheel would still be incompatible with SM 120 Blackwell but would work for SM 100 Blackwell and all older generations, just like the current wheel (see the sketch at the end of this comment).
  2. Or we do not update the main Dockerfile but explicitly add something like --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0' --build-arg RUN_WHEEL_CHECK=false to the Docker Build release image step here, which was the idea of my PR buildkite release pipeline: add torch_cuda_arch_list including 12.0 to the Docker "Build release image" build args in order to enable Blackwell SM120 support #19747

I prefer solution 1.2. What do you guys think? @mgoin
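
For clarity, option 1.2 would boil down to something like this for that wheel-build step (a sketch only; the actual CI step passes more arguments than shown here):

# Keep the Dockerfile default at '... 12.0', but pin the CUDA 12.8 wheel build
# to the old arch list so the published wheel stays under the PyPI size limit.
docker build \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' \
  --target build \
  -f docker/Dockerfile .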

@mgoin
Member Author

mgoin commented Jun 19, 2025

Hey @cyril23 thanks for the concern but the "build image" job in CI succeeds. This is the source of truth for wheel size and is now building for '7.0 7.5 8.0 8.9 9.0 10.0 12.0': https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

@cyril23

cyril23 commented Jun 19, 2025

My wheels are bigger because I built with USE_SCCACHE=0 and thus without CMAKE_BUILD_TYPE=Release, which leaves debug symbols etc. in the binaries.

I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

I wish I had built it the wrong way, so we could just merge this PR. I built it as shown here, which got me an 832.61 MiB wheel:

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Now I've just tried building again for SM 120 only:

# on Azure Standard E96s v6 (96 vcpus, 768 GiB memory); actually used Max: 291289 MiB RAM
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=384 \
  --build-arg nvcc_threads=4 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --tag wurstdeploy/vllm:wheel-stage-120only \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Result:

#24 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#24 0.251 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (558.31 MB) than the limit (400 MB).
#24 0.251 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1504.86 MBs uncompressed.
#24 0.251 vllm/_C.abi3.so: 297.77 MBs uncompressed.
#24 0.251 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#24 0.251 vllm/_moe_C.abi3.so: 95.23 MBs uncompressed.
#24 0.251 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#24 0.251 vllm/config.py: 0.20 MBs uncompressed.
#24 0.251 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#24 0.251 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#24 0.251 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#24 0.251 vllm/worker/hpu_model_runner.py: 0.10 MBs uncompressed.
#24 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1

After extracting the wheels:

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 571720
drwxr-xr-x  2 root      root           4096 Jun 19 08:09 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 08:14 ..
-rw-r--r--  1 root      root      585426919 Jun 19 08:10 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# thats 558 MB or 558.31 MiB

I am not sure what https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367 did differently; they build the "test" target.

Anyway as long as it works on buildkite I am happy! Would love to understand the differences though.

edit: this is what buildkite did:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
#!/bin/bash
if [[ -z $(docker manifest inspect public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1) ]]; then
echo "Image not found, proceeding with build..."
else
echo "Image found"
exit 0
fi

docker build --file docker/Dockerfile --build-arg max_jobs=16 --build-arg buildkite_commit=f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 --build-arg USE_SCCACHE=1 --tag public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 --target test --progress plain .
docker push public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1

edit: the differences:

  • I use a different number of max_jobs, which shouldn't affect the wheel size
  • I set USE_SCCACHE=0 instead of 1 as in Buildkite; can this affect the wheel size? YES, thanks Gemini:

The difference in wheel size between your local build and the Buildkite build is most likely due to the USE_SCCACHE build argument and its effect on the build type.

Here's a breakdown of why this is happening:

The Root Cause
In the docker/Dockerfile, the USE_SCCACHE argument controls which build path is taken. When USE_SCCACHE is set to 1 (as it is in the Buildkite CI), the build command also sets CMAKE_BUILD_TYPE=Release:

# docker/Dockerfile

...
RUN --mount=type=bind,source=.git,target=.git \
    if [ "$USE_SCCACHE" = "1" ]; then \
        echo "Installing sccache..." \
...
        && export CMAKE_BUILD_TYPE=Release \
        && sccache --show-stats \
        && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
...
    fi
...

However, when USE_SCCACHE is not 1 (you are setting it to 0), the other build path is taken, and CMAKE_BUILD_TYPE is not set:

# docker/Dockerfile

...
ENV CCACHE_DIR=/root/.cache/ccache
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=.git,target=.git  \
    if [ "$USE_SCCACHE" != "1" ]; then \
        # Clean any existing CMake artifacts
        rm -rf .deps && \
        mkdir -p .deps && \
        python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
    fi
...

When CMAKE_BUILD_TYPE is not explicitly set, the build falls back to a configuration that keeps debugging symbols and is not optimized for size. This is why your locally built wheel is so much larger; the huge size of the .so files in your output is a strong indicator of this.

You didn't accidentally set any debug flags; you accidentally missed setting the release flag!

How to Fix It
To fix this, you have two options:

  1. Set USE_SCCACHE in your build command:

The easiest solution is to mimic the CI environment by setting --build-arg USE_SCCACHE=1 in your docker build command. This will ensure that CMAKE_BUILD_TYPE=Release is set.

DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=1 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .
  2. Modify the Dockerfile:

If you prefer to build without sccache locally, you can modify the Dockerfile to set the CMAKE_BUILD_TYPE for both build paths. This would make local builds more consistent with CI builds, regardless of the USE_SCCACHE setting.

Here is a diff of the proposed change:

--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -141,6 +141,7 @@
     if [ "$USE_SCCACHE" != "1" ]; then \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
+        export CMAKE_BUILD_TYPE=Release && \
         mkdir -p .deps && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

By making one of these changes, you should see your wheel size decrease significantly and fall within the acceptable range.
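
If you want to double-check locally that debug info is what is inflating the extensions, something along these lines works (a sketch; the wheel filename is taken from the log output above and the paths are illustrative):

# Wheels are zip archives: unpack one and inspect the largest extension module.
unzip -q dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl -d /tmp/vllm-wheel
file /tmp/vllm-wheel/vllm/_C.abi3.so                            # "not stripped" hints at debug info
readelf -S /tmp/vllm-wheel/vllm/_C.abi3.so | grep -c '\.debug'  # non-zero => debug sections present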

edit: I'll propose a new ARG CMAKE_BUILD_TYPE=Release build argument in a separate issue to allow for creating a Release type build even without using SCCACHE.

@cyril23

cyril23 commented Jun 19, 2025

The wheel is 365MB!

Now I've verified that using CMAKE_BUILD_TYPE=Release with the default arches indeed results in a 382 MB file (365.10 MiB), i.e. exactly as in the Buildkite run https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 373876
drwxr-xr-x  2 root      root           4096 Jun 19 09:20 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:22 ..
-rw-r--r--  1 root      root      382836018 Jun 19 09:21 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm$

By the way, I've further tested that using CMAKE_BUILD_TYPE=Release for SM 120 only (--build-arg torch_cuda_arch_list='12.0') results in a much smaller 167 MB (159.34 MiB) wheel.

azureuser@building:~/vllm/extracted-wheels$ ls -la
total 163172
drwxr-xr-x  2 root      root           4096 Jun 19 09:00 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:02 ..
-rw-r--r--  1 root      root      167077290 Jun 19 09:01 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm/extracted-wheels$

In order to test it without using SCCACHE I've modified my Dockerfile as follows (I'll make an issue about it):

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 8d4375470..ae866edd0 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -112,6 +112,7 @@ ENV MAX_JOBS=${max_jobs}
 ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads

+ARG CMAKE_BUILD_TYPE=Release
 ARG USE_SCCACHE
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
@@ -129,7 +130,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
         && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
         && export SCCACHE_IDLE_TIMEOUT=0 \
-        && export CMAKE_BUILD_TYPE=Release \
+        && export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} \
         && sccache --show-stats \
         && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
         && sccache --show-stats; \
@@ -143,6 +144,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
         mkdir -p .deps && \
+        export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

So let's merge! 👍

@cyril23

cyril23 commented Jun 20, 2025

With the new FlashInfer wheel, I've tried it out on an RTX 5090 (built just with torch_cuda_arch_list='12.0' and CMAKE_BUILD_TYPE=Release) and inference works without a problem.

edit: by the way, the wheel size is pretty much the same as with the old FlashInfer version (compared to #19794 (comment))

~/vllm$ ls -la extracted-wheels/
total 163192
drwxr-xr-x  2 root     root          4096 Jun 20 10:10 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 20 10:16 ..
-rw-r--r--  1 root     root     167097574 Jun 20 10:10 vllm-0.9.2.dev182+g47c454049.d20250620-cp38-abi3-linux_x86_64.whl

@polarathene

1.2 Or by adding the old default --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' (with or without PTX, does not matter) the CUDA 12.8 wheel would still be incompatible for SM 120 Blackwell but works for SM 100 Blackwell and all older gens. So just like the current wheel.

@cyril23 could you please provide context as to why that would be the case regarding PTX?

If you compile CUDA kernels with PTX, PTX built for an earlier Compute Capability (CC) can be JIT-compiled by newer GPUs at runtime, so they can still use it.

There would be some overhead (at least on first run) as PTX is compiled to cubin at runtime, and not targeting the newer CC of that GPU is less optimal (the perf impact varies), but it should still work.

The only time this doesn't really work out is when the PTX is built with a newer version of CUDA than the runtime uses.
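
A quick way to see which targets a given extension actually embeds, if you want to verify this (a sketch; the path is just an example of where a vLLM extension lands inside the image):

# cubin (SASS) architectures baked into the extension:
cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so | grep -oE '(sm|compute)_[0-9]+a?' | sort -u
# PTX targets embedded alongside them; a newer GPU can JIT-compile any of these:
cuobjdump --list-ptx /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so | grep -oE '(sm|compute)_[0-9]+a?' | sort -u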


The builder image is using nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 (with a CUDA_VERSION ARG of 12.8.1); being as high as it is, that would prevent the PTX from being compatible if your runtime was using CUDA 12.8.0.

Beyond that, since you're also relying on PyTorch, which bundles its own CUDA libraries, each of those libraries has embedded PTX/cubin depending on the CUDA release it ships. If they lack sm_120 cubins, or any of their PTX was built with the CUDA compatibility issue mentioned above, you'd have no valid GPU kernels to load.


You've not mentioned what version of CUDA you're using at runtime, but it's possible that the compatibility issue was related to these caveats I've described.

CUDA 12.8.0 can target sm_120, and the existing CC 10.0 PTX should have been compatible, so the only scenario that comes to mind is CUDA 12.8.1 in the builder image while your runtime might still have been using CUDA 12.8.0?

$ docker run --rm -it nvidia/cuda:12.8.0-devel-ubuntu24.04

$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

If there is some other compatibility caveat, I'd appreciate more details, as --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' with PTX should have otherwise worked 🤔

@polarathene

polarathene commented Jun 22, 2025

This is not advisable btw:

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/

That updates /etc/ld.so.cache (equivalent of LD_LIBRARY_PATH) to include this location for libcuda.so.1, and also creates a libcuda.so symlink.

Instead you can replace compat/ with lib64/stubs which should have a libcuda.so file if needed for linking. This is only present in the devel image as it's only relevant to building. At runtime a proper libcuda.so should be provided.
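
A sketch of what that swap would look like (untested; lib64/stubs only ships the linker stub, while the real libcuda.so.1 is injected by the NVIDIA container runtime at run time):

# Register only the CUDA stub library for link-time resolution instead of the
# compat driver; the host driver's libcuda.so.1 takes over at runtime.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/lib64/stubs/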


If you try to use the image for runtime purposes with that, and the compat version of libcuda.so.1 is used instead of the one from your actual driver, this can introduce issues like the CUDA device not being detected.

These compat packages are not intended to be used with newer versions of CUDA on the host; you can't use CUDA 12.9 on the host and swap in an earlier CUDA 12.8 compat package.

@cyril23

cyril23 commented Jun 22, 2025

@cyril23 could you please provide context as to why that would be the case regarding PTX?

If you compile CUDA kernels with PTX, any earlier Compute Capability (CC) should be able to be compiled by newer GPUs, and they could use that.

@polarathene You're right that generally it should run with 10.0+PTX (or any older version + PTX). This is actually the first time I ran it without kernel problems; maybe before I had the wrong CUDA version, or flashinfer was not compatible, or who knows what I did wrong. Anyhow, today I built it from this branch neuralmagic:add-sm120-dockerfile with --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='10.0+PTX', and it did run, but somehow produced gibberish:
[screenshot: gibberish model output]

Build: build-10ptx.log

# building without SCCACHE and non-release, therefore better deactivating wheel check:
DOCKER_BUILDKIT=1 sudo docker build \
 --build-arg max_jobs=6 \
 --build-arg nvcc_threads=1 \
 --build-arg USE_SCCACHE=0 \
 --build-arg GIT_REPO_CHECK=1 \
 --build-arg RUN_WHEEL_CHECK=false \
 --build-arg CUDA_VERSION=12.8.1 \
 --build-arg torch_cuda_arch_list='10.0+PTX' \
 --tag wurstdeploy/vllm:sm100ptxonly \
 --target vllm-openai \
 --progress plain \
 -f docker/Dockerfile .

Run:

~/vllm$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000  wurstdeploy/vllm:sm100ptxonly    --model Qwen/Qwen3-0.6B
INFO 06-22 02:53:54 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 02:53:56 [api_server.py:1287] vLLM API server version 0.9.2.dev182+g47c454049
INFO 06-22 02:53:56 [cli_args.py:309] non-default args: {'model': 'Qwen/Qwen3-0.6B'}
INFO 06-22 02:54:06 [config.py:831] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-22 02:54:06 [config.py:1444] Using max model len 40960
INFO 06-22 02:54:06 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-22 02:54:09 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 02:54:10 [core.py:459] Waiting for init message from front-end.
INFO 06-22 02:54:10 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev182+g47c454049) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-22 02:54:12 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9b41cd5310>
INFO 06-22 02:54:12 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-22 02:54:12 [interface.py:383] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 06-22 02:54:12 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-22 02:54:12 [gpu_model_runner.py:1691] Starting to load model Qwen/Qwen3-0.6B...
INFO 06-22 02:54:12 [gpu_model_runner.py:1696] Loading model from scratch...
INFO 06-22 02:54:12 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-22 02:54:16 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 06-22 02:54:16 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.29it/s]

INFO 06-22 02:54:17 [default_loader.py:272] Loading weights took 0.79 seconds
INFO 06-22 02:54:17 [gpu_model_runner.py:1720] Model loading took 1.1201 GiB and 5.040871 seconds
INFO 06-22 02:54:21 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/ef58c0fce0/rank_0_0/backbone for vLLM's torch.compile
INFO 06-22 02:54:21 [backends.py:519] Dynamo bytecode transform time: 3.76 s
INFO 06-22 02:54:23 [backends.py:181] Cache the graph of shape None for later use
INFO 06-22 02:54:36 [backends.py:193] Compiling a graph for general shape takes 14.65 s
INFO 06-22 02:54:47 [monitor.py:34] torch.compile takes 18.40 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
INFO 06-22 02:54:47 [gpu_worker.py:232] Available KV cache memory: 26.91 GiB
INFO 06-22 02:54:47 [kv_cache_utils.py:716] GPU KV cache size: 251,920 tokens
INFO 06-22 02:54:47 [kv_cache_utils.py:720] Maximum concurrency for 40,960 tokens per request: 6.15x
WARNING 06-22 02:54:47 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graphs: 100%|██████████| 67/67 [00:14<00:00,  4.71it/s]
INFO 06-22 02:55:02 [gpu_model_runner.py:2196] Graph capturing finished in 14 secs, took 0.84 GiB
INFO 06-22 02:55:02 [core.py:172] init engine (profile, create kv cache, warmup model) took 44.51 seconds
INFO 06-22 02:55:02 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 15745
WARNING 06-22 02:55:02 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 06-22 02:55:02 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 06-22 02:55:03 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 06-22 02:55:03 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 06-22 02:55:03 [launcher.py:29] Available routes are:
INFO 06-22 02:55:03 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /health, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /load, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /version, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /score, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 06-22 02:55:18 [chat_utils.py:420] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 06-22 02:55:18 [logger.py:43] Received request chatcmpl-1f0f012cce7f4176b9d627ec57efc7a0: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n/no_think What is the capital of France? Tell me 2 sentences about it<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40924, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-22 02:55:18 [async_llm.py:270] Added request chatcmpl-1f0f012cce7f4176b9d627ec57efc7a0.
INFO 06-22 02:55:43 [loggers.py:118] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:55:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     172.17.0.1:35466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-22 02:56:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:56:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:57:33 [logger.py:43] Received request chatcmpl-f956c735835145b692f08178d679a17b: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n/no_think What is the capital of France? Tell me 2 sentences about it<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40924, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-22 02:57:33 [async_llm.py:270] Added request chatcmpl-f956c735835145b692f08178d679a17b.
INFO 06-22 02:57:43 [loggers.py:118] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 193.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 44.4%
INFO 06-22 02:57:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 44.4%
INFO 06-22 02:58:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 190.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 44.4%
INFO:     172.17.0.1:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-22 02:58:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 44.4%
INFO 06-22 02:58:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 44.4%

Another try with cURL, same prompt:
[screenshot: gibberish cURL response]

systeminfo.txt including nvidia smi etc.

By the way similar gibberish when running vLLM with -e VLLM_USE_FLASHINFER_SAMPLER=0: VLLM_USE_FLASHINFER_SAMPLER=0.txt

I've uploaded my build wurstdeploy/vllm:sm100ptxonly to Dockerhub in case you want to test it.
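
For reference, the request I'm sending looks roughly like this (reconstructed from the logged prompt above; any fields beyond model and messages are left at their defaults):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "/no_think What is the capital of France? Tell me 2 sentences about it"}
        ]
      }'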

@polarathene

indeed it ran, but somehow produced gibberish

I don't own a Blackwell GPU so I cannot test.

I have heard that CUDA 12.8 had some issues with compiling properly for Blackwell archs. Since you're on CUDA 12.9 on the host, perhaps try bumping the CUDA version in the Dockerfile stages you're using, or alternatively only build for an earlier arch like Ada (sm_89, RTX 4xxx) with PTX for CC 8.9 (I'm personally more interested in this to rule out the CUDA 12.8 / CC 10.0 concern).

If you still have issues with either of those, then try bringing the builder's CUDA version down to an earlier release (I still see some projects on CUDA 12.2 / 12.4 for their image builds).

It would be helpful information for other projects trying to update their support for Blackwell; I've seen a few other project PRs where there is some reluctance to bump the builder stage CUDA.

@polarathene

@cyril23 when you run the container, can you show this output?

# This is a public image from the CI, but you could use your image `wurstdeploy/vllm:sm100ptxonly`:
# (Use entrypoint to switch to bash shell instead in container instead)
$ docker run --rm -it --runtime nvidia --gpus all --entrypoint bash \
  public.ecr.aws/q9t5s3a7/vllm-release-repo:b6553be1bc75f046b00046a4ad7576364d03c835

# In the container run this command to see which `libcuda.so.1` is resolved:
# (this is the output without `--runtime nvidia --gpus all`, but it shouldn't have been cached)
$ ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so

Does it resolve to that same /usr/local/cuda-12.8/compat like above?

If it does run ldconfig command by itself with nothing else after it, and then repeat the same ldconfig -p with grep above, it should show a different path to your proper libcuda.so.1, such as:

ldconfig -p | grep -F libcuda.so
        libcuda.so.1 (libc6,x86-64) => /usr/lib64/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/lib64/libcuda.so

If it's already using /usr/lib64 by default nevermind 🤔


I'm also not quite sure why the runtime image is over 11GB (21GB uncompressed)?

There are the CUDA libs from the image itself (/usr/local/cuda, 7GB), plus another copy in the Python packages: /usr/local/lib/python3.12/dist-packages/nvidia is 4GB, and dist-packages is 11GB total.

I think there are some linking mistakes as a result of this...?

$ ls -lh /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib
total 857M
-rw-r--r-- 1 root root    0 Jun 10 12:08 __init__.py
-rw-r--r-- 1 root root 111M Jun 10 12:08 libcublas.so.12
-rw-r--r-- 1 root root 745M Jun 10 12:08 libcublasLt.so.12
-rw-r--r-- 1 root root 737K Jun 10 12:08 libnvblas.so.12


# Notice how the library is resolving `libcublasLt.so.12` to the non-local one instead?
$ ldd /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublas.so.12
        linux-vdso.so.1 (0x00007ffc0e7af000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007571c6800000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007571ffefb000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007571ffef6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007571ffef1000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007571ffe0a000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007571ffde8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007571c65d7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007571fff0a000)


# Different number of cubins for `sm_120`, these two lib copies are not equivalent:
$ cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 | grep sm_120 | wc -l
1380

$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublasLt.so.12 | grep sm_120 | wc -l
1432

I wouldn't be surprised if the above contributes to some of the issues encountered?

@cyril23

cyril23 commented Jun 22, 2025

@polarathene

In the container run this command to see which libcuda.so.1 is resolved:

~$ sudo docker run --rm -it --runtime nvidia --gpus all --entrypoint bash \
  public.ecr.aws/q9t5s3a7/vllm-release-repo:b6553be1bc75f046b00046a4ad7576364d03c835
root@666374f39152:/vllm-workspace# ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
root@666374f39152:/vllm-workspace# exit
exit
freeuser@computer:~$ sudo docker run --rm -it --runtime nvidia --gpus all --entrypoint bash   wurstdeploy/vllm:sm100ptxonly
root@0c651f6c9519:/vllm-workspace# ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
root@0c651f6c9519:/vllm-workspace# exit
exit

I have heard that CUDA 12.8 had some issues with compiling properly for blackwell archs, as you're on CUDA 12.9 on the host,

Here is a test with my wurstdeploy/vllm:sm100ptxonly on simplepod.ai (not affiliated with them, I only use it for testing) with an NVIDIA GeForce RTX 5060 Ti and an older CUDA version, 12.8:

root@rri_UWM3TmWsoh7cshEI:/vllm-workspace# nvidia-smi
Sun Jun 22 05:39:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02             Driver Version: 570.153.02     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 35%   36C    P1             15W /  180W |   13678MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             159      C   /usr/bin/python3                      13668MiB |
+-----------------------------------------------------------------------------------------+
root@rri_UWM3TmWsoh7cshEI:/vllm-workspace# 

[screenshot]
Similar gibberish:
[screenshot: gibberish output]

or alternatively only build for an earlier arch like Ada (sm_89, RTX 4xxx) with PTX for CC 8.9 (I'm personally more interested in this to rule out the CUDA 12.8 / CC 10.0 concern).

I'll do a build now with sm 89 + ptx, and test it tonight:

DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=6 \
  --build-arg nvcc_threads=1 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg RUN_WHEEL_CHECK=false \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='8.9+PTX' \
  --tag wurstdeploy/vllm:sm89ptxonly \
  --target vllm-openai  \
  --progress plain \
  -f docker/Dockerfile .

@Johann-Foerster

my 5090 is really looking forward to this being released :) keep up the good work

@cyril23

cyril23 commented Jun 22, 2025

I'll do a build now with sm 89 + ptx, and test it tonight:

@polarathene same gibberish with '8.9+PTX' for my RTX 5090, see logs build+run_8.9+PTX.log (inference starts at line 7080)

edit: I've pushed this 8.9+PTX build too: wurstdeploy/vllm:sm89ptxonly, which is built with --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='8.9+PTX'.

edit: I've tested wurstdeploy/vllm:sm89ptxonly with an RTX 4060 Ti too, but that machine has CUDA 12.7, so I get this:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.8, please update your driver to a newer version, or use an earlier cuda container: unknown. Please contact with support.

edit: I could try building with --build-arg CUDA_VERSION=12.7 --build-arg torch_cuda_arch_list='8.9+PTX', but I need a matching nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 base image, and the closest tag I can find here is 12.6.3, which should also be fine for Ada Lovelace. Therefore I'll do another build with --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='8.9+PTX'

edit: result of --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='8.9+PTX': builderror_8.9+PTX_CUDA_12.6.3.log; in short:

#35 10.51 FAILED: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o
#35 10.51 /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /vllm-workspace/flashinfer/include -isystem /vllm-workspace/flashinfer/csrc -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/include -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/tools/util/include -isystem /vllm-workspace/flashinfer/3rdparty/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_90a,code=sm_90a -O3 -std=c++17 --threads=4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /vllm-workspace/flashinfer/build/aot/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o
#35 10.51 nvcc fatal   : Unsupported gpu architecture 'compute_100a'
#35 16.82 [18/554] c++ -MMD -MF logging/logging.o.d -DTORCH_EXTENSION_NAME=logging -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/vllm-workspace/flashinfer/3rdparty/spdlog/include -I/vllm-workspace/flashinfer/include -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /vllm-workspace/flashinfer/include -isystem /vllm-workspace/flashinfer/csrc -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/include -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/tools/util/include -isystem /vllm-workspace/flashinfer/3rdparty/spdlog/include -fPIC -O3 -std=c++17 -Wno-switch-bool -c /vllm-workspace/flashinfer/csrc/logging.cc -o logging/logging.o
#35 16.82 ninja: build stopped: subcommand failed.

@cyril23

cyril23 commented Jun 22, 2025

@polarathene

I have heard that CUDA 12.8 had some issues with compiling properly for blackwell archs, as you're on CUDA 12.9 on the host, perhaps try version bumping the CUDA version in the Dockerfile stages you're using

I've tried building with --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.9+PTX' and got a build error with this combination: builderror_8.9+PTX_CUDA_12.6.3.log; in short:

#27 21.90 CMake Error at /usr/local/lib/python3.12/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:186 (set_property):
#27 21.90   The link interface of target "torch::nvtoolsext" contains:
#27 21.90
#27 21.90     CUDA::nvToolsExt
#27 21.90
#27 21.90   but the target was not found.

I've tried building with --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='10.0+PTX' and got a similar build error: builderror_10.0+PTX_CUDA_12.9.1.txt; in short:

#31 21.92 CMake Error at /usr/local/lib/python3.12/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:186 (set_property):
#31 21.92   The link interface of target "torch::nvtoolsext" contains:
#31 21.92
#31 21.92     CUDA::nvToolsExt
#31 21.92
#31 21.92   but the target was not found.

Not sure what else to test or how to modify the Dockerfile accordingly to make it work. But I think we should make a separate issue about that.

Anyway, my takeaway is that building with PTX is not worth it: to support a new GPU generation, too many parameters must align (the CUDA toolkit and the host's GPU driver must match, plus PyTorch, probably some libraries and modules, and the CUDA base image), and even then it might result in gibberish output. So I think we should just omit the +PTX flag.

@celsowm

celsowm commented Jun 24, 2025

root@srv-ia-010:/var/tmp# curl -O https://raw.githubusercontent.com/vllm-project/vllm/2dd24ebe1538be19fd7b3da8d2bfeed45b0955c4/docker/Dockerfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16424  100 16424    0     0  46974      0 --:--:-- --:--:-- --:--:-- 46925
root@srv-ia-010:/var/tmp# docker build -t vllm:custom -f Dockerfile .

got this warning:

WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 163)
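
That warning only concerns keyword casing on the FROM line it points to; a sketch of the kind of edit that silences it (the image and stage names are hypothetical, not quoted from line 163 of the Dockerfile):

# Before: FROM nvidia/cuda:12.8.1-base-ubuntu22.04 as vllm-openai   <- triggers FromAsCasing
# After, with matching keyword casing:
FROM nvidia/cuda:12.8.1-base-ubuntu22.04 AS vllm-openai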

@mgoin mgoin added this to the v0.9.2 milestone Jun 25, 2025
@aarnphm aarnphm changed the title from "Add SM120 to the Dockerfile" to "[CI] Add SM120 to the Dockerfile" Jun 25, 2025
@aarnphm aarnphm enabled auto-merge (squash) June 25, 2025 16:40
@WoosukKwon WoosukKwon disabled auto-merge June 25, 2025 23:23
@WoosukKwon WoosukKwon merged commit 296ce95 into vllm-project:main Jun 25, 2025
96 of 101 checks passed
m-misiura pushed a commit to m-misiura/vllm that referenced this pull request Jun 26, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
Labels: ci/build, ready