[CI] Add SM120 to the Dockerfile #19794
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add … 🚀
Summary of Changes
Hello @mgoin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to expand the Docker image's compatibility with newer NVIDIA GPU architectures by adding support for SM 12.0. This change is crucial for leveraging the latest hardware capabilities within the Dockerized environment, building upon a previous effort to manage wheel size limits.
Highlights

- CUDA Architecture Support: I've updated the `torch_cuda_arch_list` build argument in the Dockerfile to include `12.0`. This enables PyTorch builds within the Docker image to support NVIDIA GPUs with SM (Streaming Multiprocessor) capability 12.0, which corresponds to newer GPU architectures.
- FlashInfer Compilation: I've also extended the `TORCH_CUDA_ARCH_LIST` environment variable to include `12.0` specifically for the `flashinfer` library's compilation path. This ensures that `flashinfer` is built with support for SM 12.0 when not using pre-built wheels for CUDA 12.8.
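As a rough illustration of how the updated build argument could be exercised locally, here is a minimal sketch; the architecture values and image tag below are assumptions for illustration, not the Dockerfile's actual defaults.

```bash
# Hypothetical local build overriding the arch list to include SM 12.0; the
# values shown are illustrative, not the Dockerfile's actual defaults.
DOCKER_BUILDKIT=1 docker build \
  --file docker/Dockerfile \
  --build-arg torch_cuda_arch_list='8.0 8.9 9.0 10.0 12.0' \
  --tag vllm-sm120:dev \
  .
```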
Code Review
This PR updates the Dockerfile to include CUDA Compute Capability SM 12.0 in the `TORCH_CUDA_ARCH_LIST`. This affects the main vLLM wheel build process and the compilation of `flashinfer` from source for non-default CUDA versions (i.e., when `CUDA_VERSION` is not `12.8.*`).

The changes themselves appear correct for enabling SM 12.0 in these compilation steps.

My main feedback point, detailed in a specific comment, is a request for clarification regarding `flashinfer`'s SM 12.0 support when using the default `CUDA_VERSION=12.8.1`. For this version, a pre-built `flashinfer` wheel is used, and it's important to understand whether that pre-built wheel also includes SM 12.0 capabilities.

Additionally, for documentation consistency within the Dockerfile, the example `TORCH_CUDA_ARCH_LIST` for `flashinfer` in the comment block on lines 246-255 (specifically line 247, which currently shows `'7.5 8.0 8.9 9.0a 10.0a'`) could be updated to include `12.0`. This would help future maintainers by reflecting the architectures now typically compiled for `flashinfer` as a result of this PR's changes. Since this comment block is outside the diff, this is a suggestion for general consideration.
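For reference, setting the architecture list when compiling a CUDA extension such as `flashinfer` from source generally looks like the following. This is a hedged sketch only: the arch values mirror the Dockerfile comment quoted above plus `12.0`, while the repository URL and install command are illustrative and not the exact steps used in the Dockerfile.

```bash
# Sketch: compile a CUDA extension from source with SM 12.0 in the arch list.
# The values mirror the Dockerfile comment quoted above, plus 12.0; the repo URL
# and install command are illustrative, not the exact steps from the Dockerfile.
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0a 12.0'
git clone https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
pip install -v --no-build-isolation .
```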
What's the new wheel size? :-)
The wheel is 365MB!
Sounds awesome! I'll try to confirm. Currently building the whole thing on my desktop; it'll take a while.
Do you mean for SM 120 (…)?

Unfortunately we still can't update the defaults of the Dockerfile to include SM120 without touching anything else, because it'd be applied to building the CUDA 12.8 wheel here too, and PyPI's current limit of 400 MB is too low (even increasing it to 800 MB would not be enough).

I prefer solution 1.2. What do you guys think? @mgoin
Hey @cyril23, thanks for the concern, but the "build image" job in CI succeeds. This is the source of truth for wheel size and is now building for SM 12.0 as well. I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with debug information rather than a proper Release build like we use for CI and release?
My wheels are bigger because I build with `USE_SCCACHE=0` and thus without `CMAKE_BUILD_TYPE=Release`.
Now I've verified that using …

By the way, I've further tested that using …
In order to test it without using SCCACHE I've modified my Dockerfile as follows (I'll make an issue about it):

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 8d4375470..ae866edd0 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -112,6 +112,7 @@ ENV MAX_JOBS=${max_jobs}
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
+ARG CMAKE_BUILD_TYPE=Release
ARG USE_SCCACHE
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2
@@ -129,7 +130,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
&& export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
&& export SCCACHE_IDLE_TIMEOUT=0 \
- && export CMAKE_BUILD_TYPE=Release \
+ && export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} \
&& sccache --show-stats \
&& python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
&& sccache --show-stats; \
@@ -143,6 +144,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
# Clean any existing CMake artifacts
rm -rf .deps && \
mkdir -p .deps && \
+ export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} && \
python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
 fi

So let's merge! 👍
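With a change like the diff above, the build type could then be overridden from the command line. A minimal sketch (the image tag is illustrative only):

```bash
# Hypothetical invocation of the patched Dockerfile: build without sccache while
# still forcing a Release build, so the wheel size stays comparable to CI.
docker build \
  --file docker/Dockerfile \
  --build-arg USE_SCCACHE=0 \
  --build-arg CMAKE_BUILD_TYPE=Release \
  --tag vllm-local-release:dev \
  .
```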
With the new FlashInfer wheel, I've tried it out with an RTX 5090 (but just built it using …).

edit: by the way, the wheel size is pretty much the same as with the old FlashInfer version (compared to #19794 (comment)).
@cyril23 could you please provide context as to why that would be the case regarding PTX? If you compile CUDA kernels with PTX, any earlier Compute Capability (CC) should be able to be JIT-compiled by newer GPUs, and they could use that. There would be some overhead (at least on first run) as PTX is compiled to cubin at runtime, and not targeting the newer CC of that GPU would be less optimal (the perf impact varies), but it should still work. The only time this doesn't really work out is when the PTX is built with a newer version of CUDA than the runtime uses. The builder image is using CUDA 12.8.1.

Beyond that, since you're also relying on PyTorch, which bundles its own CUDA libraries, depending on the CUDA release there you'll also have each library with embedded PTX/cubins. If they are lacking sm_120 support, that could also cause problems.

You've not mentioned what version of CUDA you're using at runtime, but it's possible that the compatibility issue was related to these caveats I've described. CUDA 12.8.0 can target the following architectures:

$ docker run --rm -it nvidia/cuda:12.8.0-devel-ubuntu24.04
$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

If there is some other compatibility caveat, I'd appreciate more details.
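To make the PTX-vs-cubin distinction concrete, here is a generic `nvcc` sketch (not taken from the vLLM build; `kernel.cu` is a placeholder) showing how `code=compute_XX` embeds PTX that newer GPUs can JIT, while `code=sm_XX` embeds only a binary cubin for that exact architecture:

```bash
# Embed a cubin for SM 10.0 *and* PTX for compute_100, so newer GPUs
# (e.g. SM 12.0) can JIT the PTX at runtime. This mirrors what
# TORCH_CUDA_ARCH_LIST="10.0+PTX" requests from the PyTorch build system.
nvcc -gencode arch=compute_100,code=sm_100 \
     -gencode arch=compute_100,code=compute_100 \
     -c kernel.cu -o kernel_with_ptx.o

# Embed only the cubin; a GPU of a different SM version cannot run this.
nvcc -gencode arch=compute_100,code=sm_100 -c kernel.cu -o kernel_cubin_only.o

# Inspect what actually got embedded (requires the CUDA toolkit):
cuobjdump --list-elf kernel_with_ptx.o
cuobjdump --list-ptx kernel_with_ptx.o
```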
This is not advisable btw:

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/

That updates the linker cache so the compat libcuda takes precedence. Instead you can replace …

If you try to use the image for runtime purposes with that, and the compat version of … These compat packages are not intended to be used with newer versions of CUDA; you can't use CUDA 12.9 on the host and swap in an earlier CUDA 12.8 compat package.
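A quick way to sanity-check the driver/compat situation at runtime (a sketch; not part of the Dockerfile) is to compare the host driver's supported CUDA version against whichever `libcuda.so.1` the container actually resolves:

```bash
# On the host: which driver is installed and what CUDA version does it support?
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4   # the banner also shows "CUDA Version: ..."

# Inside the container: which libcuda.so.1 wins, the driver-injected one or the compat one?
ldconfig -p | grep -F 'libcuda.so'
```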
@polarathene You're right that generally it should run with 10.0+PTX (or any older version + PTX). And this is actually the first time I ran it without kernel problems; maybe before I had the wrong CUDA version, or flashinfer was not compatible, or who knows what I did wrong. Anyhow, today I've built it from this branch.

Build: build-10ptx.log
Run: …
Another try with cURL, same prompt: …
systeminfo.txt including nvidia-smi etc.

By the way, similar gibberish when running vLLM with … I've uploaded my build …
I don't own a Blackwell GPU so I cannot test. I have heard that CUDA 12.8 had some issues with compiling properly for Blackwell archs; as you're on CUDA 12.9 on the host, perhaps try version bumping the CUDA version in the builder image to match. If you still have issues with either of those, then try bringing the CUDA version of the builder down to an earlier release (I still see some projects on CUDA 12.2 / 12.4 for their image builds).

It would be helpful information for other projects trying to update their support for Blackwell; I've seen a few other project PRs where there is some reluctance to bump the builder stage CUDA.
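If someone wants to experiment with that suggestion, the CUDA toolkit used by the builder is exposed via the `CUDA_VERSION` build argument referenced earlier in this thread. A hedged sketch, with an illustrative version and tag:

```bash
# Hypothetical: rebuild the image against a different CUDA base by overriding
# the CUDA_VERSION build argument. Whether a matching base image tag exists
# must be checked separately.
docker build \
  --file docker/Dockerfile \
  --build-arg CUDA_VERSION=12.9.0 \
  --tag vllm-cuda129:dev \
  .
```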
@cyril23 when you run the container, can you show this output?

# This is a public image from the CI, but you could use your image `wurstdeploy/vllm:sm100ptxonly`:
# (Use --entrypoint to switch to a bash shell in the container instead)
$ docker run --rm -it --runtime nvidia --gpus all --entrypoint bash \
public.ecr.aws/q9t5s3a7/vllm-release-repo:b6553be1bc75f046b00046a4ad7576364d03c835
# In the container run this command to see which `libcuda.so.1` is resolved:
# (this is the output without `--runtime nvidia --gpus all`, but it shouldn't have been cached)
$ ldconfig -p | grep -F 'libcuda.so'
libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so

Does it resolve to that same compat path? If it does, run:

$ ldconfig -p | grep -F libcuda.so
libcuda.so.1 (libc6,x86-64) => /usr/lib64/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib64/libcuda.so

If it's already using …

I'm also not quite sure why the runtime image is over 11GB (21GB uncompressed)? There's the CUDA libs from the image itself (…). I think there are some linking mistakes as a result of this...?

$ ls -lh /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib
total 857M
-rw-r--r-- 1 root root 0 Jun 10 12:08 __init__.py
-rw-r--r-- 1 root root 111M Jun 10 12:08 libcublas.so.12
-rw-r--r-- 1 root root 745M Jun 10 12:08 libcublasLt.so.12
-rw-r--r-- 1 root root 737K Jun 10 12:08 libnvblas.so.12
# Notice how the library is resolving `libcublasLt.so.12` to the non-local one instead?
$ ldd /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublas.so.12
linux-vdso.so.1 (0x00007ffc0e7af000)
libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007571c6800000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007571ffefb000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007571ffef6000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007571ffef1000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007571ffe0a000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007571ffde8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007571c65d7000)
/lib64/ld-linux-x86-64.so.2 (0x00007571fff0a000)
# Different number of cubins for `sm_120`, these two lib copies are not equivalent:
$ cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 | grep sm_120 | wc -l
1380
$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublasLt.so.12 | grep sm_120 | wc -l
1432

I wouldn't be surprised if the above contributes to some of the issues encountered?
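To extend that check, one could count the embedded `sm_120` cubins of both cuBLAS copies in one pass; a small sketch using the same `cuobjdump` tool shown above (the paths are the ones from the listing and may differ in other images):

```bash
# Count sm_120 cubins in both libcublasLt copies to confirm they differ.
# Paths taken from the listing above; adjust if your image differs.
for lib in \
    /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 \
    /usr/local/cuda/lib64/libcublasLt.so.12; do
  echo "== ${lib}"
  cuobjdump --list-elf "${lib}" | grep -c sm_120
done
```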
Here is a test with my …

I'll do a build now with sm 89 + ptx, and test it tonight:
my 5090 is really looking forward to this being released :) keep up the good work
@polarathene same gibberish with …

edit: I've pushed this build 8.9+ptx too: …

edit: I've tested …

edit: I can try building …

edit: result of …
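For reproducing the gibberish check, a minimal request against vLLM's OpenAI-compatible server looks roughly like this (the model name and port are assumptions; adjust them to whatever is actually being served):

```bash
# Assumes the server was started with `vllm serve <model>` on the default port 8000;
# the model name below is just a placeholder.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "your-served-model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0
      }'
```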
I've tried building with …

I've tried building with …

Not sure what else to test or how to modify the Dockerfile accordingly to make it work, but I think we should open a separate issue about that. Anyway, my takeaway is that building with PTX is not worth it: in order to support a new GPU generation, so many parameters must align (the CUDA toolkit and the host's GPU driver must match, plus PyTorch, probably some libraries and modules, and the CUDA base image), and even then it might result in gibberish output. So I think we should just omit the +PTX suffix.
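As a final cross-check after any of these builds, one can confirm which architectures the installed PyTorch wheel was compiled for; a small sketch:

```bash
# Print the CUDA arch list baked into the installed PyTorch wheel.
python3 -c 'import torch; print(torch.version.cuda, torch.cuda.get_arch_list())'
# Compute capability of the local GPU (requires a visible GPU).
python3 -c 'import torch; print(torch.cuda.get_device_capability(0))'
```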
got this warning: …
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Now that #19336 has landed, maybe we can add SM 12.0 without going over the 400MB wheel limit
EDIT: The wheel is 365MB!