Update Dockerfile to build for Blackwell #18095
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of essential tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
Signed-off-by: mgoin <mgoin64@gmail.com>
Thanks for the prompt fix, @mgoin! Attaching the eval results (perf results in a separate comment below) with this fix on GB200. All experiments were conducted with the latest flashinfer commit (25fb40) plus cherry-picking vLLM PR #15777.

Evals:
- Llama-3.1-8B
- Llama-3.2-1B
- Qwen2.5-7B
- QwQ-32B-FP8
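For anyone reproducing the evals, here is a minimal sketch using lm-eval-harness's vLLM backend; the model name, task, and batch size are illustrative placeholders, since the exact eval configuration was not posted above:

```sh
# Sketch: run a GSM8K eval through lm-eval's vLLM integration.
# pretrained= and --tasks are placeholders, not the exact settings used above.
pip install "lm-eval[vllm]"
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct,tensor_parallel_size=1 \
    --tasks gsm8k \
    --batch_size auto
```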
Some perf numbers for the FlashInfer backend and FlashAttention V2 backend on GB200, using the same settings as the evals above:
- Llama 8B at 1024/128 input/output tokens
- Llama 8B at 1000/1000 input/output tokens
- QwQ 32B FP8-dynamic TP=2 at 1000/1000 input/output tokens
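For context, a hedged sketch of how such numbers can be collected with vLLM's bundled throughput benchmark; the exact invocation behind the tables above was not posted, so the model and token counts here are illustrative:

```sh
# Sketch: throughput benchmark with the FlashInfer attention backend at
# 1000 input / 1000 output tokens. VLLM_ATTENTION_BACKEND selects the backend.
VLLM_ATTENTION_BACKEND=FLASHINFER \
python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 1000 \
    --output-len 1000 \
    --num-prompts 200
```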
It seems that if we build with SM 10.0 + 12.0, the wheel size increases to 450 MB.
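As a quick sanity check on what a given build actually targets, one can list the compiled SM architectures; a sketch, assuming the wheels are installed locally (the extension filename below is illustrative and may differ):

```sh
# Sketch: show the CUDA arch list the installed torch build supports,
# then inspect a compiled extension's embedded cubins with cuobjdump.
python -c "import torch; print(torch.cuda.get_arch_list())"
cuobjdump --list-elf "$(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')/_C.abi3.so"
```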
Signed-off-by: mgoin <mgoin64@gmail.com>
docker/Dockerfile (outdated)

```diff
-FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
-    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.2.post1" ; \
+FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' \
+    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@948a14622bd624773918d738b0f66137a9ac4784" ; \
```
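For reference, the equivalent standalone step outside the Dockerfile might look like the sketch below, assuming torch and a CUDA toolchain are already set up (swap `uv pip` for plain `pip` as needed):

```sh
# Sketch: AOT-build FlashInfer for the listed architectures, including
# Blackwell (SM 10.0). The commit hash matches the Dockerfile pin above.
export FLASHINFER_ENABLE_AOT=1
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
uv pip install --system --no-build-isolation \
    "git+https://github.com/flashinfer-ai/flashinfer@948a14622bd624773918d738b0f66137a9ac4784"
```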
Is it possible to base this on a release tag rather than a commit? A pinned commit will be very hard for users to consume as a dependency.
There isn't one available atm.
That commit contains the Blackwell kernels, which is the reason for the upgrade.
Extended-timeout build here: https://buildkite.com/vllm/ci/builds/20161/steps
Let's try to get #15777 in.
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Longer-timeout build: https://buildkite.com/vllm/ci/builds/20203/steps
Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin, sampler fixes merged. Can you resolve the conflict?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
This reverts commit dcfe952. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
Updates the Dockerfile to build wheels for Blackwell (SM 10.0) and includes the latest FlashInfer for performant Blackwell attention support (FIX #17325). We didn't include SM 12.0 for now because of wheel-size concerns.

Updates to the latest flashinfer main as of 5/15, since there isn't a release yet: flashinfer-ai/flashinfer@e00e8ce
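A minimal sketch of building the image with the updated Dockerfile, assuming a repo checkout with BuildKit enabled; the build args mirror knobs already exposed by docker/Dockerfile, and the tag is illustrative:

```sh
# Sketch: build the OpenAI-compatible server image from the repo root.
# max_jobs / nvcc_threads trade build parallelism for memory; tune per machine.
DOCKER_BUILDKIT=1 docker build . \
    -f docker/Dockerfile \
    --target vllm-openai \
    --build-arg max_jobs=32 \
    --build-arg nvcc_threads=4 \
    -t vllm/vllm-openai:blackwell
```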