
Update Dockerfile to build for Blackwell #18095

Merged
9 commits merged into vllm-project:main on May 17, 2025

Conversation

@mgoin (Member) commented May 13, 2025

Updates the Dockerfile to build wheels for Blackwell (SM 10.0) and to include the latest FlashInfer for performant Blackwell attention support (FIX #17325). We didn't include SM 12.0 for now because of wheel-size concerns.

Updates to the latest FlashInfer main as of 5/15, since there isn't a release yet: flashinfer-ai/flashinfer@e00e8ce
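
For context, the change boils down to extending the CUDA arch lists used during the image build: the list for the vLLM wheel itself and the one passed to the FlashInfer AOT build (the latter is shown in the reviewed Dockerfile lines further down). A minimal sketch of the idea, assuming the Dockerfile exposes the wheel arch list through a build arg named torch_cuda_arch_list (the ARG name and default values here are illustrative, not copied from the actual Dockerfile):

# Hypothetical Dockerfile excerpt: add SM 10.0 (Blackwell) to the arch list used for the vLLM wheel
ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}

# The image is then built as usual, e.g.:
# $ DOCKER_BUILDKIT=1 docker build --target vllm-openai -t vllm-blackwell .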

Signed-off-by: mgoin <mgoin64@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the ci/build label on May 13, 2025
@simon-mo added this to the v0.9.0 milestone on May 13, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
@chenyang78 (Contributor) commented May 13, 2025

Thanks for the prompt fix, @mgoin! Attaching the eval results (perf results are in a separate comment below) with this fix on GB200. All experiments were conducted with the latest FlashInfer commit (25fb40) plus cherry-picking vLLM PR #15777.

Evals:

Llama-3.1-8B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7498|±  |0.0119|

Llama-3.2-1B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto 
...
vllm (pretrained=meta-llama/Llama-3.2-1B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3321|±  | 0.013|
|     |       |strict-match    |     5|exact_match|↑  |0.3321|±  | 0.013|

Qwen2.5-7B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-7B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8226|±  |0.0105|
|     |       |strict-match    |     5|exact_match|↑  |0.7839|±  |0.0113|

QwQ-32B-FP8

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4496|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.7346|±  |0.0122|

@chenyang78 (Contributor) commented:
Some perf numbers for the FlashInfer backend and FlashAttention V2 backend on GB200, using the same settings as the evals above.

Llama 8B at 1024/128 input/output tokens:

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1024 --output-len 128

Throughput: 50.68 requests/s, 58330.33 total tokens/s, 6486.70 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1024 --output-len 128

Throughput: 46.47 requests/s, 53479.79 total tokens/s, 5948.69 output tokens/s

Llama 8B at 1000/1000 input/output tokens:

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 10.01 requests/s, 20002.62 total tokens/s, 10006.33 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 11.00 requests/s, 21982.08 total tokens/s, 10999.94 output tokens/s

QwQ 32B FP8-dynamic, TP=2, at 1000/1000 input/output tokens:

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model RedHatAI/QwQ-32B-FP8-dynamic --tensor-parallel-size=2 --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 7.46 requests/s, 14892.90 total tokens/s, 7455.20 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model RedHatAI/QwQ-32B-FP8-dynamic --tensor-parallel-size=2 --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 6.60 requests/s, 13185.27 total tokens/s, 6600.92 output tokens/s

@mgoin (Member, Author) commented May 14, 2025

It seems that if we build with SM 10.0 + 12.0, the wheel size increases to 450 MB:

[2025-05-13T23:00:45Z] #32 0.715 Not allowed: Wheel dist/vllm-0.8.5.dev650+g114a0f311-cp38-abi3-linux_x86_64.whl is larger (450.40 MB) than the limit (400 MB).
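
For reference, a rough sketch of how that limit could be checked locally after a build (the real CI check may differ; the 400 MB limit is taken from the log line above, and the dist/ path is assumed):

# Sketch: flag a wheel that exceeds the 400 MB limit
$ WHEEL=$(ls dist/vllm-*.whl)
$ SIZE_MB=$(du -m "$WHEEL" | cut -f1)
$ [ "$SIZE_MB" -le 400 ] && echo "OK: $WHEEL is ${SIZE_MB} MB" || echo "Not allowed: $WHEEL is ${SIZE_MB} MB (limit 400 MB)"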

Signed-off-by: mgoin <mgoin64@gmail.com>
Dockerfile change under review (old vs. new FlashInfer install line):

-FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
-    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.2.post1" ; \
+FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' \
+    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@948a14622bd624773918d738b0f66137a9ac4784" ; \


Is it possible to base this on a release tag rather than a commit? This will be very hard for different users to consume as a dependency.

Collaborator:

There isn't one available at the moment.

Collaborator:

That commit contains the Blackwell kernels, which was the reason for the upgrade.
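
Until FlashInfer cuts a release that includes the Blackwell kernels, a quick sanity check on the resulting image is to confirm which FlashInfer build got installed and that it actually contains SM 10.0 code. A rough sketch, assuming the flashinfer package exposes __version__ and that cuobjdump from the CUDA toolkit is on PATH (the .so path is a placeholder):

# Sketch: verify the installed FlashInfer build and its compiled SM targets
$ python -c "import flashinfer, os; print(flashinfer.__version__); print(os.path.dirname(flashinfer.__file__))"
$ cuobjdump --list-elf <path-to-a-flashinfer-.so> | grep -o 'sm_[0-9]*' | sort -u   # expect sm_100 in the list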

@simon-mo (Collaborator) commented:
Extended-timeout build here: https://buildkite.com/vllm/ci/builds/20161/steps

@simon-mo (Collaborator) commented May 15, 2025

FAILED samplers/test_rejection_sampler.py::test_compare_nonflashinfer_backend[cuda:0-1-30000-6] - RuntimeError: target_probs must be a 3D tensor

Let's try to get #15777 in

@mgoin changed the title from "Update FlashInfer" to "Update Dockerfile to build for Blackwell" on May 15, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on May 16, 2025
@simon-mo (Collaborator) commented:
Longer-timeout build: https://buildkite.com/vllm/ci/builds/20203/steps

@simon-mo (Collaborator) commented:
@mgoin, sampler fixes merged. Can you resolve the conflict?


mergify bot commented May 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on May 16, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
mergify bot removed the needs-rebase label on May 16, 2025
@simon-mo merged commit dcfe952 into vllm-project:main on May 17, 2025
86 of 93 checks passed
markmc added a commit to markmc/vllm that referenced this pull request May 21, 2025
This reverts commit dcfe952.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: minpeter <kali2005611@gmail.com>
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Integrate FlashInfer Blackwell kernels
4 participants