
[Core][Bugfix] new way for full cudagraph, add support for FA2 and FlashInfer; Two minor bugs fixed #20050


Closed · wants to merge 46 commits

Changes from all commits (46 commits)
6302a7d
full_cudagraph support for FA2
fhl2000 Jun 20, 2025
d5943f0
fix typing error: replace some list[int] with List[int]
fhl2000 Jun 20, 2025
4c6fc32
minor fix
fhl2000 Jun 20, 2025
7339260
fix the arch support in CMakeLists.txt to include 8.9
fhl2000 Jun 20, 2025
0be8df3
[CI][Neuron] Fail and exit on first error (#19622)
elaineyz Jun 20, 2025
f6f4e71
[Benchmark] Fix `Value of type "SampleRequest" is not indexable` (#18…
b8zhong Jun 20, 2025
376ce81
[Chore]: qwen3-moe-type-hints-mistake (#19860)
Xerxes-cn Jun 20, 2025
3c77ffa
[Bugfix] Enable PP with AITER+V1 (#19822)
qli88 Jun 20, 2025
e12a111
[Bugfix][Ray] Set the cuda context eagerly in the ray worker (#19583)
kouroshHakha Jun 20, 2025
5f24762
[Misc] update cuda version (#19526)
reidliu41 Jun 20, 2025
1c4333d
[Misc] refactor example - openai_transcription_client (#19851)
reidliu41 Jun 20, 2025
4c1e40c
[Kernel] correct cpu worker function parameter type (#19745)
andyxning Jun 20, 2025
aa2ce41
[Fix] import regex instead of re (#19875)
tdoublep Jun 20, 2025
7047d65
[Model] GPT2ForSequenceClassification model (#19663)
nie3e Jun 20, 2025
672ea2a
[custom_op][vllm-plugin] update custom_op class to use op_registry (#…
xuechendi Jun 20, 2025
c90f70b
Export NaNs in logits to scheduler_stats if output is corrupted (#18777)
vladmihailescu Jun 20, 2025
a56ec6f
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tes…
bigPYJ1151 Jun 20, 2025
e4c65a3
FlashInfer full cuda graph support
fhl2000 Jun 23, 2025
35b8e96
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (#19749)
andyxning Jun 20, 2025
af35cf5
[Misc] Clean up useless code (#19889)
wangxiyuan Jun 20, 2025
cfbaaff
Fix: Check the type of params to be a Sequence not list. (#19910)
rabinadk1 Jun 20, 2025
c0079d4
[Bugfix] Fix bnb 8bit model weights loading (#19917)
Isotr0py Jun 21, 2025
66cf208
[New model support]Support Tarsier2 (#19887)
princepride Jun 21, 2025
381d959
[doc] add contact us in community (#19922)
reidliu41 Jun 21, 2025
aff9702
[Multimodal] Optimize Qwen2/2.5-VL startup time (#19756)
WoosukKwon Jun 21, 2025
c274674
[Docs] Add GPT2ForSequenceClassification to supported models in docs …
nie3e Jun 21, 2025
4da0a1d
[Misc] add vllm_config in __init__ (#19866)
andyxning Jun 22, 2025
7a185b2
[MISC] add cpu_kvcache_space_bytes to CacheConfig (#19812)
andyxning Jun 22, 2025
b100107
[Benchmark] fix request loss if "ping" is returned (#19535)
sywangyi Jun 22, 2025
3439983
[CI/Build] Auto tag perf benchmarks related PRs (#19943)
22quinn Jun 22, 2025
9ff2bb1
[doc] use snippets for contact us (#19944)
reidliu41 Jun 22, 2025
ed3cba2
[Misc] Update model-specific PR tagging (#19949)
ywang96 Jun 22, 2025
2362d8b
[Misc] Simplify vllm bench cli subcommand implementation (#19948)
yeqcharlotte Jun 22, 2025
78a9270
[Chore] dedup logs (#19955)
aarnphm Jun 22, 2025
2c00d23
[BugFix] Add an env to disable moe chunking to work around compile in…
yeqcharlotte Jun 22, 2025
708948b
[Perf][CLI] Improve overall startup time (#19941)
aarnphm Jun 22, 2025
6dc86aa
[Core] feat: Implement Priority Scheduling in V1 Engine (#19057)
amitm02 Jun 23, 2025
cba9f52
[Misc] Configurable timeout for execute_model RPC calls via env var (…
jinqinn Jun 23, 2025
ceba56c
Fix(models/siglip): Add compatibility for Gemma models quantized by l…
Flink-ddd Jun 23, 2025
de1b605
[doc] Fold long code blocks to improve readability (#19926)
reidliu41 Jun 23, 2025
016f49d
[P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments (#1…
NickLucche Jun 23, 2025
d061382
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned u…
lk-chen Jun 23, 2025
d29077a
[Doc] Update V1 status for decoder-only embedding models (#19952)
Isotr0py Jun 23, 2025
fc38dc1
[doc] use MkDocs collapsible blocks - supplement (#19973)
reidliu41 Jun 23, 2025
b9694f4
fix correctness of FlashInfer attention on piecewise cudagraph using …
fhl2000 Jun 23, 2025
de50df0
Minor adjustment of capturing size
fhl2000 Jun 23, 2025
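
The commits above add full CUDA graph support for the FlashAttention 2 and FlashInfer attention backends. As a rough orientation for reviewers, the sketch below shows how a user might exercise the feature once merged. `full_cuda_graph=True` is an assumed flag name inferred from the PR title and is not confirmed by this diff; `CompilationConfig`, `CompilationLevel.PIECEWISE`, and `cudagraph_capture_sizes` do appear in the `docs/configuration/conserving_memory.md` diff further down.

```python
# Hedged sketch only: how full CUDA graph capture might be enabled end to end.
# `full_cuda_graph` is an assumed flag name inferred from the PR title;
# CompilationConfig, CompilationLevel, and cudagraph_capture_sizes appear in
# the docs/configuration/conserving_memory.md diff below.
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,
        full_cuda_graph=True,  # assumption: enables the full-graph path added here
        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
    ),
)
outputs = llm.generate(["The capital of France is"])
print(outputs[0].outputs[0].text)
```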
3 changes: 2 additions & 1 deletion .buildkite/scripts/hardware_ci/run-neuron-test.sh
@@ -54,10 +54,11 @@ docker run --rm -it --device=/dev/neuron0 --network bridge \
--name "${container_name}" \
${image_name} \
/bin/bash -c "
set -e; # Exit on first error
python3 /workspace/vllm/examples/offline_inference/neuron.py;
python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys;
for f in /workspace/vllm/tests/neuron/2_core/*.py; do
echo 'Running test file: '$f;
echo \"Running test file: \$f\";
python3 -m pytest \$f -v --capture=tee-sys;
done
"
9 changes: 9 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -271,6 +271,15 @@ steps:
commands:
- pytest -v -s prefix_caching


- label: Platform Tests (CUDA)
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/cuda
commands:
- pytest -v -s cuda/test_cuda_context.py

- label: Samplers Test # 36min
mirror_hardwares: [amdexperimental]
source_file_dependencies:
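
The new `Platform Tests (CUDA)` step above runs `tests/cuda/test_cuda_context.py`, which appears to relate to the "[Bugfix][Ray] Set the cuda context eagerly in the ray worker" commit in this branch. The sketch below is an illustration only (the actual fix lives in the Ray worker code and is not shown in this diff) of one way a worker process can eagerly establish its CUDA context:

```python
# Illustrative sketch, not the vLLM implementation: eagerly creating a CUDA
# context in a worker process so later kernels don't race lazy initialization.
import torch


def init_cuda_context(device_index: int = 0) -> None:
    if not torch.cuda.is_available():
        return
    torch.cuda.set_device(device_index)  # bind this process to the device
    torch.cuda.init()                    # force context creation up front
    assert torch.cuda.is_initialized()


if __name__ == "__main__":
    init_cuda_context(0)
```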
15 changes: 14 additions & 1 deletion .github/mergify.yml
@@ -45,6 +45,7 @@ pull_request_rules:
- files~=^vllm/entrypoints/openai/tool_parsers/llama.*\.py
- files~=^vllm/model_executor/models/.*llama.*\.py
- files~=^vllm/transformers_utils/configs/.*llama.*\.py
- title~=(?i)llama
actions:
label:
add:
@@ -65,6 +66,19 @@
add:
- multi-modality

- name: label-performance
description: Automatically apply performance label
conditions:
- or:
- files~=^benchmarks/
- files~=^vllm/benchmarks/
- files~=^tests/benchmarks/
- files~=^\.buildkite/nightly-benchmarks/
actions:
label:
add:
- performance

- name: label-qwen
description: Automatically apply qwen label
conditions:
@@ -74,7 +88,6 @@
- files~=^vllm/model_executor/models/.*qwen.*\.py
- files~=^vllm/reasoning/.*qwen.*\.py
- title~=(?i)Qwen
- body~=(?i)Qwen
actions:
label:
add:
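
The `title~=(?i)llama` condition added above (like the existing `title~=(?i)Qwen` one) is a case-insensitive regular-expression match against the pull-request title. The Python snippet below is purely illustrative of that matching behaviour and is not part of the PR:

```python
# Illustration of Mergify's `title~=(?i)llama` semantics using Python's re
# module: `(?i)` makes the match case-insensitive, and the pattern matches
# anywhere in the title.
import re

titles = [
    "[Model] Add Llama 4 support",
    "[Bugfix] fix LLAMA rope scaling",
    "[Docs] update quickstart guide",
]
pattern = re.compile(r"(?i)llama")
for title in titles:
    print(f"{title!r} -> {bool(pattern.search(title))}")
```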
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -115,6 +115,11 @@ repos:
entry: python tools/check_spdx_header.py
language: python
types: [python]
- id: check-root-lazy-imports
name: Check root lazy imports
entry: python tools/check_init_lazy_imports.py
language: python
types: [python]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
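
The new `check-root-lazy-imports` hook above calls `tools/check_init_lazy_imports.py`, whose contents are not part of this diff. Purely as a guess at the intent (keeping heavy imports out of vLLM's root `__init__.py` so CLI startup stays fast, compare the "[Perf][CLI] Improve overall startup time" commit above), a heavily simplified sketch of such a check might look like this; module names and behaviour are assumptions:

```python
# Assumed, simplified sketch of a "root lazy imports" check. The real
# tools/check_init_lazy_imports.py is not shown in this diff; module names
# and behaviour here are guesses for illustration only.
import ast
import sys

HEAVY_MODULES = {"torch", "transformers"}  # assumption: packages that should stay lazy


def eager_heavy_imports(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    problems = []
    # Only module-level statements count; imports inside functions remain lazy.
    for node in tree.body:
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        problems += [
            f"{path}:{node.lineno}: eager import of {root}"
            for root in roots
            if root in HEAVY_MODULES
        ]
    return problems


if __name__ == "__main__":
    issues = eager_heavy_imports("vllm/__init__.py")
    print("\n".join(issues) or "no eager heavy imports found")
    sys.exit(1 if issues else 0)
```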
4 changes: 2 additions & 2 deletions CMakeLists.txt
@@ -308,7 +308,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
# 9.0 for latest bf16 atomicAdd PTX
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.7;9.0+PTX" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.7;8.9;9.0+PTX" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)

#
@@ -684,7 +684,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")

list(APPEND VLLM_MOE_EXT_SRC "${VLLM_MOE_WNA16_SRC}")
# 9.0 for latest bf16 atomicAdd PTX
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.7;9.0+PTX" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.7;8.9;9.0+PTX" "${CUDA_ARCHS}")
if (MARLIN_MOE_ARCHS)

#
2 changes: 2 additions & 0 deletions README.md
@@ -154,11 +154,13 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
<!-- --8<-- [end:contact-us] -->

## Media Kit

8 changes: 7 additions & 1 deletion benchmarks/backend_request_func.py
@@ -404,8 +404,14 @@ async def async_request_openai_chat_completions(
chunk_bytes = chunk_bytes.strip()
if not chunk_bytes:
continue
chunk_bytes = chunk_bytes.decode("utf-8")
# NOTE: SSE comments (often used as pings) start with a colon.
# These are not JSON data payload and should be skipped.
if chunk_bytes.startswith(":"):
continue

chunk = chunk_bytes.removeprefix("data: ")

chunk = chunk_bytes.decode("utf-8").removeprefix("data: ")
if chunk != "[DONE]":
timestamp = time.perf_counter()
data = json.loads(chunk)
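
The hunk above makes the chat-completions benchmark client skip Server-Sent Events comment lines, which begin with a colon and are commonly emitted as keep-alive pings, before stripping the `data: ` prefix. Below is a standalone, hedged re-creation of that filtering logic; the example chunks are fabricated for illustration:

```python
# Standalone illustration of the SSE filtering added above; the example chunks
# are fabricated, but the skip/strip logic mirrors the diff.
import json

chunks = [
    b": ping - 2025-06-22 12:00:00",  # SSE comment used as a keep-alive ping
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    b"data: [DONE]",
]

for chunk_bytes in chunks:
    chunk_bytes = chunk_bytes.strip()
    if not chunk_bytes:
        continue
    chunk_bytes = chunk_bytes.decode("utf-8")
    # SSE comments (often used as pings) start with a colon and carry no JSON.
    if chunk_bytes.startswith(":"):
        continue
    chunk = chunk_bytes.removeprefix("data: ")
    if chunk != "[DONE]":
        data = json.loads(chunk)
        print(data["choices"][0]["delta"]["content"])
```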
2 changes: 1 addition & 1 deletion benchmarks/benchmark_throughput.py
@@ -97,7 +97,7 @@ def run_vllm(
assert lora_requests is None, "BeamSearch API does not support LoRA"
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests.
output_len = requests[0][2]
output_len = requests[0].expected_output_len
for request in requests:
assert request.expected_output_len == output_len
start = time.perf_counter()
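
The one-line change above replaces positional indexing (`requests[0][2]`) with attribute access, which is exactly the `Value of type "SampleRequest" is not indexable` error named in the commit message. A simplified sketch of why the old code failed; the field list is an assumption, only `expected_output_len` is confirmed by the diff:

```python
# Simplified reproduction of the bug fixed above. The fields other than
# expected_output_len are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class SampleRequest:
    prompt: str
    prompt_len: int
    expected_output_len: int


requests = [SampleRequest(prompt="Hello", prompt_len=1, expected_output_len=128)]

print(requests[0].expected_output_len)  # works: 128

try:
    requests[0][2]  # the old code path
except TypeError as exc:
    print(f"not indexable: {exc}")
```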
6 changes: 3 additions & 3 deletions docs/ci/update_pytorch_version.md
@@ -91,7 +91,7 @@ source to unblock the update process.
### FlashInfer
Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):

```
```bash
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
export FLASHINFER_ENABLE_SM90=1
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
Expand All @@ -105,14 +105,14 @@ team if you want to get the package published there.
### xFormers
Similar to FlashInfer, here is how to build and install xFormers from source:

```
```bash
export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
```

### Mamba

```
```bash
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
```

49 changes: 22 additions & 27 deletions docs/cli/README.md
@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}

Start the vLLM OpenAI Compatible API server.

Examples:
??? Examples

```bash
# Start with a model
vllm serve meta-llama/Llama-2-7b-hf
```bash
# Start with a model
vllm serve meta-llama/Llama-2-7b-hf

# Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100
# Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100

# Check with --help for more options
# To list all groups
vllm serve --help=listgroup
# Check with --help for more options
# To list all groups
vllm serve --help=listgroup

# To view a argument group
vllm serve --help=ModelConfig
# To view a argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs
# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max
```
# To search by keyword
vllm serve --help=max
```

## chat

Generate chat completions via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"

Generate text completions based on the given prompt via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm complete --quick "The future of AI is"
```

</details>

## bench

Run benchmark tests for latency online serving throughput and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}

Benchmark the latency of a single batch of requests.

Example:

```bash
vllm bench latency \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \

Benchmark the online serving throughput.

Example:

```bash
vllm bench serve \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \

Benchmark offline inference throughput.

Example:

```bash
vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env

Run batch prompts and write results to file.

Examples:
<details>
<summary>Examples</summary>

```bash
# Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
--model meta-llama/Meta-Llama-3-8B-Instruct
```

</details>

## More Help

For detailed options of any subcommand, use:
6 changes: 6 additions & 0 deletions docs/community/contact_us.md
@@ -0,0 +1,6 @@
---
title: Contact Us
---
[](){ #contactus }

--8<-- "README.md:contact-us"
58 changes: 31 additions & 27 deletions docs/configuration/conserving_memory.md
@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```
??? Code

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```

You can disable graph capturing completely via the `enforce_eager` flag:

@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

```python
from vllm import LLM
??? Code

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28
})

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12
})
```
```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28
})

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12
})
```
8 changes: 5 additions & 3 deletions docs/configuration/env_vars.md
@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

```python
--8<-- "vllm/envs.py:env-vars-definition"
```
??? Code

```python
--8<-- "vllm/envs.py:env-vars-definition"
```
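
The section above warns that a Kubernetes Service named `vllm` causes Kubernetes to inject environment variables that collide with vLLM's `VLLM_` prefix. As a small illustrative aid (not part of the diff), one way to audit the current environment for such collisions:

```python
# Illustration only: list every VLLM_-prefixed environment variable so that
# Kubernetes-injected entries (e.g. VLLM_PORT or VLLM_SERVICE_HOST when a
# Service is named "vllm") can be spotted alongside genuine vLLM settings.
import os

for name in sorted(os.environ):
    if name.startswith("VLLM_"):
        print(f"{name}={os.environ[name]}")
```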