
Commit fd11ac1

Merge remote-tracking branch 'upstream/main' into skip-lm-head
* upstream/main: (66 commits)
  [Bugfix] Fix PaliGemma MMP (vllm-project#6930)
  [TPU] Fix greedy decoding (vllm-project#6933)
  [Kernel] Tuned int8 kernels for Ada Lovelace (vllm-project#6848)
  [Kernel] Fix marlin divide-by-zero warnings (vllm-project#6904)
  [ci] GHA workflow to remove ready label upon "/notready" comment (vllm-project#6921)
  [Kernel] Remove unused variables in awq/gemm_kernels.cu (vllm-project#6908)
  [Frontend] New `allowed_token_ids` decoding request parameter (vllm-project#6753)
  [Bugfix] Allow vllm to still work if triton is not installed. (vllm-project#6786)
  [TPU] Support tensor parallelism in async llm engine (vllm-project#6891)
  [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (vllm-project#6901)
  [Core] Reduce unnecessary compute when logprobs=None (vllm-project#6532)
  [Kernel] Tuned FP8 Kernels for Ada Lovelace (vllm-project#6677)
  [Model] Initialize support for InternVL2 series models (vllm-project#6514)
  [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (vllm-project#6871)
  Add Nemotron to PP_SUPPORTED_MODELS (vllm-project#6863)
  [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (vllm-project#6795)
  [TPU] Reduce compilation time & Upgrade PyTorch XLA version (vllm-project#6856)
  [Docs] Add RunLLM chat widget (vllm-project#6857)
  [Model] Initial support for BLIP-2 (vllm-project#5920)
  [CI/Build][Doc] Update CI and Doc for VLM example changes (vllm-project#6860)
  ...
2 parents 84a7b46 + c66c7f8 commit fd11ac1


174 files changed, +7934 -1636 lines changed

.buildkite/lm-eval-harness/configs/Minitron-4B-Base.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nvidia/Minitron-4B-Base -b auto -l 1000 -f 5 -t 1
+model_name: "nvidia/Minitron-4B-Base"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.252
+  - name: "exact_match,flexible-extract"
+    value: 0.252
+limit: 1000
+num_fewshot: 5
.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
+model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.578
+  - name: "exact_match,flexible-extract"
+    value: 0.585
+limit: 1000
+num_fewshot: 5
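The comment at the top of each new config records the command that produced its baseline numbers. A hedged local reproduction for the Qwen2 FP8W8 config is sketched below; the flag meanings are inferred from the config fields (`-l` appears to map to `limit`, `-f` to `num_fewshot`) and `-t` is assumed to be the tensor-parallel size, so check the script itself for authoritative usage.

```bash
# Hedged sketch: reproduce the Qwen2-1.5B-Instruct-FP8W8 GSM8K baseline locally.
# Flag meanings are inferred, not documented in this diff:
#   -m model, -b batch size, -l sample limit, -f few-shot count, -t tensor parallel (assumed)
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
```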

.buildkite/lm-eval-harness/configs/models-small.txt

Lines changed: 2 additions & 0 deletions
@@ -4,4 +4,6 @@ Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
+Minitron-4B-Base.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
+Qwen2-1.5B-Instruct-FP8W8.yaml

.buildkite/nightly-benchmarks/README.md

Lines changed: 64 additions & 16 deletions
@@ -3,30 +3,51 @@
 
 ## Introduction
 
-This directory contains the performance benchmarking CI for vllm.
-The goal is to help developers know the impact of their PRs on the performance of vllm.
+This directory contains two sets of benchmarks for vllm:
+- Performance benchmark: benchmarks vllm's performance under various workloads, so that **developers** can see whether their PR improves or degrades vllm's performance.
+- Nightly benchmark: compares vllm's performance against alternatives (tgi, trt-llm and lmdeploy), so that **the public** knows when to choose vllm.
 
-This benchmark will be *triggered* upon:
+
+See the [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and the [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.
+
+## Performance benchmark quick overview
+
+**Benchmarking Coverage**: latency, throughput and fixed-qps serving on A100 (support for FP8 benchmarks on H100 is coming!), with different models.
+
+**Benchmarking Duration**: about 1 hr.
+
+**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hr so that it won't take forever to run.
+
+## Nightly benchmark quick overview
+
+**Benchmarking Coverage**: fixed-qps serving on A100 (support for FP8 benchmarks on H100 is coming!) for Llama-3 8B, 70B and Mixtral 8x7B.
+
+**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
+
+**Benchmarking Duration**: about 3.5 hrs.
+
+## Trigger the benchmark
+
+The performance benchmark will be triggered when:
 - A PR being merged into vllm.
 - Every commit for those PRs with `perf-benchmarks` label.
 
-**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for more GPUs is comming later), with different models.
+The nightly benchmark will be triggered when:
+- Every commit for those PRs with `nightly-benchmarks` label.
 
-**Benchmarking Duration**: about 1hr.
 
-**For benchmarking developers**: please try your best to constraint the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.
 
 
-## Configuring the workload
+## Performance benchmark details
 
-The benchmarking workload contains three parts:
-- Latency tests in `latency-tests.json`.
-- Throughput tests in `throughput-tests.json`.
-- Serving tests in `serving-tests.json`.
+See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json` and `tests/serving-tests.json` to configure the test cases.
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions.
 
-### Latency test
+#### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -54,12 +75,12 @@ Note that the performance numbers are highly sensitive to the value of the param
 WARNING: The benchmarking script will save json results by itself, so please do not configure the `--output-json` parameter in the json file.
 
 
-### Throughput test
+#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-### Serving test
+#### Serving test
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
 ```
@@ -96,9 +117,36 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-## Visualizing the results
+#### Visualizing the results
 The `convert-results-json-to-markdown.py` script helps you put the benchmarking results into a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finishes running.
 The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job.
+
+## Nightly test details
+
+See [nightly-descriptions.md](nightly-descriptions.md) for detailed descriptions of the test workload, models and docker containers used to benchmark other llm engines.
+
+#### Workflow
+
+- [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for the different LLM serving engines.
+- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which probes the serving engine of the current container.
+- `run-nightly-suite.sh` then dispatches to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
+- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to buildkite.
+
+#### Nightly tests
+
+In [nightly-tests.json](tests/nightly-tests.json), we include the command-line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to the performance benchmark.
+
+#### Docker containers
+
+The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
+
+WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded because there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
+
+WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
+
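The JSON files named above feed their parameters directly into the corresponding benchmark scripts. As a hedged illustration only, a latency-test entry might expand to a command like the following; the flags and values here are assumptions based on `benchmarks/benchmark_latency.py`'s common options, and the authoritative schema is in `tests/latency-tests.json`.

```bash
# Illustrative sketch (values are arbitrary): roughly the command a latency-test
# entry expands to once its parameters are forwarded to benchmark_latency.py.
python3 benchmarks/benchmark_latency.py \
  --model meta-llama/Meta-Llama-3-8B \
  --tensor-parallel-size 1 \
  --input-len 32 \
  --output-len 128 \
  --batch-size 8 \
  --num-iters 15
```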

.buildkite/run-cpu-test.sh

Lines changed: 21 additions & 9 deletions
@@ -3,26 +3,38 @@
 set -ex
 
 # Try building the docker image
-docker build -t cpu-test -f Dockerfile.cpu .
-docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
+numactl -C 48-95 -N 1 docker build -t cpu-test -f Dockerfile.cpu .
+numactl -C 48-95 -N 1 docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
 
 # Setup cleanup
 remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
 trap remove_docker_container EXIT
 remove_docker_container
 
-# Run the image
+# Run the image, setting --shm-size=4g for tensor parallel.
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
-  --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test
+  --cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
-  --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test-avx2 cpu-test-avx2
+  --cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2
 
 # offline inference
-docker exec cpu-test bash -c "python3 examples/offline_inference.py"
 docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
 
 # Run basic model test
-docker exec cpu-test bash -c "cd tests;
+docker exec cpu-test bash -c "
   pip install pytest Pillow protobuf
-  cd ../
-  pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py" # Mamba on CPU is not supported
+  pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B are not supported on CPU
+
+# online inference
+docker exec cpu-test bash -c "
+  export VLLM_CPU_KVCACHE_SPACE=10
+  export VLLM_CPU_OMP_THREADS_BIND=48-92
+  python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
+  timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
+  python3 benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --dataset-name random \
+    --model facebook/opt-125m \
+    --num-prompts 20 \
+    --endpoint /v1/completions \
+    --tokenizer facebook/opt-125m"
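Once the `api_server` started in the online-inference step reports healthy via `/v1/models`, the same OpenAI-compatible endpoint the benchmark targets can also be queried by hand; a minimal sketch (the prompt and `max_tokens` values are arbitrary):

```bash
# Minimal sketch: query the OpenAI-compatible completions endpoint started above.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'
```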

.buildkite/test-pipeline.yaml

Lines changed: 3 additions & 4 deletions
@@ -17,11 +17,10 @@ steps:
   - pytest -v -s test_utils.py # Utils
   - pytest -v -s worker # Worker
 
-- label: Tensorizer, Metrics, Tracing Test
+- label: Metrics, Tracing Test
   fast_check: true
   fast_check_only: true
   commands:
-  - apt-get install -y curl libsodium23 && pytest -v -s tensorizer_loader # Tensorizer
   - pytest -v -s metrics # Metrics
   - "pip install \
       opentelemetry-sdk \
@@ -141,14 +140,13 @@ steps:
   working_dir: "/vllm-workspace/examples"
   mirror_hardwares: [amd]
   commands:
-  # install aws cli for llava_example.py
   # install tensorizer for tensorize_vllm_model.py
   - pip install awscli tensorizer
   - python3 offline_inference.py
   - python3 cpu_offload.py
   - python3 offline_inference_with_prefix.py
   - python3 llm_engine_example.py
-  - python3 llava_example.py
+  - python3 offline_inference_vision_language.py
   - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
 
 - label: Inputs Test
@@ -221,6 +219,7 @@ steps:
 
 - label: Tensorizer Test
   #mirror_hardwares: [amd]
+  fast_check: true
   commands:
   - apt-get install -y curl libsodium23
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: Remove ready Label on notready Comment
+
+on:
+  issue_comment:
+    types: [created]
+
+jobs:
+  add-ready-label:
+    runs-on: ubuntu-latest
+    if: github.event.issue.pull_request && contains(github.event.comment.body, '/notready')
+    steps:
+    - name: Remove ready label
+      uses: actions/github-script@v5
+      with:
+        script: |
+          github.rest.issues.removeLabel({
+            owner: context.repo.owner,
+            repo: context.repo.repo,
+            issue_number: context.issue.number,
+            name: 'ready'
+          })
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
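In practice, this workflow fires whenever a PR comment contains `/notready` and strips the `ready` label. A hedged example of exercising it from the GitHub CLI (the PR number is a placeholder):

```bash
# Hedged example; 1234 is a placeholder PR number.
gh pr comment 1234 --body "/notready"   # triggers the workflow, which removes the "ready" label
gh pr edit 1234 --add-label ready       # restore the label once the PR is ready again
```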

.readthedocs.yaml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ build:
 
 sphinx:
   configuration: docs/source/conf.py
+  fail_on_warning: true
 
 # If using Sphinx, optionally build your docs in additional formats such as PDF
 formats:
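With `fail_on_warning: true`, Read the Docs aborts the documentation build on any Sphinx warning. A rough local equivalent (the output directory is arbitrary) is to pass `-W` to `sphinx-build`:

```bash
# Rough local equivalent: -W turns Sphinx warnings into errors, mirroring
# fail_on_warning on Read the Docs. The output path is arbitrary.
sphinx-build -W -b html docs/source docs/_build/html
```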

Dockerfile.cpu

Lines changed: 5 additions & 4 deletions
@@ -2,8 +2,8 @@
 
 FROM ubuntu:22.04 AS cpu-test-1
 
-RUN apt-get update -y \
-    && apt-get install -y git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 \
+RUN apt-get update -y \
+    && apt-get install -y curl git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
     && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 
 # https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html
@@ -13,8 +13,9 @@ RUN pip install intel-openmp
 
 ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/lib/libiomp5.so:$LD_PRELOAD"
 
+RUN echo 'ulimit -c 0' >> ~/.bashrc
 
-RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.3.100%2Bgit0eb3473-cp310-cp310-linux_x86_64.whl
+RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.4.0%2Bgitfbaa4bc-cp310-cp310-linux_x86_64.whl
 
 RUN pip install --upgrade pip \
     && pip install wheel packaging ninja "setuptools>=49.4.0" numpy
@@ -25,7 +26,7 @@ COPY ./ /workspace/vllm
 
 WORKDIR /workspace/vllm
 
-RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
+RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/test/cpu
 
 # Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ...
 ARG VLLM_CPU_DISABLE_AVX512
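The closing comment documents the `VLLM_CPU_DISABLE_AVX512` build argument; a hedged sketch of using it, with an arbitrary image tag and a simple import smoke test:

```bash
# Hedged sketch: build the CPU image without AVX512 kernels and smoke-test the install.
# The tag name is arbitrary; --entrypoint overrides whatever the image sets by default.
docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t vllm-cpu-avx2 -f Dockerfile.cpu .
docker run --rm --entrypoint python3 vllm-cpu-avx2 -c "import vllm; print(vllm.__version__)"
```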

Dockerfile.rocm

Lines changed: 3 additions & 10 deletions
@@ -53,10 +53,10 @@ RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(whic
 # Install torch == 2.5.0 on ROCm
 RUN case "$(ls /opt | grep -Po 'rocm-[0-9]\.[0-9]')" in \
         *"rocm-6.1"*) \
-            python3 -m pip uninstall -y torch torchaudio torchvision \
+            python3 -m pip uninstall -y torch torchvision \
             && python3 -m pip install --no-cache-dir --pre \
-                torch==2.5.0.dev20240710 torchaudio==2.4.0.dev20240710 \
-                torchvision==0.20.0.dev20240710 \
+                torch==2.5.0.dev20240726 \
+                torchvision==0.20.0.dev20240726 \
                 --index-url https://download.pytorch.org/whl/nightly/rocm6.1;; \
         *) ;; esac
 
@@ -127,13 +127,6 @@ FROM base AS final
 # Import the vLLM development directory from the build context
 COPY . .
 
-# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
-# Manually remove it so that later steps of numpy upgrade can continue
-RUN case "$(which python3)" in \
-        *"/opt/conda/envs/py_3.9"*) \
-            rm -rf /opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy-1.20.3.dist-info/;; \
-        *) ;; esac
-
 # Package upgrades for useful functionality or to avoid dependency issues
 RUN --mount=type=cache,target=/root/.cache/pip \
     python3 -m pip install --upgrade numba scipy huggingface-hub[cli]
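For reference, a hedged sketch of building this ROCm image from the repository root (the tag is arbitrary and BuildKit is assumed to be available):

```bash
# Hedged sketch: build the ROCm image. The tag name is arbitrary.
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
```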
