This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 06 23 #329

Merged
merged 119 commits into from
Jun 26, 2024
Commits
119 commits
5d52fa5
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs…
KuntaiDu Jun 14, 2024
cab4a5d
[CI/Build] Disable LLaVA-NeXT CPU test (#5529)
DarkLight1337 Jun 14, 2024
923d05a
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516)
tlrmchlsmth Jun 14, 2024
34467ee
[Misc] Fix arg names (#5524)
AllenDou Jun 14, 2024
deee747
[ Misc ] Rs/compressed tensors cleanup (#5432)
robertgshaw2-neuralmagic Jun 14, 2024
0ccb117
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401)
tlrmchlsmth Jun 14, 2024
4464401
[mis] fix flaky test of test_cuda_device_count_stateless (#5546)
youkaichao Jun 14, 2024
28d0d6d
[Core] Remove duplicate processing in async engine (#5525)
DarkLight1337 Jun 14, 2024
f0e02ac
[misc][distributed] fix benign error in `is_in_the_same_node` (#5512)
youkaichao Jun 14, 2024
d0a3026
[Docs] Add ZhenFund as a Sponsor (#5548)
simon-mo Jun 14, 2024
33edc9b
[Doc] Update documentation on Tensorizer (#5471)
sangstar Jun 14, 2024
5fffeb8
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460)
tdoublep Jun 14, 2024
65419f4
[Bugfix] Fix typo in Pallas backend (#5558)
WoosukKwon Jun 14, 2024
dfd2b2e
[Core][Distributed] improve p2p cache generation (#5528)
youkaichao Jun 14, 2024
d464106
Add ccache to amd (#5555)
simon-mo Jun 15, 2024
80b908f
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
leiwen83 Jun 15, 2024
0393d45
[mypy] Enable type checking for test directory (#5017)
DarkLight1337 Jun 15, 2024
32d5ecc
[CI/Build] Test both text and token IDs in batched OpenAI Completions…
DarkLight1337 Jun 15, 2024
6f3169a
[misc] Do not allow to use lora with chunked prefill. (#5538)
rkooo567 Jun 15, 2024
beb3b21
add gptq_marlin test for bug report https://github.com/vllm-project/v…
alexm-neuralmagic Jun 15, 2024
31f38f3
[BugFix] Don't start a Ray cluster when not using Ray (#5570)
njhill Jun 15, 2024
ec68cd1
[Fix] Correct OpenAI batch response format (#5554)
zifeitong Jun 15, 2024
dc8789d
Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518)
Yard1 Jun 16, 2024
681de21
[CI][BugFix] Flip is_quant_method_supported condition (#5577)
mgoin Jun 16, 2024
77a5f36
[build][misc] limit numpy version (#5582)
youkaichao Jun 16, 2024
9c77244
[Doc] add debugging tips for crash and multi-node debugging (#5581)
youkaichao Jun 17, 2024
f968328
Fix w8a8 benchmark and add Llama-3-8B (#5562)
comaniac Jun 17, 2024
b0abad9
[Model] Rename Phi3 rope scaling type (#5595)
garg-amit Jun 17, 2024
4b84959
Correct alignment in the seq_len diagram. (#5592)
CharlesRiggins Jun 17, 2024
9cfb1d7
[Kernel] `compressed-tensors` marlin 24 support (#5435)
dsikka Jun 17, 2024
61f421b
[Misc] use AutoTokenizer for benchmark serving when vLLM not installe…
zhyncs Jun 17, 2024
dceff94
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
jikunshang Jun 17, 2024
e830048
[CI/BUILD] Support non-AVX512 vLLM building and testing (#5574)
DamonFool Jun 17, 2024
a212392
[CI] the readability of benchmarking and prepare for dashboard (#5571)
KuntaiDu Jun 17, 2024
bc2be04
[bugfix][distributed] fix 16 gpus local rank arrangement (#5604)
youkaichao Jun 17, 2024
5eb3526
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584)
youkaichao Jun 17, 2024
17fd0ba
[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142)
bfontain Jun 17, 2024
7a58e54
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606)
zifeitong Jun 17, 2024
dbf0e91
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of…
sroy745 Jun 18, 2024
18c566f
[Model] Initialize Phi-3-vision support (#4986)
Isotr0py Jun 18, 2024
69fa6ed
[Kernel] Add punica dimensions for Granite 13b (#5559)
joerunde Jun 18, 2024
1b39fc2
[misc][typo] fix typo (#5620)
youkaichao Jun 18, 2024
f691b45
[Misc] Fix typo (#5618)
DarkLight1337 Jun 18, 2024
5abb0c8
[CI] Avoid naming different metrics with the same name in performance…
KuntaiDu Jun 18, 2024
f355997
[bugfix][distributed] improve p2p capability test (#5612)
youkaichao Jun 18, 2024
1343cd0
[Misc] Remove import from transformers logging (#5625)
CatherineSue Jun 18, 2024
021cfdb
[CI/Build][Misc] Update Pytest Marker for VLMs (#5623)
ywang96 Jun 18, 2024
70baf49
[ci] Deprecate original CI template (#5624)
khluu Jun 18, 2024
be2f123
[Misc] Add OpenTelemetry support (#4687)
ronensc Jun 18, 2024
0008715
[Misc] Add channel-wise quantization support for w8a8 dynamic per tok…
dsikka Jun 18, 2024
14a7620
[ci] Setup Release pipeline and build release wheels with cache (#5610)
khluu Jun 18, 2024
50c2ca9
[Model] LoRA support added for command-r (#5178)
sergey-tinkoff Jun 18, 2024
3d24777
[Bugfix] Fix for inconsistent behaviour related to sampling and repet…
tdoublep Jun 18, 2024
c5ef2f9
[Doc] Added cerebrium as Integration option (#5553)
milo157 Jun 18, 2024
010f2e8
[Bugfix] Fix CUDA version check for mma warning suppression (#5642)
tlrmchlsmth Jun 18, 2024
a8b75a4
[Bugfix] Fix w8a8 benchmarks for int8 case (#5643)
tlrmchlsmth Jun 19, 2024
a0d8ed2
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628)
ShukantPal Jun 19, 2024
e4d2b6e
[Bugfix] Added test for sampling repetition penalty bug. (#5659)
tdoublep Jun 19, 2024
cb46cfe
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate…
hongxiayang Jun 19, 2024
8f72d50
[misc][distributed] use 127.0.0.1 for single-node (#5619)
youkaichao Jun 19, 2024
b081ff9
[Model] Add FP8 kv cache for Qwen2 (#5656)
mgoin Jun 19, 2024
784aa72
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684)
Isotr0py Jun 19, 2024
a799171
[Misc]Add param max-model-len in benchmark_latency.py (#5629)
DearPlanet Jun 19, 2024
436aaf9
[CI/Build] Add tqdm to dependencies (#5680)
DarkLight1337 Jun 19, 2024
d33025c
[ci] Add A100 queue into AWS CI template (#5648)
khluu Jun 19, 2024
cf5889f
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg…
mgoin Jun 19, 2024
88396ae
[ci][distributed] add tests for custom allreduce (#5689)
youkaichao Jun 19, 2024
8ff473a
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)
zifeitong Jun 19, 2024
4f4cea6
[Doc] Update docker references (#5614)
rafvasq Jun 19, 2024
0e8e31e
[Misc] Add per channel support for static activation quantization; up…
dsikka Jun 19, 2024
1ccd388
[ci] Limit num gpus if specified for A100 (#5694)
khluu Jun 19, 2024
330aa1b
[Misc] Improve conftest (#5681)
DarkLight1337 Jun 20, 2024
df3ae01
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703)
ywang96 Jun 20, 2024
7d85753
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
varun-sundar-rabindranath Jun 20, 2024
b6ec1d5
[Model] Port over CLIPVisionModel for VLMs (#5591)
ywang96 Jun 20, 2024
db7892d
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
varun-sundar-rabindranath Jun 20, 2024
51dfab0
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS ke…
tlrmchlsmth Jun 20, 2024
c477239
[Frontend] Add FlexibleArgumentParser to support both underscore and …
mgoin Jun 20, 2024
5ccb86c
[distributed][misc] use fork by default for mp (#5669)
youkaichao Jun 21, 2024
b05443a
[Model] MLPSpeculator speculative decoding support (#4947)
JRosenkranz Jun 21, 2024
1996acf
[Kernel] Add punica dimension for Qwen2 LoRA (#5441)
jinzhen-lin Jun 21, 2024
1699d33
[BugFix] Fix test_phi3v.py (#5725)
CatherineSue Jun 21, 2024
e4f1a4e
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
jeejeelee Jun 21, 2024
3e3c8d9
[Core][Distributed] add shm broadcast (#5399)
youkaichao Jun 21, 2024
01369a0
[Kernel][CPU] Add Quick `gelu` to CPU (#5717)
ywang96 Jun 21, 2024
733cf30
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
07cd29d
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
2e2140f
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
0bec3f6
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
3595200
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
1a6c6dd
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
a7dccd6
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
960a022
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
dc211cd
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
860a1d6
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
d7f0ece
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
e484da4
remove vllm-runnner-nm
robertgshaw2-neuralmagic Jun 23, 2024
683f309
formatted
robertgshaw2-neuralmagic Jun 23, 2024
3b3a92c
fix is_xpu
robertgshaw2-neuralmagic Jun 24, 2024
e0c0530
fix lm eval
robertgshaw2-neuralmagic Jun 24, 2024
01d4f34
fix format
robertgshaw2-neuralmagic Jun 24, 2024
616fce8
Merge branch 'main' into upstream-sync-2024-06-23
robertgshaw2-neuralmagic Jun 24, 2024
3c5a7f5
remove flaky gptq models
robertgshaw2-neuralmagic Jun 24, 2024
71f60a8
fix import error
robertgshaw2-neuralmagic Jun 24, 2024
e960ebb
format
robertgshaw2-neuralmagic Jun 24, 2024
3297247
fix lm-eval
robertgshaw2-neuralmagic Jun 24, 2024
dcdf4da
Merge branch 'main' into upstream-sync-2024-06-23
robertgshaw2-neuralmagic Jun 24, 2024
0dd1848
format
robertgshaw2-neuralmagic Jun 24, 2024
de06faa
Merge branch 'upstream-sync-2024-06-23' of https://github.com/neuralm…
robertgshaw2-neuralmagic Jun 24, 2024
cdf52bf
fix test tracing
robertgshaw2-neuralmagic Jun 25, 2024
f2d2794
Merge branch 'main' into upstream-sync-2024-06-23
robertgshaw2-neuralmagic Jun 25, 2024
431054d
fixed test tracing deps
robertgshaw2-neuralmagic Jun 25, 2024
9d7b7b5
added skipping logic
robertgshaw2-neuralmagic Jun 25, 2024
3a75e15
handled skipping
robertgshaw2-neuralmagic Jun 25, 2024
c9d1b9e
updated to run only on solo
robertgshaw2-neuralmagic Jun 25, 2024
973d9d0
format
robertgshaw2-neuralmagic Jun 25, 2024
c44802e
clean up
robertgshaw2-neuralmagic Jun 25, 2024
9c15fe1
cleanup
robertgshaw2-neuralmagic Jun 25, 2024
727077f
format
robertgshaw2-neuralmagic Jun 25, 2024
103 changes: 103 additions & 0 deletions .buildkite/nightly-benchmarks/README.md
@@ -0,0 +1,103 @@
# vLLM benchmark suite

## Introduction

This directory contains the performance benchmarking CI for vLLM.
The goal is to help developers understand the impact of their PRs on vLLM's performance.

This benchmark is *triggered* upon:
- A PR being merged into vllm.
- Every commit on PRs that carry the `perf-benchmarks` label.

**Benchmarking Coverage**: latency, throughput, and fixed-QPS serving on A100 (support for more GPUs is coming later), with different models.

**Benchmarking Duration**: about 1hr.

**For benchmarking developers**: please try to keep the benchmarking duration under 1.5 hours so that the job does not take too long to run.


## Configuring the workload

The benchmarking workload contains three parts:
- Latency tests in `latency-tests.json`.
- Throughput tests in `throughput-tests.json`.
- Serving tests in `serving-tests.json`.

See [descriptions.md](tests/descriptions.md) for detailed descriptions.

### Latency test

Here is an example of one test inside `latency-tests.json`:

```json
[
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
},
]
```

In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments used for `benchmark_latency.py`. Please use an underscore `_` instead of a dash `-` when specifying the arguments; `run-benchmarks-suite.sh` converts the underscores to dashes when feeding the arguments to `benchmark_latency.py` (see the sketch below). For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
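
As a rough illustration (this is a sketch, not the actual implementation inside `run-benchmarks-suite.sh`), the underscore-to-dash conversion can be expressed with `jq`:

```bash
# Sketch only: turn the "parameters" object of a test entry into CLI flags,
# mapping underscores in the key names to dashes.
test_entry='{"parameters": {"model": "meta-llama/Meta-Llama-3-8B", "tensor_parallel_size": 1, "num_iters": 15}}'
args=$(echo "$test_entry" | jq -r '.parameters | to_entries
  | map("--\(.key | gsub("_"; "-")) \(.value)") | join(" ")')
echo "python3 benchmark_latency.py $args"
# -> python3 benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --num-iters 15
```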

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script saves the JSON results by itself, so please do not set the `--output-json` parameter in the JSON file.


### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed to `benchmark_throughput.py` instead.
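
For illustration, an entry in `throughput-tests.json` could look like the sketch below; the field names follow the same convention as the latency test and are assumptions here, not copied from the shipped file:

```json
[
    {
        "test_name": "throughput_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200,
            "backend": "vllm"
        }
    }
]
```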

The numbers produced by this test are also quite stable, so even a slight change in the reported number can indicate a meaningful change in performance.

### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
{
"test_name": "serving_llama8B_tp1_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"server_parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
]
```

Inside this example:
- The `test_name` attribute is again a unique identifier for the test. It must start with `serving_`.
- The `server_parameters` attribute contains the command line arguments for the vLLM server.
- The `client_parameters` attribute contains the command line arguments for `benchmark_serving.py`.
- The `qps_list` attribute controls the list of QPS values to test. Each value is used to set the `--request-rate` parameter of `benchmark_serving.py` (a rough sketch of the resulting client invocations is shown below).
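
As a rough, illustrative sketch (not the actual logic of `run-benchmarks-suite.sh`), the client side of this test amounts to something like the following loop, with the server already launched using `server_parameters`:

```bash
# Illustrative only: one benchmark_serving.py run per entry in qps_list.
# Flag values mirror the client_parameters in the example above.
for qps in 1 4 16 inf; do
  python3 benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate "$qps"
done
```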

The numbers from this test are less stable than the latency and throughput benchmarks (due to the randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in these numbers (e.g. 5%) still indicates a meaningful difference in performance.

WARNING: The benchmarking script saves the JSON results by itself, so please do not set `--save-results` or other result-saving parameters in `serving-tests.json`.

## Visualizing the results
The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table by filling [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the resulting table on the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
A JSON version of the table (together with the JSON version of the benchmark results) is also attached to the markdown file.
The raw benchmarking results (as JSON files) are available in the `Artifacts` tab of the benchmarking job.
62 changes: 62 additions & 0 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -0,0 +1,62 @@
steps:
  - label: "Wait for container to be ready"
    agents:
      queue: A100
    plugins:
    - kubernetes:
        podSpec:
          containers:
          - image: badouralix/curl-jq
            command:
            - sh
            - .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
  - wait
  - label: "A100 Benchmark"
    agents:
      queue: A100
    plugins:
    - kubernetes:
        podSpec:
          priorityClassName: perf-benchmark
          containers:
          - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
            command:
            - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
            resources:
              limits:
                nvidia.com/gpu: 8
            volumeMounts:
            - name: devshm
              mountPath: /dev/shm
            env:
            - name: VLLM_USAGE_SOURCE
              value: ci-test
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
          volumes:
          - name: devshm
            emptyDir:
              medium: Memory
  # - label: "H100: NVIDIA SMI"
  #   agents:
  #     queue: H100
  #   plugins:
  #     - docker#v5.11.0:
  #         image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
  #         command:
  #         - bash
  #         - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
  #         mount-buildkite-agent: true
  #         propagate-environment: true
  #         propagate-uid-gid: false
  #         ipc: host
  #         gpus: all
  #         environment:
  #         - VLLM_USAGE_SOURCE
  #         - HF_TOKEN

3 changes: 2 additions & 1 deletion .buildkite/nightly-benchmarks/kickoff-pipeline.sh
@@ -1,5 +1,6 @@
#!/usr/bin/env bash

# NOTE(simon): this script runs inside a buildkite agent with CPU only access.
set -euo pipefail

# Install system packages
@@ -23,4 +24,4 @@ if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then
fi

# Upload sample.yaml
buildkite-agent pipeline upload .buildkite/nightly-benchmarks/sample.yaml
buildkite-agent pipeline upload .buildkite/nightly-benchmarks/benchmark-pipeline.yaml