
Commit cf7058b

KuntaiDu authored and garg-amit committed
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (vllm-project#7412)
Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
1 parent 7b81659 commit cf7058b

18 files changed (+1152, -1276 lines)
@@ -0,0 +1,28 @@
## Description

This file contains the download links for the benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts provided in the post.
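For example, after downloading `results.zip` from the link above, a minimal sketch of unpacking it; the layout inside the archive is not described here, so treat the listing as exploratory:

```
# Unpack the downloaded results artifact and see what it contains.
unzip results.zip -d nightly_results
ls nightly_results
```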
## Results reproduction

- Find the Docker image we use in the `benchmarking pipeline` artifact.
- Deploy the Docker container (a hedged `docker run` sketch follows this section), and inside the container:
  - Download `nightly-benchmarks.zip`.
  - In the same folder, run the following commands:
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.
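As referenced above, a minimal sketch of the deployment step, assuming the vLLM image listed in the pipeline and a host with the NVIDIA Container Toolkit installed; the tag, mounts, and entrypoint override are illustrative, not the pipeline's exact invocation:

```
# Pull one of the images listed in the benchmarking pipeline (tag is an example).
docker pull vllm/vllm-openai:v0.6.2
# Open an interactive shell with all GPUs and a shared Hugging Face cache;
# --entrypoint overrides the image's default server entrypoint.
docker run -it --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint /bin/bash \
  vllm/vllm-openai:v0.6.2
# Inside the container, download nightly-benchmarks.zip and run the commands above.
```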
@@ -1,45 +1,39 @@

# Nightly benchmark

-The main goal of this benchmarking is two-fold:
-- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
-- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().
-
-## Docker images
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
-- vllm/vllm-openai:v0.5.0.post1
-- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
-- openmmlab/lmdeploy:v0.5.0
-- ghcr.io/huggingface/text-generation-inference:2.1
-
-<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->
-
-## Hardware
-
-One AWS node with 8x NVIDIA A100 GPUs.
-
-## Workload description
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
-
-- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
-- Output length: the corresponding output length of these 500 prompts.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
-- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
-
-<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
-
-## Plots
-
-In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
-
-<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >
-
-## Results
-
-{nightly_results_benchmarking_table}
+This benchmark aims to:
+- Provide performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or SGLang) leads in performance for a given workload.
+- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker image by following the reproduction instructions.
+
+Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html) (scroll to the end).
+
+Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
+
+## Setup
+
+- Docker images:
+  - vLLM: `vllm/vllm-openai:v0.6.2`
+  - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
+  - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
+  - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
+    - *NOTE: we use r24.07 because the current implementation only works with this version. We are going to bump this up.*
+  - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
+- Hardware:
+  - 8x Nvidia A100 GPUs
+- Workload:
+  - Dataset:
+    - ShareGPT dataset
+    - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
+    - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
+    - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of the datasets we use.
+  - Models: llama-3 8B, llama-3 70B.
+    - We do not use llama 3.1 because it is incompatible with trt-llm r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+  - Average QPS (queries per second): 2, 4, 8, 16, 32 and inf.
+    - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, all with a fixed random seed (a hedged single-QPS sketch follows this diff).
+  - Evaluation metrics: Throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
+
+# Known issues
+
+- TRT-LLM crashes with Llama 3.1 8B ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+- TGI does not support the `ignore-eos` flag.
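As referenced in the QPS bullet above, here is a hedged sketch of exercising a single QPS point by hand with vLLM's `benchmark_serving.py` against an already-running server. The flag names follow that script but should be checked against the checked-out version and treated as assumptions; `--request-rate` is what drives the Poisson-distributed arrival pattern:

```
# Assumes an OpenAI-compatible server is already listening on localhost:8000
# and the ShareGPT JSON file has been downloaded; model, path and QPS value
# are illustrative only.
python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3-8B \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --request-rate 4 \
  --seed 0
```

The nightly pipeline itself drives these sweeps through `run-nightly-benchmarks.sh` rather than manual invocations.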

.buildkite/nightly-benchmarks/nightly-pipeline.yaml

+87 -11
@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec

common_container_settings: &common_container_settings
  command:
-  - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
+  - bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
  resources:
    limits:
      nvidia.com/gpu: 8

@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings

steps:
  - block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
-  - label: "A100 trt benchmark"
+
+
+
+  - label: "A100 vllm step 10"
    priority: 100
    agents:
      queue: A100

@@ -46,7 +49,21 @@ steps:
        podSpec:
          <<: *common_pod_spec
        containers:
-        - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
+        - image: vllm/vllm-openai:v0.6.2
+          <<: *common_container_settings
+
+
+
+  - label: "A100 sglang benchmark"
+    priority: 100
+    agents:
+      queue: A100
+    plugins:
+    - kubernetes:
+        podSpec:
+          <<: *common_pod_spec
+        containers:
+        - image: lmsysorg/sglang:v0.3.2-cu121
          <<: *common_container_settings

  - label: "A100 lmdeploy benchmark"

@@ -58,11 +75,13 @@ steps:
        podSpec:
          <<: *common_pod_spec
        containers:
-        - image: openmmlab/lmdeploy:v0.5.0
+        - image: openmmlab/lmdeploy:v0.6.1-cu12
          <<: *common_container_settings
-

-  - label: "A100 vllm benchmark"
+
+
+
+  - label: "A100 trt llama-8B"
    priority: 100
    agents:
      queue: A100

@@ -71,10 +90,25 @@ steps:
        podSpec:
          <<: *common_pod_spec
        containers:
-        - image: vllm/vllm-openai:latest
+        - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
          <<: *common_container_settings
+          env:
+          - name: VLLM_USAGE_SOURCE
+            value: ci-test
+          - name: HF_HOME
+            value: /root/.cache/huggingface
+          - name: VLLM_SOURCE_CODE_LOC
+            value: /workspace/build/buildkite/vllm/performance-benchmark
+          - name: HF_TOKEN
+            valueFrom:
+              secretKeyRef:
+                name: hf-token-secret
+                key: token
+          - name: TEST_SELECTOR
+            value: "llama8B"

-  - label: "A100 tgi benchmark"
+
+  - label: "A100 trt llama-70B"
    priority: 100
    agents:
      queue: A100

@@ -83,12 +117,54 @@ steps:
        podSpec:
          <<: *common_pod_spec
        containers:
-        - image: ghcr.io/huggingface/text-generation-inference:2.1
+        - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
          <<: *common_container_settings
+          env:
+          - name: VLLM_USAGE_SOURCE
+            value: ci-test
+          - name: HF_HOME
+            value: /root/.cache/huggingface
+          - name: VLLM_SOURCE_CODE_LOC
+            value: /workspace/build/buildkite/vllm/performance-benchmark
+          - name: HF_TOKEN
+            valueFrom:
+              secretKeyRef:
+                name: hf-token-secret
+                key: token
+          - name: TEST_SELECTOR
+            value: "llama70B"
+
+
+  # FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
+  # - label: "A100 trt benchmark"
+  #   priority: 100
+  #   agents:
+  #     queue: A100
+  #   plugins:
+  #   - kubernetes:
+  #       podSpec:
+  #         <<: *common_pod_spec
+  #       containers:
+  #       - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
+  #         <<: *common_container_settings
+
+
+  # FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
+  # - label: "A100 tgi benchmark"
+  #   priority: 100
+  #   agents:
+  #     queue: A100
+  #   plugins:
+  #   - kubernetes:
+  #       podSpec:
+  #         <<: *common_pod_spec
+  #       containers:
+  #       - image: ghcr.io/huggingface/text-generation-inference:2.2.0
+  #         <<: *common_container_settings

  - wait

-  - label: "Plot"
+  - label: "Collect the results"
    priority: 100
    agents:
      queue: A100

@@ -117,4 +193,4 @@ steps:
                name: hf-token-secret
                key: token

-  - wait
+  - block: ":rocket: check the results!"
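Note that the `HF_TOKEN` environment variable in the new TRT steps is read from a Kubernetes secret via `secretKeyRef` (name `hf-token-secret`, key `token`). A minimal sketch of creating such a secret, assuming `kubectl` access to the cluster and namespace that runs these agents; the namespace and token value are placeholders:

```
# Create the secret the pipeline's secretKeyRef expects; values are placeholders.
kubectl create secret generic hf-token-secret \
  --from-literal=token=<your HF token> \
  --namespace <benchmark-namespace>
```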

.buildkite/nightly-benchmarks/run-nightly-suite.sh

-76
This file was deleted.
