# Nightly benchmark

This benchmark aims to:

- Provide performance clarity: show which engine (vLLM, TensorRT-LLM, LMDeploy, or SGLang) leads in performance under which workload.
- Be reproducible: anyone can run the exact same set of benchmarking commands inside the exact same Docker images by following the reproduction instructions.

Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)

## Setup

- Docker images:
  - vLLM: `vllm/vllm-openai:v0.6.2`
  - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
  - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
  - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
    - *NOTE: we use r24.07 because the current implementation only works with this version. We plan to bump it up.*
  - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs, and commands we use for the benchmark.
- Hardware:
  - 8x NVIDIA A100 GPUs
- Workload:
  - Datasets:
    - ShareGPT dataset
    - Prefill-heavy dataset (on average, 462 input tokens and 16 output tokens)
    - Decode-heavy dataset (on average, 462 input tokens and 256 output tokens)
    - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete dataset configurations we use.
  - Models: Llama-3 8B and Llama-3 70B.
    - We do not use Llama 3.1 because it is incompatible with TensorRT-LLM r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
  - Average QPS (queries per second): 2, 4, 8, 16, 32, and inf.
    - Queries are randomly sampled, and arrival times are drawn from a Poisson process, both with a fixed random seed (see the sketch after this list).
  - Evaluation metrics: throughput (the higher the better), TTFT (time to first token, the lower the better), and ITL (inter-token latency, the lower the better); a sketch of how these are derived also follows below.
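
For a given QPS, the arrival schedule described above can be reproduced by drawing exponential inter-arrival times with a fixed seed. The sketch below is illustrative only (it is not the harness code; the function name and defaults are ours):

```python
import numpy as np

def sample_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Sample request arrival times (in seconds) from a Poisson process.

    Inter-arrival times of a Poisson process with rate `qps` are exponentially
    distributed with mean 1/qps. A fixed seed keeps the schedule identical
    across engines and across nightly runs.
    """
    rng = np.random.default_rng(seed)
    if np.isinf(qps):
        # "inf" QPS: send every request immediately (burst mode).
        return np.zeros(num_requests)
    inter_arrival = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(inter_arrival)

# Example: schedule 500 requests at an average of 4 QPS.
print(sample_arrival_times(500, qps=4.0)[:5])
```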
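
TTFT, ITL, and throughput are all derived from per-request token timestamps. The following is a rough, illustrative sketch of how such metrics can be computed; the `RequestTrace` and `summarize` helpers are hypothetical, not the benchmark's actual code:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    send_time: float          # wall-clock time the request was issued
    token_times: list[float]  # wall-clock time each output token arrived

def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Compute mean TTFT, mean ITL, and output-token throughput.

    Assumes at least one request, each producing at least one token.
    """
    ttfts, itls, total_tokens = [], [], 0
    for t in traces:
        # TTFT: delay between sending the request and receiving its first token.
        ttfts.append(t.token_times[0] - t.send_time)
        # ITL: gaps between consecutive output tokens of the same request.
        itls.extend(b - a for a, b in zip(t.token_times, t.token_times[1:]))
        total_tokens += len(t.token_times)
    duration = max(t.token_times[-1] for t in traces) - min(t.send_time for t in traces)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "mean_itl_s": sum(itls) / len(itls) if itls else 0.0,
        "output_throughput_tok_per_s": total_tokens / duration,
    }
```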

## Known issues

- TRT-LLM crashes with Llama 3.1 8B ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- TGI does not support the `ignore-eos` flag.