[Performance]: 5x slower throughput with OpenAI client/server than native one #7935

Closed
stas00 opened this issue Aug 28, 2024 · 21 comments
Labels
performance Performance-related issues

Comments

@stas00
Contributor

stas00 commented Aug 28, 2024

Proposal to improve performance

I've been trying to write a reliable benchmark to be used with vLLM, and I discovered that when I use the OpenAI client it can't scale. If I try to use 50 concurrent clients the GPU load goes down to 5% and the throughput is extremely slow. The more clients I add, the worse things get. With a single client there is no problem.

I then used the same benchmark switching to the vLLM native client/server and I'm getting 60-70% GPU utilization and 5x higher throughput.

I checked that I had the same SamplingParams reported by the server in both cases.

In parallel I was using https://github.com/grafana/k6 against both use cases - the OpenAI entrypoint and the native entrypoint - and I can confirm that the server isn't the problem: in both cases I get high GPU utilization and high throughput with the k6 client.

I thought that perhaps streaming was the cause but disabling it made a very small difference.

So everything points to the OpenAI client - I know that it's not your product, but you recommend using it with the OpenAI entrypoint:

"""Example Python client for vllm.entrypoints.api_server
NOTE: The API server is used only for demonstration and simple performance
benchmarks. It is not intended for production use.
For production use, we recommend vllm serve and the OpenAI client API.

So perhaps you have some insights into what I'm missing? I'm just using your examples as-is.

vllm==0.5.5 here

Thank you!

Your current environment (if you think it is necessary)

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1017-gcp-tcpx-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3

Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        52 bits physical, 57 bits virtual
CPU(s):                               208
On-line CPU(s) list:                  0-207
Thread(s) per core:                   2
Core(s) per socket:                   52
Socket(s):                            2
NUMA node(s):                         2
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                143
Model name:                           Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
Stepping:                             8
CPU MHz:                              2699.998
BogoMIPS:                             5399.99
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            4.9 MiB
L1i cache:                            3.3 MiB
L2 cache:                             208 MiB
L3 cache:                             210 MiB
NUMA node0 CPU(s):                    0-51,104-155
NUMA node1 CPU(s):                    52-103,156-207
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities

Versions of relevant libraries:
[pip3] flake8==7.1.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnxruntime==1.18.1
[pip3] qtorch==0.3.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] qtorch                    0.3.0                    pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.44.0                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    57-63,161-166   1               N/A
GPU1    NV18     X      57-63,161-166   1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

@stas00 stas00 added the performance Performance-related issues label Aug 28, 2024
@njhill
Member

njhill commented Aug 28, 2024

Thanks @stas00. There may be scalability issues with the client, but we are also aware of significant overhead in the current OpenAI server layer that we are actively working on addressing.

It's probably also not a good idea to create multiple client instances in the same proc; I'd suggest using a single client with asyncio or multiple threads.

@stas00
Contributor Author

stas00 commented Aug 28, 2024

Thank you for the suggestions, Nick.

I have already tried using multi-proc; it gives only a very marginal improvement over multiple threads.

I have been using this approach:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

# with ProcessPoolExecutor(max_workers=num_clients) as executor:
with ThreadPoolExecutor(max_workers=num_clients) as executor:
    # one request per client; results come back in submission order
    for client_id, results in zip(clients, executor.map(parallel_request, clients)):
        ...

ProcessPoolExecutor does a tad better than ThreadPoolExecutor when using many clients, but it's still failing to scale.

I checked that the node I run the clients on is totally underloaded - 200+ CPU cores and 2TB of RAM. Running the k6 JS client on the same instance is a breeze - it scales great, with high GPU util.

As I said the same ProcessPoolExecutor and ThreadPoolExecutor code scales just fine using your native solution https://docs.vllm.ai/en/latest/getting_started/examples/api_client.html - which suggests that the parallelization implementation is not the bottleneck.

So what would you recommend I use as a python client that I could crank up the concurrency with?


but we are also aware of significant overhead in the current openai server layer that we are actively working on addressing.

But I have no scalability issue with your OpenAI server: if I use the k6 client I get high GPU util and high throughput - it's the OpenAI completions client that somehow causes the problem.

It feels like the client is somehow blocking the server from continuing its compute, as if the server's comm layer is blocking compute - shouldn't they be async? i.e. the server should continue its compute without waiting for the client to receive the tokens generated so far.

@vllm-project vllm-project deleted a comment Aug 28, 2024
@njhill
Member

njhill commented Aug 28, 2024

I was suggesting that instead of creating a client per worker, you try having them all use the same client instance. I'm not sure whether the client is thread-safe, so this may or may not work. An alternative would be to use the async variant of the client.
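
For illustration, here is a minimal sketch of the shared-client pattern with the async variant (this is not code from the thread; the endpoint, API key and model name are placeholders for whatever the vLLM server is serving):

import asyncio

from openai import AsyncOpenAI  # assumes openai>=1.0

# A single client instance shared by all in-flight requests.
client = AsyncOpenAI(base_url="http://localhost:8000/v1",  # placeholder endpoint
                     api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(
        model="my-model",  # placeholder model name
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main(num_clients: int = 50) -> None:
    prompts = [f"prompt {i}" for i in range(num_clients)]
    # asyncio.gather keeps all requests in flight concurrently on one client.
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(results)} completions")

asyncio.run(main())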

@robertgshaw2-neuralmagic
Collaborator

Have you seen the example in benchmarks/benchmark_serving.py?

@stas00
Contributor Author

stas00 commented Aug 28, 2024

Thank you for this suggestion, Nick. I wasn't aware of the asyncio OpenAI API. Should it be mentioned somewhere in the performance section? I think it's critical for vLLM, since if the client is the bottleneck it's vLLM's adoption that will suffer.

So I extended my benchmark to include an asyncio version - it did improve the speed marginally, but we are still miles away from high GPU util. This is a 10% improvement vs. the 500% needed to match other clients.

Non-streaming results are averages for 50 concurrent clients using openai api:

| Stage    | Latency (tok/sec) | Elapsed (seconds) | Tokens |
| :------- | ----------------: | ----------------: | -----: |
| Combined |             19.15 |             80.41 |   1540 |



Non-streaming results are averages for 50 concurrent clients using async-openai api:

| Stage    | Latency (tok/sec) | Elapsed (seconds) | Tokens |
| :------- | ----------------: | ----------------: | -----: |
| Combined |             21.01 |             71.81 |   1508 |

Going to look into benchmarks/benchmark_serving.py next - thank you for the suggestion, @robertgshaw2-neuralmagic

@stas00
Contributor Author

stas00 commented Aug 28, 2024

@robertgshaw2-neuralmagic, your suggestion to use https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py seems to be a good one! I need to study it to see where my naive implementation is inefficient.

Perhaps it'd be useful to document it as something a user could use to benchmark their use-cases? I somehow missed it as it looked dev-facing and thus started writing my own, but it looks to be totally ready to be used by end users.

Also, how does the benchmark linked from the top-level README.md get updated? https://buildkite.com/vllm/performance-benchmark/builds/4068 - it seems to be quite old - many versions have been published since it was made - and I'm not sure how to find the most recent one; I see only partial benchmark reports on that website.

But good news, now that I have this tool - I can go and try your autoAWQ suggestion, Robert.

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Aug 28, 2024

@robertgshaw2-neuralmagic, your suggestion to use https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py seems to be a good one! I need to study it to see where my naive implementation is inefficient.

Perhaps it'd be useful to document it as something a user could use to benchmark their use-cases? I somehow missed it as it looked dev-facing and thus started writing my own, but it looks to be totally ready to be used by end users.

Also, how does the benchmark linked from the top-level README.md get updated? https://buildkite.com/vllm/performance-benchmark/builds/4068 - it seems to be quite old - many versions have been published since it was made - and I'm not sure how to find the most recent one; I see only partial benchmark reports on that website.

But good news, now that I have this tool - I can go and try your autoAWQ suggestion, Robert.

Couple things:

  • the async openai client has a max concurrency (I too ran into this problem trying to simplify some benchmarking runs I was doing). So, even if you send 10000 requests, only a subset will actually go up to the server - one way to raise the client-side limit is sketched at the end of this comment
  • the benchmark script uses aiohttp, and this limit does not exist there
  • the benchmarking in the CI uses these scripts. I think we might need to change the link, as my understanding is that this runs nightly
  • the benchmark scripts are targeted at developers, but feel free to use them as you see fit, and we welcome any PRs to improve our docs

RE: quantization - please try out LLM-compressor!
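
A minimal sketch of one way to lift the client-side cap described above, assuming it comes from the connection pool of the httpx transport that the openai client uses (defaults vary by openai/httpx version; the endpoint and API key are placeholders):

import httpx
from openai import AsyncOpenAI

# Hand the OpenAI client a custom httpx.AsyncClient with a larger
# connection pool so more requests can be in flight at once.
http_client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=500, max_keepalive_connections=500),
    timeout=httpx.Timeout(600.0),
)
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="EMPTY",
    http_client=http_client,
)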

@stas00
Contributor Author

stas00 commented Aug 29, 2024

the async openai client has a max concurrency (I too ran into this problem trying to simplify some benchmarking runs I was doing). So, even if you send 10000 requests, only a subset will actually go up to the server

That's super useful to know, Robert. Do you by chance know if this is documented somewhere, in particular what the current limit is and what it depends on? CPU cores, RAM, something else? Is there a plan to fix that (is there an issue for it)? I'm aware that this is outside of vllm.

And what concurrency did you hit in your experiments that was still ok? i.e. what is the size of the subset you mentioned?

the rest!

appreciate the notes - let me experiment some more with this new tool so that I feel comfortable with getting consistent results (as I see some fluctuations, which would impact the measuring of optimization features). Any suggestions on how I should run this benchmark so that I get more consistent outputs if I repeat the same benchmark multiple times in a row?

Right now I have been using:

python benchmark_serving.py \
    --backend vllm \
    --model $model \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 9999 \
    --save-result \
    --result-dir results \
    --result-filename test.json \
    --num-prompts 50 \
    --request-rate inf \
    --seed 42

probably more prompts?

Also, is there a way this benchmark can tell me the upper limit of concurrent requests vLLM can handle before it starts impacting per-user throughput / TTFT? Other than doing multiple tries and finding the concurrency where TTFT + decode throughput is still reasonable?

quantization - please try out LLM-compressor!

yes, now I'm going to re-try them all. I tried them first with a naive single client and couldn't tell any difference. So now I should be good to see the actual impact.

@robertgshaw2-neuralmagic
Collaborator

the async openai client has a max concurrency (I too ran into this problem trying to simplify some benchmarking runs I was doing). So, even if you send 10000 requests, only a subset will actually go up to the server

That's super useful to know, Robert. Do you by chance know if this is documented somewhere, in particular what the current limit is and what it depends on? CPU cores, RAM, something else? Is there a plan to fix that (is there an issue for it)? I'm aware that this is outside of vllm.

I found out when I was observing the vllm logs and then inspecting the openai client source code.

I don't quite understand the use case for sending N requests from the same client though, other than for benchmarking. Can you elaborate more on your use case?

And what concurrency did you hit in your experiments that was still ok? i.e. what is the size of the subset you mentioned?

the rest!

appreciate the notes - let me experiment some more with this new tool so that I feel comfortable with getting consistent results (as I see some fluctuations, which would impact the measuring of optimization features). Any suggestions on how I should run this benchmark so that I get more consistent outputs if I repeat the same benchmark multiple times in a row?

Right now I have been using:

python benchmark_serving.py \
    --backend vllm \
    --model $model \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 9999 \
    --save-result \
    --result-dir results \
    --result-filename test.json \
    --num-prompts 50 \
    --request-rate inf \
    --seed 42

probably more prompts?

Also, is there a way this benchmark can tell me the upper limit of concurrent requests vLLM can handle before it starts impacting per-user throughput / TTFT? Other than doing multiple tries and finding the concurrency where TTFT + decode throughput is still reasonable?

The key thing with quantization is that performance is a function of the request rate on the application. So, the sample you have here is effectively an offline batch use case.

Personally, I like to look at TPOT and TTFT as a function of queries per second on the model. In my blog, I discuss how various quantization schemes impact performance

I included details on how to replicate the benchmarks:

MODEL=MODEL_STUB_TO_BENCHMARK
TOTAL_SECONDS=120
QPS_RATES=("1" "3" "5" "7" "9")

for QPS in ${QPS_RATES[@]}; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="

    python3 benchmarks/benchmark_serving.py \
        --model $MODEL \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 \
        --dataset-path benchmarks/sonnet.txt \
        --num-prompts $NUM_PROMPTS --request-rate $QPS
done

quantization - please try out LLM-compressor!

yes, now I'm going to re-try them all. I tried them first with a naive single client and couldn't tell any difference. So now I should be good to see the actual impact.

One more note: we have some performance overheads in the OpenAI server we are very close to resolving. I can ping you again once these are complete.

@robertgshaw2-neuralmagic
Collaborator

Please feel free to post your results here once you have them, I can take a look and help you understand what is going on.

@binarycrayon

we are seeing this as well. Reverting to 0.5.3
Also subbed this.

@robertgshaw2-neuralmagic
Collaborator

we are seeing this as well. Reverting to 0.5.3 Also subbed this.

@binarycrayon This issue does not have anything to do with the vLLM version - can you elaborate on what you are seeing?

@binarycrayon

I'm running vLLM with 9 LoRA adapters with the OpenAI server against our product.
The tokens per second have dropped significantly in 0.5.5 compared to 0.5.3. I could provide some data later, but I'm a bit tied up this week.
Also, 0.5.4 seemed fine, but that release introduced a bug in the Grafana metrics so we skipped it.

@robertgshaw2-neuralmagic
Collaborator

I'm running vLLM with 9 LoRA adapters with the OpenAI server against our product.
The tokens per second have dropped significantly in 0.5.5 compared to 0.5.3. I could provide some data later, but I'm a bit tied up this week.
Also, 0.5.4 seemed fine, but that release introduced a bug in the Grafana metrics so we skipped it.

Thank you. Could you try running on v0.5.5 with --disable-frontend-multiprocessing in the launch script?

@stas00
Contributor Author

stas00 commented Aug 29, 2024

I don't quite understand the use case for sending N requests from the same client though, other than for benchmarking. Can you elaborate more on your use case?

Benchmarking. There are ~6 different quantization techniques theoretically supported by vllm. So how would you know which one works the best if you don't measure performance? (assuming quality is on par).

It's a big thing - all these competing frameworks - so it's critical to be able to quickly measure which one delivers the best speed on given hardware, while keeping quality.

I think ideally I'd have an abstraction layer where the server software can use multiple frameworks and switch between them at will, depending on the use case. For example, if we look at https://buildkite.com/vllm/performance-benchmark/builds/4068 there is no clear winner:

[screenshot: performance benchmark comparison across frameworks]

and even if there was one, a few months later the winner is likely to be the loser and vice versa.

So, as I flagged earlier, the benchmark you shared is two months old and surely vLLM at least has improved since then, so the information it shows is probably no longer accurate.

One more note: we have some performance overheads in the OpenAI server we are very close to resolving. I can ping you again once these are complete.

Yes, please, Robert!

Please feel free to post your results here once you have them, I can take a look and help you understand what is going on.

This is much appreciated, Robert - I will do that!

@stas00
Contributor Author

stas00 commented Aug 29, 2024

And I found your openai completions client replacement with aiohttp:

async def async_request_openai_completions(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    api_url = request_func_input.api_url
    assert api_url.endswith(
        "completions"
    ), "OpenAI Completions API URL must end with 'completions'."

    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        assert not request_func_input.use_beam_search
        payload = {
            "model": request_func_input.model,
            "prompt": request_func_input.prompt,
            "temperature": 0.0,
            "best_of": request_func_input.best_of,
            "max_tokens": request_func_input.output_len,
            "stream": True,
        }
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }

        output = RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len

        generated_text = ""
        ttft = 0.0
        st = time.perf_counter()
        most_recent_timestamp = st
        try:
            async with session.post(url=api_url, json=payload,
                                    headers=headers) as response:
                if response.status == 200:
                    async for chunk_bytes in response.content:
                        chunk_bytes = chunk_bytes.strip()
                        if not chunk_bytes:
                            continue

                        chunk = remove_prefix(chunk_bytes.decode("utf-8"),
                                              "data: ")
                        if chunk == "[DONE]":
                            latency = time.perf_counter() - st
                        else:
                            data = json.loads(chunk)

                            # NOTE: Some completion API might have a last
                            # usage summary response without a token so we
                            # want to check a token was generated
                            if data["choices"][0]["text"]:
                                timestamp = time.perf_counter()
                                # First token
                                if ttft == 0.0:
                                    ttft = time.perf_counter() - st
                                    output.ttft = ttft

                                # Decoding phase
                                output.itl.append(timestamp -
                                                  most_recent_timestamp)

                                most_recent_timestamp = timestamp
                                generated_text += data["choices"][0]["text"]

                    output.generated_text = generated_text
                    output.success = True
                    output.latency = latency
                else:
                    output.error = response.reason or ""
                    output.success = False
        except Exception:
            output.success = False
            exc_info = sys.exc_info()
            output.error = "".join(traceback.format_exception(*exc_info))

    if pbar:
        pbar.update(1)
    return output
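
For reference, a hedged usage sketch of the function above, driven the same way benchmark_serving.py drives it (many coroutines gathered with asyncio); it assumes RequestFuncInput is a dataclass with the fields used above, and the prompt, port and model name are placeholders:

import asyncio

async def run(n: int = 50):
    inputs = [
        RequestFuncInput(
            prompt="Hello, world",  # placeholder prompt
            api_url="http://localhost:9999/v1/completions",  # placeholder port
            prompt_len=3,
            output_len=128,
            model="my-model",  # placeholder model name
            best_of=1,
            use_beam_search=False,
        )
        for _ in range(n)
    ]
    # All n requests are in flight at once; each returns a RequestFuncOutput.
    outputs = await asyncio.gather(
        *(async_request_openai_completions(i) for i in inputs))
    print(sum(o.success for o in outputs), "successful requests")

asyncio.run(run())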

@stas00
Contributor Author

stas00 commented Aug 29, 2024

OK, so I will be running various setups around a llama2-8b model w/ vllm==0.5.5.

Baseline:

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  6.86
Total input tokens:                      12180
Total generated tokens:                  11502
Request throughput (req/s):              7.29
Input token throughput (tok/s):          1775.38
Output token throughput (tok/s):         1676.56
---------------Time to First Token----------------
Mean TTFT (ms):                          232.39
Median TTFT (ms):                        273.13
P99 TTFT (ms):                           281.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.41
Median TPOT (ms):                        10.46
P99 TPOT (ms):                           22.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.54
Median ITL (ms):                         9.42
P99 ITL (ms):                            13.13
==================================================

❌ AutoAWQ (worse than the baseline)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  8.46
Total input tokens:                      12180
Total generated tokens:                  11452
Request throughput (req/s):              5.91
Input token throughput (tok/s):          1439.98
Output token throughput (tok/s):         1353.91
---------------Time to First Token----------------
Mean TTFT (ms):                          257.20
Median TTFT (ms):                        246.24
P99 TTFT (ms):                           321.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.22
Median TPOT (ms):                        12.84
P99 TPOT (ms):                           26.48
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.78
Median ITL (ms):                         11.67
P99 ITL (ms):                            15.65
==================================================


❌ BNB (much worse than the baseline)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  26.51
Total input tokens:                      12180
Total generated tokens:                  10946
Request throughput (req/s):              1.89
Input token throughput (tok/s):          459.46
Output token throughput (tok/s):         412.91
---------------Time to First Token----------------
Mean TTFT (ms):                          375.60
Median TTFT (ms):                        386.13
P99 TTFT (ms):                           484.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.58
Median TPOT (ms):                        35.92
P99 TPOT (ms):                           66.11
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.40
Median ITL (ms):                         34.77
P99 ITL (ms):                            40.73
==================================================

@stas00
Contributor Author

stas00 commented Aug 29, 2024

And the good results:

✅ INT8 W8A8 (better than the baseline)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  6.18
Total input tokens:                      12180
Total generated tokens:                  11499
Request throughput (req/s):              8.10
Input token throughput (tok/s):          1972.46
Output token throughput (tok/s):         1862.18
---------------Time to First Token----------------
Mean TTFT (ms):                          209.19
Median TTFT (ms):                        244.13
P99 TTFT (ms):                           248.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.74
Median TPOT (ms):                        9.47
P99 TPOT (ms):                           28.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.70
Median ITL (ms):                         8.63
P99 ITL (ms):                            12.28
==================================================

✅ FP8 W8A8 (on the fly / dynamic quantization) (better than the baseline)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  6.50
Total input tokens:                      12180
Total generated tokens:                  11515
Request throughput (req/s):              7.69
Input token throughput (tok/s):          1874.02
Output token throughput (tok/s):         1771.70
---------------Time to First Token----------------
Mean TTFT (ms):                          212.84
Median TTFT (ms):                        255.63
P99 TTFT (ms):                           260.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.36
Median TPOT (ms):                        9.88
P99 TPOT (ms):                           25.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.06
Median ITL (ms):                         8.84
P99 ITL (ms):                            13.05
==================================================

✅ FP8 W8A8 (llmcompressor==0.1.0) (better than the baseline)

============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  6.01
Total input tokens:                      12180
Total generated tokens:                  11410
Request throughput (req/s):              8.32
Input token throughput (tok/s):          2026.66
Output token throughput (tok/s):         1898.54
---------------Time to First Token----------------
Mean TTFT (ms):                          199.64
Median TTFT (ms):                        224.42
P99 TTFT (ms):                           246.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.38
Median TPOT (ms):                        9.23
P99 TPOT (ms):                           20.40
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.43
Median ITL (ms):                         8.24
P99 ITL (ms):                            11.52
==================================================

@robertgshaw2-neuralmagic
Collaborator

What QPS rate are you running at?

I would suggest running the serving experiments for ~2 minutes, such that the server can come into equilibrium

@stas00
Contributor Author

stas00 commented Aug 29, 2024

I suppose it's qps=50 since I have 50 prompts sent all at once - or do you measure QPS differently? Here is how I run the benchmark:

python benchmark_serving.py \
    --backend vllm \
    --model $MODEL \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 9999 \
    --save-result \
    --result-dir results \
    --result-filename test.json \
    --num-prompts 50 \
    --request-rate inf \
    --seed 42

The problem is that if I make num-prompts much higher it's likely to hit a queue, no? Then the measurements would be wrong for the purpose of the benchmark.

That's why earlier I asked how to find the threshold at which vLLM starts queueing up requests. I think this is a crucial metric as well, since it really tells you the server's capacity. Once a request is queued, the TTFT is going to be bad.

It sounds like the benchmark needs a new config option to tell it how long to run for?

A typical benchmark usually has a warmup period. Though how would one warm up here? I guess run the first X requests and don't count them towards the stats?
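
As a rough sketch of that last idea (nothing vLLM-specific, just "drop the first X requests from the stats"):

def summarize_ttft(ttfts_ms: list[float], warmup: int = 10) -> float:
    # Discard the warmup requests so cold-start effects don't skew the mean.
    measured = ttfts_ms[warmup:]
    return sum(measured) / len(measured) if measured else float("nan")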

@stas00
Contributor Author

stas00 commented Aug 29, 2024

KV cache quantization experiment results:

  1. FP8 E5M2 KV cache made no difference to the bottom line. Perhaps it's not doing anything, because I get an identical outcome compared to the baseline.

  2. FP8 E4M3 KV cache (to be figured out yet, as it's not documented well): https://docs.vllm.ai/en/latest/quantization/fp8_e4m3_kvcache.html

    In fact the info is completely outdated, as NVIDIA renamed the package - tracking the new changes here: [Doc]: nvidia ammo has been renamed #8010

@stas00 stas00 closed this as completed Oct 28, 2024