
ScaleLLM vs vLLM in performance #144

Open
WangErXiao opened this issue Apr 27, 2024 · 20 comments

@WangErXiao

Is there any performance comparison data between ScaleLLM and vLLM?

@zhyncs

zhyncs commented Apr 27, 2024

Hi @guocuimi, thanks for your outstanding work. In addition to the performance comparison with vLLM, please also consider adding TensorRT-LLM, LMDeploy, RTP-LLM, and TGI if possible. And maybe we could use vLLM's benchmark serving script. Thanks.

@guocuimi
Collaborator

guocuimi commented Apr 27, 2024

Thank you for your interest in ScaleLLM. Yeah, it is indeed on our roadmap. We do have some internal numbers, but they are not ready to share yet. As part of our upcoming plans, we will do a comprehensive comparison (in a separate repo) in the coming weeks after finishing the Python wrapper. Stay tuned!

Meanwhile, feel free to conduct your own benchmarks for your specific scenarios using the vLLM benchmark serving script. Thanks.

@zhyncs

zhyncs commented May 10, 2024

Hi @guocuimi, could you use GitHub Actions to release the Python package? Consider supporting CUDA 11.8 and CUDA 12.2, which would make it more convenient for users. At the same time, we could then easily compare performance with other frameworks through the OpenAI-compatible server.

@guocuimi
Collaborator

Thanks for your advice. Yeah, it is in our plan. I am working on setting up the wheel build for each release; for now, I am trying to reduce the wheel size first. It should be ready this week. Stay tuned!

@guocuimi
Collaborator

Hi @zhyncs, a quick update for you: Python is supported in the latest release.
You can install ScaleLLM with pip (pip install scalellm) and start the REST API server with python3 -m scalellm.serve.api_server.
Please let me know if you have any questions. Thanks.
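Since the server speaks the OpenAI-compatible API, here is a minimal sketch of querying it from Python. It assumes the openai client (v1+) is installed, the server listens on port 8080 (the port used later in this thread), and that the model name matches what the server was started with; adjust as needed.

# Minimal sketch: query a locally running ScaleLLM OpenAI-compatible server.
# Assumes port 8080 and the standard /v1 completions route; the model name
# below is illustrative and should match the model the server was started with.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="EMPTY")

resp = client.completions.create(
    model="Llama-2-13b-chat-hf",
    prompt="Hello, my name is",
    max_tokens=32,
)
print(resp.choices[0].text)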

@zhyncs

zhyncs commented May 20, 2024

> Hi @zhyncs, a quick update for you: Python is supported in the latest release. You can install ScaleLLM with pip (pip install scalellm) and start the REST API server with python3 -m scalellm.serve.api_server. Please let me know if you have any questions. Thanks.

Cool! I will verify it ASAP, thanks.

@zhyncs

zhyncs commented May 20, 2024

Hi @guocuimi, the package you are currently building in GitHub Actions depends on GLIBC_2.27, which is not friendly to CentOS 7 (still widely used in industry) and therefore still requires manual compilation there.
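For context, a quick sketch to check the host glibc version when deciding whether a prebuilt wheel will load; CentOS 7 ships glibc 2.17, so a wheel requiring GLIBC_2.27 cannot run there without a newer runtime.

# Print the local libc version; on CentOS 7 this reports glibc 2.17,
# which is below the GLIBC_2.27 required by the current wheels.
import platform

libc, version = platform.libc_ver()
print(f"libc: {libc}, version: {version}")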

@guocuimi
Collaborator

Thanks for letting me know. Let me try to downgrade GCC to 10 and republish new packages using manylinux2014 (CentOS 7 based).

Toolchain: GCC 10

@zhyncs

zhyncs commented Jun 11, 2024

> republish new packages

The latest version works: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.1.3
And could you update the docs at https://github.com/vectorch-ai/ScaleLLM/tree/main/docs/source, for example with how to set up an OpenAI-compatible server? Thanks.

@zhyncs

zhyncs commented Jun 11, 2024

If the interface is compatible, then we can directly use vLLM's benchmark script at https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py

The recent BentoML blog post https://www.bentoml.com/blog/benchmarking-llm-inference-backends can also serve as a reference.

@guocuimi
Collaborator

guocuimi commented Jun 11, 2024 via email

@zhyncs

zhyncs commented Jun 11, 2024

> we can use it directly

I gave it a quick try, and there seems to be a problem at the moment.

https://github.com/vllm-project/vllm/blob/351d5e7b8253d754b2a951152cd48927c4c1629d/benchmarks/backend_request_func.py#L261-L262

python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf

python3 benchmark_serving.py --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1 --request-rate 1

@zhyncs

zhyncs commented Jun 11, 2024

> in coming weeks

Looking forward to your results.

@guocuimi
Collaborator

Thanks, I have never tried that benchmark script. I will try it after wrapping up the current feature-parity work for logprobs and best_of. Thanks.

@zhyncs

zhyncs commented Jul 1, 2024

> Thanks, I have never tried that benchmark script. I will try it after wrapping up the current feature-parity work for logprobs and best_of. Thanks.

Hi @guocuimi, how is the progress going? Looking forward to your results. Cheers.

@guocuimi
Collaborator

guocuimi commented Jul 1, 2024

Oh, sorry for the delayed response. My bad, I forgot to update you here. The latest package should work with this benchmark tool.
Please feel free to try it again, and don't be surprised if you see a significant improvement! :)

@zhyncs

zhyncs commented Jul 1, 2024

> Oh, sorry for the delayed response. My bad, I forgot to update you here. The latest package should work with this benchmark tool. Please feel free to try it again, and don't be surprised if you see a significant improvement! :)

Great work! Cheers.

@guocuimi
Collaborator

guocuimi commented Jul 1, 2024

Thanks for your interest in our project; we appreciate it. Please let us know if you see any issues.

Please note that we are still in the alpha stage and do not intend to compete with other open-source solutions at this time. :) We have a bunch of features in the pipeline, and we want to finish them first before sharing benchmarks with a broader audience. Thanks for your understanding.

@zhyncs

zhyncs commented Jul 1, 2024

Hi @guocuimi, it's blazing fast. These are the benchmark results. If you have any questions, please contact me. Thanks. cc @lzhangzz @lvhan028 @grimoire

# TurboMind
python3 -m lmdeploy serve api_server /workdir/Llama-2-13b-chat-hf

# ScaleLLM
python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf

# TurboMind
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 23333 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json  --model /workdir/Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128

# ScaleLLM
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json  --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128

# TurboMind

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  108.19
Total input tokens:                      245995
Total generated tokens:                  196273
Request throughput (req/s):              9.24
Input token throughput (tok/s):          2273.65
Output token throughput (tok/s):         1814.09
---------------Time to First Token----------------
Mean TTFT (ms):                          29953.87
Median TTFT (ms):                        27039.17
P99 TTFT (ms):                           80858.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.79
Median TPOT (ms):                        59.59
P99 TPOT (ms):                           278.76
---------------Inter-token Latency----------------
Mean ITL (ms):                           215.63
Median ITL (ms):                         44.80
P99 ITL (ms):                            290.98
==================================================

# ScaleLLM
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  115.64
Total input tokens:                      245995
Total generated tokens:                  195334
Request throughput (req/s):              8.65
Input token throughput (tok/s):          2127.27
Output token throughput (tok/s):         1689.17
---------------Time to First Token----------------
Mean TTFT (ms):                          33849.20
Median TTFT (ms):                        33777.98
P99 TTFT (ms):                           84234.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.95
Median TPOT (ms):                        66.42
P99 TPOT (ms):                           86.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           237.65
Median ITL (ms):                         102.24
P99 ITL (ms):                            114.65
==================================================
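For reference, the throughput rows above are simply the totals divided by the benchmark duration; a small sketch re-deriving them from the raw numbers reported by the script (differences are rounding only):

# Re-derive the throughput rows of the two reports from their raw totals.
results = {
    "TurboMind": {"duration_s": 108.19, "requests": 1000,
                  "input_tokens": 245995, "output_tokens": 196273},
    "ScaleLLM": {"duration_s": 115.64, "requests": 1000,
                 "input_tokens": 245995, "output_tokens": 195334},
}

for name, r in results.items():
    print(f"{name}:")
    print(f"  request throughput (req/s): {r['requests'] / r['duration_s']:.2f}")
    print(f"  input token throughput (tok/s): {r['input_tokens'] / r['duration_s']:.2f}")
    print(f"  output token throughput (tok/s): {r['output_tokens'] / r['duration_s']:.2f}")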

@zhyncs

zhyncs commented Jul 1, 2024

ref https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
