ScaleLLM vs vLLM performance comparison #144
Hi @guocuimi Thanks for your outstanding work. In addition to the performance comparison with vLLM, if possible, please consider adding TensorRT-LLM, LMDeploy, RTP-LLM, and TGI. And maybe we could use the vLLM benchmark serving script. Thanks. |
Thank you for your interest in ScaleLLM. Yeah, it is indeed on our roadmap. We do have some internal numbers, but they are not ready to share yet. As part of our upcoming plans, we will do a comprehensive comparison (in a separate repo) in the coming weeks after finishing the Python wrapper part. Stay tuned! Meanwhile, feel free to conduct your own benchmarks for your specific scenarios using the vLLM benchmark serving script. Thanks. |
Hi @guocuimi Could you use GitHub Actions to release the Python package? Consider supporting CUDA 11.8 and CUDA 12.2, which would make it more convenient for users. At the same time, we could easily compare performance with other frameworks through the OpenAI-compatible server. |
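For reference, once an OpenAI-compatible server is exposed, standard clients can be pointed at it directly. Below is a minimal sketch using the official openai Python client; the port (8080) and model name are assumptions carried over from commands later in this thread, not documented ScaleLLM defaults.
# Minimal sketch: query a locally running OpenAI-compatible server.
# Port 8080 and the model name are assumptions taken from commands later in this thread.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # point the client at the local server
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Llama-2-13b-chat-hf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)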
Thanks for your advice. Yeah, that is our plan. I am working on setting up the wheel build for each release. For now, I am trying to reduce the wheel size first; it should be ready this week. Stay tuned! |
Hi @zhyncs A quick update for you: Python is supported in the latest release. |
Cool! I will verify it ASAP, thanks. |
Hi @guocuimi The package you are currently compiling in GitHub Actions depends on |
Thanks for letting me know. Let me try downgrading to GCC 10 and republishing new packages using the manylinux2014 toolchain (GCC 10). |
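As background, manylinux2014 wheels target a glibc 2.17 baseline (CentOS 7), which is why building with an older GCC toolchain helps. Below is a small illustrative check, not part of ScaleLLM, that a user could run to see whether their environment meets that baseline.
# Illustrative environment check before installing a manylinux2014 wheel.
# manylinux2014 wheels target glibc >= 2.17 (the CentOS 7 baseline).
import platform

lib, version = platform.libc_ver()
if lib == "glibc":
    major, minor = (int(x) for x in version.split(".")[:2])
    ok = (major, minor) >= (2, 17)
    print(f"glibc {version}: {'compatible' if ok else 'too old'} for manylinux2014")
else:
    print(f"non-glibc libc detected ({lib or 'unknown'}); manylinux wheels may not apply")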
The latest version is OK: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.1.3 |
If the interface is compatible, then we can directly use vLLM's benchmark script at https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py The recent BentoML blog post https://www.bentoml.com/blog/benchmarking-llm-inference-backends can also serve as a reference. |
Yeah, we can use it directly. Just sharing our plans on this: a continuous benchmark will be set up in the coming weeks, comparing offline and serving metrics between ScaleLLM, vLLM, and TensorRT-LLM. We are working on it.
Thanks,
Michael Mi
|
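For context, "serving metrics" here usually means request throughput plus latency breakdowns such as time to first token (TTFT) and time per output token (TPOT). The sketch below is purely illustrative of how such metrics are aggregated from per-request timings; the numbers are made up and it is not taken from any benchmark code in this thread.
# Illustrative only: aggregate serving metrics from per-request records.
# Each record is (start_time, first_token_time, end_time, output_tokens); values are made up.
import statistics

records = [
    (0.00, 0.12, 1.40, 128),
    (0.05, 0.20, 2.10, 256),
    (0.10, 0.15, 1.05, 96),
]

duration = max(end for _, _, end, _ in records) - min(start for start, _, _, _ in records)
throughput = len(records) / duration                        # requests per second
ttft = [first - start for start, first, _, _ in records]    # time to first token
tpot = [(end - first) / max(tokens - 1, 1)                  # time per output token
        for _, first, end, tokens in records]

print(f"throughput: {throughput:.2f} req/s")
print(f"mean TTFT:  {statistics.mean(ttft) * 1000:.1f} ms")
print(f"mean TPOT:  {statistics.mean(tpot) * 1000:.1f} ms")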
I gave it a quick try, and there seems to be a problem at the moment.
python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf
python3 benchmark_serving.py --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1 --request-rate 1 |
Looking forward to your results. |
Thanks, I have never tried that benchmark script. I will try it after wrapping up the current feature parity work for logprobs and best_of. |
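For readers unfamiliar with them, logprobs and best_of are standard parameters of the OpenAI completions API. Below is a hedged sketch of what such a request could look like against a local OpenAI-compatible endpoint; the host, port, and model name are assumptions carried over from earlier commands in this thread.
# Sketch of an OpenAI-style /v1/completions request exercising logprobs and best_of.
# Host, port, and model name are assumptions based on earlier commands in this thread.
import requests

payload = {
    "model": "Llama-2-13b-chat-hf",
    "prompt": "The capital of France is",
    "max_tokens": 8,
    "logprobs": 1,   # return log-probabilities for the top token at each position
    "best_of": 2,    # sample two completions server-side and return the best one
}
resp = requests.post("http://localhost:8080/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0])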
Hi @guocuimi How is the progress going? Looking forward to your results. Cheers. |
Oh, sorry for the delayed response. My bad, I forgot to update you here. The latest package should work with this benchmark tool. |
Great work! Cheers. |
Thanks for your interest in our project; we appreciate it. Please let us know if you see any issues. Please note that we are still in the alpha stage and do not intend to compete with other open-source solutions at this time. :) We have a bunch of features in the pipeline, and we want to finish them first before sharing benchmarks with a broader audience. Thanks for your understanding. |
Hi @guocuimi It's blazing fast. These are the benchmark results. If you have any questions, please contact me. Thanks. cc @lzhangzz @lvhan028 @grimoire
# TurboMind
python3 -m lmdeploy serve api_server /workdir/Llama-2-13b-chat-hf
# ScaleLLM
python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf
# TurboMind
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 23333 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model /workdir/Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128
# ScaleLLM
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128
|
Is there any performance comparison data between ScaleLLM and vLLM?