Serving Benchmark Refactoring #2433

Merged: 40 commits merged into vllm-project:main on Feb 13, 2024

Conversation

@ywang96 (Member) commented Jan 13, 2024

The goal of this PR is to refactor the current online serving benchmark script to make it easier to use and contribute to, as well as to include more features. Some major items are:

  • Refactor the backend query functions out of the main benchmark script so that adding support for more backends is easier/cleaner.
    • TGI
    • vLLM
    • TensorRT-LLM
    • OpenAI Completions
    • DeepSpeed-MII
  • Token-level throughput information instead of request-level
  • Add median/P99 TPOT (time per output token) in addition to average TPOT (see the percentile sketch after this list)
  • Add option to save results to a JSON file
  • Add TTFT (time to first token) measurement
    • Note: DeepSpeed-MII does not have official support for streaming as of Jan 29, 2024 and has a PR in progress.
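
As a rough illustration of the median/P99 metrics above, here is a minimal sketch of how such percentile summaries could be computed, assuming per-request TTFT/TPOT values have already been collected; the function and variable names are illustrative, not the PR's actual code:

```python
import numpy as np

def summarize(name: str, values_s: list) -> None:
    """Print mean/median/P99 of a per-request latency metric, in milliseconds."""
    values_ms = np.asarray(values_s) * 1000.0
    print(f"Mean {name}: {values_ms.mean():.2f} ms")
    print(f"Median {name}: {np.median(values_ms):.2f} ms")
    print(f"P99 {name}: {np.percentile(values_ms, 99):.2f} ms")

# Placeholder per-request measurements in seconds, standing in for real benchmark data.
ttfts = [0.085, 0.091, 0.110, 0.102, 0.197]
tpots = [0.028, 0.029, 0.031, 0.027, 0.042]
summarize("TTFT", ttfts)
summarize("TPOT", tpots)
```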

Some other items that can be included are:

  • Allow sampling input & output lengths from a distribution (currently we have a fixed < 1024 input and < 2048 input + output setup); see the sampling sketch after this list
  • Add a latency benchmark (similar to benchmark_latency.py, where we run synchronous requests against the server to measure the best-case latency of the engine/backend)
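
As a rough sketch of what sampling request lengths from a distribution might look like (the log-normal choice, parameter values, and names here are assumptions for illustration, not part of this PR):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_lengths(num_prompts: int,
                   mean_input_len: int = 512,
                   mean_output_len: int = 256,
                   max_total_len: int = 2048) -> list:
    """Sample (input_len, output_len) pairs under a total-length budget."""
    pairs = []
    for _ in range(num_prompts):
        input_len = int(rng.lognormal(mean=np.log(mean_input_len), sigma=0.5))
        input_len = min(max(input_len, 1), max_total_len - 1)
        output_len = int(rng.lognormal(mean=np.log(mean_output_len), sigma=0.5))
        # Cap the output length so input + output stays within the budget,
        # mirroring the fixed < 2048 input + output setup used today.
        output_len = min(max(output_len, 1), max_total_len - input_len)
        pairs.append((input_len, output_len))
    return pairs

print(sample_lengths(5))
```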

@zhaoyang-star (Contributor) commented Jan 16, 2024

Adding DeepSpeed-MII as an alternate backend would be welcome.

@ywang96 (Member, Author) commented Jan 16, 2024

> Adding DeepSpeed-MII as an alternate backend would be welcome.

Yea, that's definitely doable. Since this benchmark is running against a server, I'm considering taking this as the default way to deploy a model server with DeepSpeed-MII.

@ywang96 marked this pull request as ready for review January 17, 2024 19:29

@ywang96 (Member, Author) commented Jan 17, 2024

Here's a sample output from running this version of the benchmark script:

Traffic request rate: inf
Successful requests: 10
Benchmark duration: 20.108469 s
Total input tokens: 1522
Total generated tokens: 2211
Request throughput: 0.50 requests/s
Input token throughput: 75.69 tokens/s
Output token throughput: 109.95 tokens/s
Mean latency per output token: 61.17 ms
Median latency per output token: 40.94 ms
P99 latency per output token: 145.59 ms

A few remarks:

  1. Since TensorRT-LLM doesn't come with an API server, I took Triton as the default serving backend for TRT-LLM.
  2. For this PR I didn't want to introduce any breaking changes yet, so any remaining items can be added in a later PR.

@LiuXiaoxuanPKU Could you take a first pass on this PR and see if there's anything wrong (mostly the design)? I can iterate on it to refine this PR (e.g., adding scripts for launching servers) once we agree on the design.

Review comment on the following code:

    return output


    ASYNC_REQUEST_FUNCS = {

nit: Do we want to organize these into a class? It may make this a lot cleaner, and we could then define an interface for all future backend benchmarks, which would keep the main script free of major changes when new backends are added.

@ywang96 (Member, Author) replied:

My thought is to keep this particular file as flexible as possible: if someone wants to add support for a new backend "ABC", the only thing they need to do is add an async function for "ABC" that performs an online inference on a given prompt (could be HTTP or gRPC), then add "ABC" to ASYNC_REQUEST_FUNCS, without needing to touch the main benchmark script.

We could refactor these async request functions into a class, but that would still require implementing class methods to send requests & parse outputs, which is essentially what these functions already do.
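
To make that concrete, here is a minimal sketch of the dictionary-dispatch approach described above; the function, field, and backend names are illustrative, not the PR's actual implementation:

```python
import time
from dataclasses import dataclass

import aiohttp


@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    latency: float = 0.0
    success: bool = False


async def async_request_abc(api_url: str, prompt: str) -> RequestFuncOutput:
    """Send one online inference request to a hypothetical backend "ABC" and parse the response."""
    output = RequestFuncOutput()
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async with session.post(api_url, json={"prompt": prompt}) as response:
            data = await response.json()
            output.generated_text = data.get("text", "")
            output.success = response.status == 200
    output.latency = time.perf_counter() - start
    return output


# Supporting a new backend only requires registering its request coroutine here;
# the main benchmark script looks the coroutine up by backend name.
ASYNC_REQUEST_FUNCS = {
    "abc": async_request_abc,
}
```

The main benchmark loop would then just look up ASYNC_REQUEST_FUNCS[backend] and await it for each prompt, without containing any backend-specific logic.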

@LiuXiaoxuanPKU (Collaborator) commented:

> Here's a sample output from running this version of the benchmark script [...] Could you take a first pass on this PR and see if there's anything wrong (mostly the design)?

Hi Roger, thanks for the PR. Yeah, the design looks good to us, please go ahead. The only thing we want to confirm is that the performance numbers are similar before and after refactoring.

@ywang96 (Member, Author) commented Jan 21, 2024

> Hi Roger, thanks for the PR. Yeah, the design looks good to us, please go ahead. The only thing we want to confirm is that the performance numbers are similar before and after refactoring.

Thank you for the response! Sounds good; I'll keep iterating on the PR, and as we discussed offline, TTFT will be added to the measured metrics.

@ghost mentioned this pull request Jan 22, 2024
@simon-mo self-requested a review January 22, 2024 00:29

@ywang96 (Member, Author) commented Jan 30, 2024

@LiuXiaoxuanPKU @simon-mo Here's the output from running the main-branch version of benchmark_serving.py and from this branch on Mixtral 8x7B (served on 4xA100-80G with vLLM v0.2.7) with the ShareGPT dataset.

main branch

Namespace(backend='vllm', protocol='http', host='localhost', port=8000, endpoint='/generate', model=None, dataset='ShareGPT_V3_unfiltered_cleaned_split.json', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', best_of=1, use_beam_search=False, num_prompts=100, request_rate=1.0, seed=0, trust_remote_code=False)
Total time: 111.56 s
Throughput: 0.90 requests/s
Average latency: 6.09 s
Average latency per token: 0.01 s
Average latency per output token: 0.03 s

This branch

Namespace(backend='vllm', version='0.2.7', base_url=None, host='localhost', port=8000, endpoint='/generate', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', best_of=1, use_beam_search=False, num_prompts=100, request_rate=1.0, seed=0, trust_remote_code=False, save_result=False)
Traffic request rate: 1.0
Successful requests: 100
Benchmark duration: 103.926134 s
Total input tokens: 23521
Total generated tokens: 22873
Request throughput: 0.96 requests/s
Input token throughput: 226.32 tokens/s
Output token throughput: 220.09 tokens/s
Mean TTFT: 108.50 ms
Median TTFT: 90.62 ms
P99 TTFT: 197.14 ms
Mean TPOT: 29.21 ms
Median TPOT: 28.18 ms
P99 TPOT: 42.40 ms

One note: DeepSpeed-MII currently does not support streaming, so TTFT will be 0 as a placeholder (I've commented about this in the code itself too).

I can share more results, but let me know what you think. Thanks!

@zhuohan123 mentioned this pull request Jan 31, 2024

@ywang96 (Member, Author) commented Feb 12, 2024

Hi @simon-mo! I've refactored the script with dataclasses and edited the serving benchmark portion in the CI. A few last questions I have in mind:

  1. I noticed the benchmark now runs on top of the OpenAI API server instead of the /generate API server. Would you say the OpenAI API server should be used for serving with vLLM by default? (If so, we can just map vllm to the generic OpenAI request function; see the sketch after this list.)
  2. Initially I put the serving scripts in their own serving directory to make my development easier, but I've moved these files back to the top level under benchmarks. Let me know what you think.
  3. Should we add benchmarks for other backends (TGI, DeepSpeed-MII, etc.) to CI, or run them in a separate process? I'm indifferent, so I'll leave that to you to decide; happy to help either way!
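
On point 1, a minimal sketch of what that mapping might look like, reusing the registry idea from the earlier review thread (the function names are placeholders, not the actual ones in this PR):

```python
async def async_request_openai_completions(api_url: str, prompt: str):
    """Placeholder for the generic OpenAI Completions request coroutine."""
    ...


# If the OpenAI-compatible server becomes the default way to serve vLLM,
# the "vllm" entry can simply reuse the generic OpenAI request function.
ASYNC_REQUEST_FUNCS = {
    "openai": async_request_openai_completions,
    "vllm": async_request_openai_completions,
}
```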

@simon-mo merged commit a4211a4 into vllm-project:main Feb 13, 2024
18 checks passed
jvmncs pushed a commit to jvmncs/vllm that referenced this pull request Feb 14, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024
@ywang96 deleted the benchmark-refactor branch March 4, 2024 01:14
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024