Serving Benchmark Refactoring #2433
Conversation
Adding DeepSpeed-MII as an alternate backend would be expected.
Yeah, that's definitely doable. Since this benchmark runs against a server, I'm considering treating this as the default way to deploy a model server with DeepSpeed-MII.
Here's a sample output from running this version of the benchmark script.
A few remarks:
@LiuXiaoxuanPKU Could you take a first pass on this PR and see if there's anything wrong (mostly the design)? I can iterate on it to refine this PR (e.g., adding scripts for launching servers) once we agree on the design.
ASYNC_REQUEST_FUNCS = {
nit: Do we want to organize these into a class? It may make this a lot cleaner, and we could then define an interface for all future backend benchmarks, which would keep the main class free of major changes when new backends are added.
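For illustration, a minimal sketch of what such an interface could look like (the class and method names here are hypothetical, not taken from the PR):

```python
# Hypothetical interface for backend request functions; names are illustrative.
from abc import ABC, abstractmethod


class BackendRequestFunc(ABC):
    """Each backend implements one async method that sends a single request."""

    @abstractmethod
    async def send_request(self, api_url: str, prompt: str,
                           output_len: int) -> dict:
        """Run one online inference and return e.g. generated text and latency."""
        ...


class VLLMRequestFunc(BackendRequestFunc):
    async def send_request(self, api_url: str, prompt: str,
                           output_len: int) -> dict:
        # The HTTP call to the vLLM API server would go here.
        raise NotImplementedError
```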
My thought is to keep this particular file as flexible as possible: if someone wants to add support for a new backend "ABC", the only thing they need to do is add an async function for "ABC" that performs online inference on a given prompt (could be HTTP or gRPC) and then add "ABC" to ASYNC_REQUEST_FUNCS, without needing to touch the main benchmark script.
We could refactor these async request functions into a class, but that would still require class-method implementations to send requests and parse outputs, which is essentially what these functions already do.
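As a rough sketch of that flow, assuming aiohttp and a simple JSON completion endpoint (the function name and payload fields are illustrative, not the PR's exact code):

```python
import time

import aiohttp


# Illustrative only: one async request function per backend, plus a registry.
async def async_request_abc(api_url: str, prompt: str, output_len: int) -> dict:
    """Send a single online inference request to a hypothetical 'ABC' backend."""
    payload = {"prompt": prompt, "max_tokens": output_len}
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async with session.post(api_url, json=payload) as resp:
            data = await resp.json()
    return {
        "generated_text": data.get("text", ""),
        "latency": time.perf_counter() - start,
    }


# Adding a backend only touches this mapping; the main benchmark script
# looks the function up by the backend name passed on the command line.
ASYNC_REQUEST_FUNCS = {
    "abc": async_request_abc,
}
```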
Hi Roger, thanks for the PR. Yeah, the design looks good to us; please go ahead. The only thing we want to confirm is that the performance numbers are similar before and after refactoring.
Thank you for the response! Sounds good; I'll keep iterating on the PR, and as we discussed offline, TTFT will be added to the metrics to be measured.
@LiuXiaoxuanPKU @simon-mo Here's an output from running the main branch version:
This branch:
One note: DeepSpeed-MII currently does not support streaming, so TTFT will be 0 as a placeholder (I've commented about this in the code itself too). I can share more results, but let me know what you think. Thanks!
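For context, a rough sketch of how TTFT is typically taken from a streaming response, with the non-streaming fallback to 0 described above (the endpoint and payload details are assumptions):

```python
import time

import aiohttp


async def measure_ttft(api_url: str, payload: dict, streaming: bool) -> float:
    """Return time to first token, or 0.0 when the backend cannot stream."""
    if not streaming:
        # e.g. DeepSpeed-MII without streaming: no first-token timestamp exists.
        return 0.0
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async with session.post(api_url, json=payload) as resp:
            async for _chunk in resp.content.iter_any():
                # The first streamed chunk marks the first token.
                return time.perf_counter() - start
    return 0.0
```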
Hi @simon-mo! I've refactored the script with dataclasses and edited the serving benchmark portion of the CI. A few last questions I have in mind:
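On the dataclass refactor mentioned above, a loose sketch of the kind of request/result containers involved (the class and field names are assumptions, not the PR's exact definitions):

```python
from dataclasses import dataclass


@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int


@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0  # 0.0 when the backend does not stream
```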
The goal of this PR is to refactor the current online serving benchmark script to make it easier to use and contribute to, as well as to include more features. Some major items are:
Some other items that can be included are:
benchmark_latency.py (where we run synchronous requests against the server to measure the best-scenario latency performance of the engine/backend); see the sketch below.
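A minimal sketch of that synchronous mode, assuming a plain HTTP completion endpoint (the URL, payload, and helper name are illustrative):

```python
import time

import requests


def measure_sync_latencies(api_url: str, prompts: list, output_len: int) -> list:
    """Send requests one at a time so the server never queues; best-case latency."""
    latencies = []
    for prompt in prompts:
        payload = {"prompt": prompt, "max_tokens": output_len}
        start = time.perf_counter()
        response = requests.post(api_url, json=payload)
        response.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return latencies
```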