scripts: benchmark for HTTP server throughput #14668
Conversation
Comparative results for #14363:

```
INFO:server-bench: Benchmark duration: 202.32 s
INFO:server-bench: Request throughput: 4.94 requests/s = 296.56 requests/min
INFO:server-bench: Total prompt length: 28594 tokens
INFO:server-bench: Average prompt length: 28.59 tokens
INFO:server-bench: Average prompt latency: 21.13 ms
INFO:server-bench: Average prompt speed: 1353.26 tokens/s
INFO:server-bench: Total generated tokens: 224138
INFO:server-bench: Average generation depth: 195.85 tokens
INFO:server-bench: Average total generation speed: 1107.86 tokens/s
INFO:server-bench: Average generation speed per slot: 69.24 tokens/s / slot
```

[comparison plots not captured in this excerpt]

At an average depth of only ~200 tokens there is not much of a difference in average performance, but there is still a big gain in terms of consistency.
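As a sanity check on the log above, the per-slot figure is simply the total generation speed divided by the number of parallel slots; the slot count is not printed here, but the numbers imply 16 slots:

```python
# Relationship between the two generation-speed figures in the log above.
# The slot count of 16 is inferred from the numbers, not printed by the script.
total_speed = 1107.86    # "Average total generation speed" (tokens/s)
per_slot_speed = 69.24   # "Average generation speed per slot" (tokens/s)
print(total_speed / per_slot_speed)  # ~16.0 -> the run apparently used 16 parallel slots
```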
```python
def benchmark(path_server: str, path_model: str, path_log: Optional[str], port: int, n_gpu_layers: int, parallel: int, ctx_size: int, n_prompts: int, n_predict: int):
    num_workers: int = parallel + 1
```
What is the purpose of `+ 1` here?
If you set the number of workers exactly equal to the number of slots, the server will be slightly underutilized until the Python code sends the next prompt. With one more Python thread than there are slots, the Python code will already have queued the next request while the server is still processing the previous ones.
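The pattern is roughly the following; this is a minimal sketch of the idea rather than the script's actual code, with `send_request` standing in for whatever function posts a single prompt to the server:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    """Placeholder for the HTTP call that submits one prompt to the server."""
    raise NotImplementedError

def run_benchmark(prompts: list[str], parallel: int) -> list[str]:
    # One more worker thread than the server has slots: while `parallel`
    # requests are being processed, the extra thread has already queued the
    # next one, so no slot sits idle waiting for the Python side.
    num_workers = parallel + 1
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(send_request, prompts))
```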
```python
def get_prompts(n_prompts: int) -> list[str]:
    logger.info("Loading MMLU dataset...")
    ret = datasets.load_dataset("cais/mmlu", "all")["test"]["question"]  # type: ignore
```
Could this dataset be made configurable?
That is what I intend to do going forward.
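Not part of this PR, but to illustrate that direction, a configurable variant of `get_prompts` could look roughly like this; the extra parameters and the truncation to `n_prompts` are hypothetical, not taken from the actual script:

```python
import logging

import datasets  # Hugging Face `datasets` package

logger = logging.getLogger("server-bench")

def get_prompts(n_prompts: int, dataset_name: str = "cais/mmlu", subset: str = "all",
                split: str = "test", column: str = "question") -> list[str]:
    logger.info(f"Loading dataset {dataset_name}...")
    ret = datasets.load_dataset(dataset_name, subset)[split][column]  # type: ignore
    if n_prompts >= 0:
        ret = ret[:n_prompts]  # hypothetical: optionally cap the number of prompts
    return ret
```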
"--ctx-size", str(parallel * ctx_size), | ||
"--model", path_model, | ||
"--port", str(port), | ||
"--swa-full", # FIXME performance bad otherwise |
If you remove this argument and enable `LLAMA_SET_ROWS`, does the performance become good?
I haven't tested it yet.
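For reference, the suggested experiment would amount to something like the following when launching the server; this is a sketch, and treating `LLAMA_SET_ROWS` as an environment variable set to `1` is an assumption based on the comment above rather than something taken from the script:

```python
import os
import subprocess

# Placeholder values; in the script these come from benchmark()'s parameters.
path_server = "./build/bin/llama-server"
path_model = "models/model.gguf"
port = 8080
parallel = 16
ctx_size = 4096

# Launch the server without "--swa-full" but with LLAMA_SET_ROWS=1 in its
# environment to see whether performance is still good.
env = dict(os.environ, LLAMA_SET_ROWS="1")
server = subprocess.Popen(
    [
        path_server,
        "--ctx-size", str(parallel * ctx_size),
        "--model", path_model,
        "--port", str(port),
        # "--swa-full" intentionally omitted for this experiment
    ],
    env=env,
)
```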
This PR adds a simple, self-contained Python script for benchmarking the throughput of the llama.cpp HTTP server. The rationale for adding this script when `tools/server/bench` already exists is that I think that tool is more complex than it needs to be, both in terms of installation and implementation.

Example output of the new `server-bench` script using an RTX 4090:

[example output not captured in this excerpt]

The following plots are produced as well:

[plots not captured in this excerpt]
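For orientation, running the benchmark corresponds to a call like the following; every value here is a made-up placeholder, and in practice the script is presumably invoked from the command line with flags corresponding to these parameters:

```python
# Hypothetical direct call to the benchmark() function shown in the diff above;
# all values below are placeholders chosen for illustration only.
benchmark(
    path_server="./build/bin/llama-server",
    path_model="models/model.gguf",
    path_log="server-bench.log",
    port=8080,
    n_gpu_layers=99,
    parallel=16,
    ctx_size=4096,
    n_prompts=1000,
    n_predict=1024,
)
```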