
scripts: benchmark for HTTP server throughput #14668


Merged: 4 commits merged into ggml-org:master on Jul 14, 2025

Conversation

JohannesGaessler (Collaborator):

This PR adds a simple, self-contained Python script for benchmarking the throughput of the llama.cpp HTTP server. The rationale for adding this script when there is already tools/server/bench is that I think the existing tool is more complex than it needs to be, both in terms of installation and implementation. Example output of the new server-bench script using an RTX 4090:

$ py server-bench.py --path_server /home/johannesg/Projects/llama.cpp/build/bin/llama-server --path_model /opt/models/phi_4_mini_instruct-4b-q4_0.gguf
Loading MMLU dataset...
Starting the llama.cpp server...
Getting the prompt lengths...
Starting the benchmark...

100%|█████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:36<00:00,  4.62it/s]

Benchmark duration:                216.70 s
Request throughput:                4.61 requests/s = 276.87 requests/min
Total prompt length:               28594 tokens
Average prompt length:             28.59 tokens
Average prompt latency:            36.30 ms
Average prompt speed:              787.81 tokens/s
Total generated tokens:            229080
Average generation depth:          216.03 tokens
Average total generation speed:    1057.11 tokens/s
Average generation speed per slot: 66.07 tokens/s / slot
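
(For context, the headline numbers are consistent with one another: 229080 generated tokens / 216.70 s ≈ 1057 tokens/s, and 1057.11 / 66.07 ≈ 16, i.e. the per-slot figure is the total generation speed split across 16 parallel slots; request throughput is 1000 prompts / 216.70 s ≈ 4.61 requests/s.)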

The following plots are produced as well:

[plots: prompt_time, gen_rate]

github-actions bot added the script (Script related) and python (python script changes) labels on Jul 13, 2025
JohannesGaessler (Collaborator, Author):

Comparative results for #14363:

INFO:server-bench: Benchmark duration:                202.32 s
INFO:server-bench: Request throughput:                4.94 requests/s = 296.56 requests/min
INFO:server-bench: Total prompt length:               28594 tokens
INFO:server-bench: Average prompt length:             28.59 tokens
INFO:server-bench: Average prompt latency:            21.13 ms
INFO:server-bench: Average prompt speed:              1353.26 tokens/s
INFO:server-bench: Total generated tokens:            224138
INFO:server-bench: Average generation depth:          195.85 tokens
INFO:server-bench: Average total generation speed:    1107.86 tokens/s
INFO:server-bench: Average generation speed per slot: 69.24 tokens/s / slot
[plots: prompt_time, gen_rate]

At an average depth of only ~200 tokens there is not much of a difference in average performance, but there is still a big gain in terms of consistency.



def benchmark(path_server: str, path_model: str, path_log: Optional[str], port: int, n_gpu_layers: int, parallel: int, ctx_size: int, n_prompts: int, n_predict: int):
    num_workers: int = parallel + 1
Member:

What is the purpose of + 1 here?

JohannesGaessler (Collaborator, Author):

If you set the number of workers exactly equal to the number of slots, the server will be slightly underutilized while the Python code sends the next prompt. With one more Python thread than there are slots, the Python code will already queue the next request while the server is still processing the previous ones.
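
As a rough illustration of the idea (not the script's actual code; the endpoint, payload, and helper names here are assumptions):

import concurrent.futures

import requests  # assumption: the benchmark talks to the server over plain HTTP

def send_request(prompt: str, port: int, n_predict: int) -> dict:
    # hypothetical request shape; the real script may use a different endpoint/payload
    response = requests.post(
        f"http://localhost:{port}/completion",
        json={"prompt": prompt, "n_predict": n_predict},
    )
    return response.json()

def run_all(prompts: list[str], parallel: int, port: int, n_predict: int) -> list[dict]:
    # parallel + 1 workers: with exactly `parallel` workers the server would idle
    # briefly whenever a slot finishes and Python has not yet submitted the next prompt
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel + 1) as pool:
        futures = [pool.submit(send_request, p, port, n_predict) for p in prompts]
        return [f.result() for f in futures]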


def get_prompts(n_prompts: int) -> list[str]:
    logger.info("Loading MMLU dataset...")
    ret = datasets.load_dataset("cais/mmlu", "all")["test"]["question"]  # type: ignore
Member:

Could this dataset be made configurable?

JohannesGaessler (Collaborator, Author):

That is what I intend to do going forward.
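
For illustration only, a configurable version could look roughly like this (hypothetical sketch, not what this PR ships; the parameter names are made up, and it reuses the script's existing logger/datasets imports):

def get_prompts(n_prompts: int, dataset: str = "cais/mmlu", subset: str = "all", column: str = "question") -> list[str]:
    # hypothetical extension: let the caller choose the Hugging Face dataset,
    # subset, and text column instead of hard-coding MMLU
    logger.info(f"Loading dataset {dataset}...")
    ret = datasets.load_dataset(dataset, subset)["test"][column]  # type: ignore
    return ret[:n_prompts] if n_prompts >= 0 else ret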

JohannesGaessler merged commit 494c589 into ggml-org:master on Jul 14, 2025
51 checks passed
"--ctx-size", str(parallel * ctx_size),
"--model", path_model,
"--port", str(port),
"--swa-full", # FIXME performance bad otherwise
Member:

If you remove this argument and enable LLAMA_SET_ROWS, does the performance become good?

JohannesGaessler (Collaborator, Author):

I haven't tested it yet.
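
If someone wants to try it, a rough sketch of such a test (assuming LLAMA_SET_ROWS is read as an environment variable; paths taken from the example above):

import os
import subprocess

# start the server without --swa-full but with LLAMA_SET_ROWS=1 in its environment,
# then rerun server-bench.py against it and compare the throughput numbers
env = dict(os.environ, LLAMA_SET_ROWS="1")
server = subprocess.Popen(
    ["/home/johannesg/Projects/llama.cpp/build/bin/llama-server",
     "--model", "/opt/models/phi_4_mini_instruct-4b-q4_0.gguf",
     "--port", "8080"],
    env=env,
)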
