server: benchmark: chat/completions scenario and other llm servers comparison #5941
Merged

Changes from all commits (15 commits):
- `68d1d8f` server: bench: Init a bench scenario with K6 (phymbert)
- `0b822b6` server: bench: EOL EOF (phymbert)
- `548bc96` server: bench: PR feedback and improved k6 script configuration (phymbert)
- `ab0a59d` server: bench: remove llamacpp_completions_tokens_seconds as it inclu… (phymbert)
- `f425240` server: bench: fix doc (phymbert)
- `bed1cdd` server: bench: change gauge custom metrics to trend (phymbert)
- `572758a` server: bench: change gauge custom metrics to trend (phymbert)
- `06e225f` server: bench: doc add an option to debug http request (phymbert)
- `a4b0d10` server: bench: filter dataset too short and too long sequences (phymbert)
- `29c635b` server: bench: allow to filter out conversation in the dataset based … (phymbert)
- `ba7114c` server: bench: fix assistant message sent instead of user message (phymbert)
- `c4d1b5a` server: bench: fix assistant message sent instead of user message (phymbert)
- `5d25f74` Merge branch 'master' into hp/server/bench/init (ggerganov)
- `52c76d5` server : add defrag thold parameter (ggerganov)
- `6bfb80e` server: bench: select prompts based on the current iteration id not r… (phymbert)
@@ -0,0 +1,88 @@
### Server benchmark tools

The benchmark uses [k6](https://k6.io/).

#### Install k6

Follow the installation instructions at https://k6.io/docs/get-started/installation/

Example for Ubuntu:
```shell
snap install k6
```
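
To check that the install worked:
```shell
k6 version
```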

#### Download a dataset

This dataset was originally proposed in [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
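
Each entry in the file has a `conversations` array whose turns carry `from` and `value` fields; this is what the benchmark script reads. A quick peek at the start of the raw JSON is enough to confirm the download and the structure:
```shell
head -c 600 ShareGPT_V3_unfiltered_cleaned_split.json
```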

#### Download a model

Example for PHI-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server

The server must answer OAI chat completion requests on `http://localhost:8080/v1`, or on the URL set in the `SERVER_BENCH_URL` environment variable.

Example:
```shell
server --host localhost --port 8080 \
    --model ggml-model-q4_0.gguf \
    --cont-batching \
    --metrics \
    --parallel 8 \
    --batch-size 512 \
    --ctx-size 4096 \
    --log-format text \
    -ngl 33
```
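
Before benchmarking, you can send a single request to the endpoint the script will hit, mirroring the payload built by `script.js` (`my-model` below is the script's default model alias):
```shell
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "my-model",
        "max_tokens": 32,
        "messages": [{"role": "user", "content": "Hello"}]
    }'
```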

#### Run the benchmark

For 500 chat completion requests with 8 concurrent users over a maximum of 10 minutes, run:
```shell
k6 run script.js --duration 10m --iterations 500 --vus 8
```

The benchmark values can be overridden with:
- `SERVER_BENCH_URL` server URL prefix for chat completions, default `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS` total prompts to select from the dataset, default `480`
- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default `my-model`
- `SERVER_BENCH_MAX_TOKENS` max tokens to predict, default `512`
- `SERVER_BENCH_DATASET` path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS` maximum prompt tokens; conversations with longer prompts are filtered out of the dataset, default `1024`
- `SERVER_BENCH_MAX_CONTEXT` maximum context size (prompt + predicted tokens) of a completion request; longer conversations are filtered out of the dataset, default `2048`

Note: the local tokenizer is a simple whitespace split, so the real number of tokens will differ.

Or together with [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```

To [debug HTTP requests](https://k6.io/docs/using-k6/http-debugging/), use `--http-debug="full"`.

#### Metrics

The following metrics are computed from the `usage` field of the OAI chat completions response:
- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate` Rate of truncated completions, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. `finish_reason === 'stop'`

The script will abort if too many completions are truncated (see the `llamacpp_completions_truncated_rate` threshold).
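
To keep the end-of-test summary, including the `llamacpp_*` custom trends and counters, for later comparison, k6 can export it to JSON (availability of the `--summary-export` flag depends on your k6 version):
```shell
k6 run script.js --duration 10m --iterations 500 --vus 8 --summary-export=k6-summary.json
```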

k6 metrics can be compared against the [server metrics](../README.md), obtained with:

```shell
curl http://localhost:8080/metrics
```
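
For a quick side-by-side comparison, the Prometheus output can be filtered down to the llama.cpp counters; the exact metric names depend on the server build, so adjust the pattern if needed:
```shell
curl -s http://localhost:8080/metrics | grep -i llamacpp
```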
@@ -0,0 +1,120 @@
import http from 'k6/http'
import {check, sleep} from 'k6'
import {SharedArray} from 'k6/data'
import {Counter, Rate, Trend} from 'k6/metrics'
import exec from 'k6/execution';

// Server chat completions prefix
const server_url = __ENV.SERVER_BENCH_URL ? __ENV.SERVER_BENCH_URL : 'http://localhost:8080/v1'

// Number of total prompts in the dataset - default 10m / 10 seconds/request * number of users
const n_prompt = __ENV.SERVER_BENCH_N_PROMPTS ? parseInt(__ENV.SERVER_BENCH_N_PROMPTS) : 600 / 10 * 8

// Model name to request
const model = __ENV.SERVER_BENCH_MODEL_ALIAS ? __ENV.SERVER_BENCH_MODEL_ALIAS : 'my-model'

// Dataset path
const dataset_path = __ENV.SERVER_BENCH_DATASET ? __ENV.SERVER_BENCH_DATASET : './ShareGPT_V3_unfiltered_cleaned_split.json'

// Max tokens to predict
const max_tokens = __ENV.SERVER_BENCH_MAX_TOKENS ? parseInt(__ENV.SERVER_BENCH_MAX_TOKENS) : 512

// Max prompt tokens
const n_prompt_tokens = __ENV.SERVER_BENCH_MAX_PROMPT_TOKENS ? parseInt(__ENV.SERVER_BENCH_MAX_PROMPT_TOKENS) : 1024

// Max slot context
const n_ctx_slot = __ENV.SERVER_BENCH_MAX_CONTEXT ? parseInt(__ENV.SERVER_BENCH_MAX_CONTEXT) : 2048

export function setup() {
    console.info(`Benchmark config: server_url=${server_url} n_prompt=${n_prompt} model=${model} dataset_path=${dataset_path} max_tokens=${max_tokens}`)
}

const data = new SharedArray('conversations', function () {
    const tokenizer = (message) => message.split(/[\s,'".?]/)

    return JSON.parse(open(dataset_path))
        // Filter out the conversations with less than 2 turns.
        .filter(data => data["conversations"].length >= 2)
        .filter(data => data["conversations"][0]["from"] === "human")
        .map(data => {
            return {
                prompt: data["conversations"][0]["value"],
                n_prompt_tokens: tokenizer(data["conversations"][0]["value"]).length,
                n_completion_tokens: tokenizer(data["conversations"][1]["value"]).length,
            }
        })
        // Filter out too short sequences
        .filter(conv => conv.n_prompt_tokens >= 4 && conv.n_completion_tokens >= 4)
        // Filter out too long sequences.
        .filter(conv => conv.n_prompt_tokens <= n_prompt_tokens && conv.n_prompt_tokens + conv.n_completion_tokens <= n_ctx_slot)
        // Keep only first n prompts
        .slice(0, n_prompt)
})

const llamacpp_prompt_tokens = new Trend('llamacpp_prompt_tokens')
const llamacpp_completion_tokens = new Trend('llamacpp_completion_tokens')
const llamacpp_tokens_second = new Trend('llamacpp_tokens_second')

const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter')
const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter')

const llamacpp_completions_truncated_rate = new Rate('llamacpp_completions_truncated_rate')
const llamacpp_completions_stop_rate = new Rate('llamacpp_completions_stop_rate')

export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: [
            // abort the test if more than 80% of the completions are truncated
            {threshold: 'rate < 0.8', abortOnFail: true, delayAbortEval: '1m'},
        ],
    },
    duration: '10m',
    vus: 8,
}

export default function () {
    const conversation = data[exec.scenario.iterationInInstance % data.length]
    const payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant.",
            },
            {
                "role": "user",
                "content": conversation.prompt,
            }
        ],
        "model": model,
        "stream": false,
        "max_tokens": max_tokens
    }

    const body = JSON.stringify(payload)

    let res = http.post(`${server_url}/chat/completions`, body, {
        headers: {'Content-Type': 'application/json'},
        timeout: '300s'
    })

    check(res, {'success completion': (r) => r.status === 200})

    if (res.status === 200) {
        const completions = res.json()

        llamacpp_prompt_tokens.add(completions.usage.prompt_tokens)
        llamacpp_prompt_tokens_total_counter.add(completions.usage.prompt_tokens)

        llamacpp_completion_tokens.add(completions.usage.completion_tokens)
        llamacpp_completion_tokens_total_counter.add(completions.usage.completion_tokens)

        llamacpp_completions_truncated_rate.add(completions.choices[0].finish_reason === 'length')
        llamacpp_completions_stop_rate.add(completions.choices[0].finish_reason === 'stop')

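        // res.timings.duration is in milliseconds, hence * 1.e3 to report tokens per second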
        llamacpp_tokens_second.add(completions.usage.total_tokens / res.timings.duration * 1.e3)
    } else {
        console.error(`response: ${res.body} request=${JSON.stringify(payload)}`)
    }

    sleep(0.3)
}