## Description

### Motivation
llama.cpp is under active development: new LLM papers are implemented quickly (which is a good thing) and backend/device optimizations are continuously added.

All these factors have an impact on server performance, especially on the following metrics:
- latency: pp (prompt processing) + tg (token generation) per request
- server latency: total pp+tg per second across all requests with continuous batching
- concurrency: how many concurrent requests/users the server can handle in parallel
- VRAM usage
- RAM usage
- GPU usage
- CPU usage
It is important to monitor and control the impact of codebase evolution on these metrics.

Since #5941, we have a server bench framework; we can now trigger it based on different events:
- scheduled on the master branch
- on PR pushes
The approach should be reproducible: use the same hardware architecture, the same model sizes and quants.

It would be nice to follow performance changes on a time-series graph, as is done in Apache Lucene.
### Proposed approach

The bench will run on a T4 GPU node in Azure Cloud:
- Standard_NC4as_T4_v3
- Ubuntu 20.04.1
- 4 vCPU
- 28GB RAM
- 1 NVIDIA Tesla T4
- 16GB VRAM
- /dev/sdb, 256GB standard SSD, mounted at /
- /dev/sda, 1TB premium SSD, mounted at /mnt
It will be registered as a GitHub self-hosted runner, with Prometheus installed.
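Prometheus will scrape the server's `/metrics` endpoint (enabled with `--metrics`, see the parameters below). As an illustration only, here is a minimal sketch of a scrape configuration generated from Python; the host, port, scrape interval and file path are assumptions, not the actual CI setup.

```python
# Minimal sketch (not the actual CI config): generate a Prometheus scrape
# configuration for the llama.cpp server /metrics endpoint.
# Assumptions: the server listens on localhost:8080 and was started with
# --metrics; the 5s interval and output path are illustrative only.
import yaml  # pip install pyyaml

scrape_config = {
    "global": {"scrape_interval": "5s"},
    "scrape_configs": [
        {
            "job_name": "llama-server-bench",
            "static_configs": [{"targets": ["localhost:8080"]}],
            # the server exposes Prometheus-compatible metrics on /metrics
            # when launched with --metrics
            "metrics_path": "/metrics",
        }
    ],
}

with open("prometheus.yml", "w") as f:
    yaml.safe_dump(scrape_config, f, sort_keys=False)
```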
A GitHub workflow will:
- build the server target using cmake, with the `Release` build type and `LLAMA_CUDA` with the `native` CUDA architecture
- for each set of bench parameters (a rough sketch of the wrapper driving these steps follows this list):
  - start the server
  - configure Prometheus scraping of the server instance
  - wait for the server to be ready
  - build the relevant dataset for the test
  - start the performance test scenario using the right dataset
  - export the results to JSON
  - download the Prometheus metrics graph
  - plot the results into time-series images
  - add a comment on the PR with the metrics result images
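As a rough sketch of that wrapper (also listed in the tasks below: start the server, run k6, collect metrics), assuming the server listens on port 8080 and the k6 scenario lives in `script.js`; the real implementation belongs in `examples/server/bench`:

```python
#!/usr/bin/env python3
"""Rough sketch of the bench wrapper: start the server, wait for it to be
ready, run the k6 scenario, and export the results. Paths, ports and flag
values are assumptions, not the final implementation."""
import subprocess
import time

import requests

SERVER_BIN = "./build/bin/server"   # assumed cmake Release build output
K6_SCRIPT = "script.js"             # assumed k6 scenario file
BASE_URL = "http://localhost:8080"

def wait_for_server(timeout_s: int = 60) -> None:
    """Poll the /health endpoint until the model is loaded."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(1)
    raise TimeoutError("server did not become ready in time")

# 1. start the server with the bench parameters
server = subprocess.Popen([
    SERVER_BIN, "--host", "0.0.0.0", "--port", "8080",
    "--metrics", "--cont-batching", "--log-disable",
])
try:
    wait_for_server()
    # 2. run the k6 scenario and export the aggregated results to JSON
    subprocess.run(["k6", "run", "--summary-export", "k6-summary.json",
                    K6_SCRIPT], check=True)
finally:
    server.terminate()
    server.wait()
```

k6's `--summary-export` writes the end-of-test summary to a JSON file, which the workflow can then plot and attach to the PR comment.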
### Technical considerations

One important aspect of this configuration is to make it easy to add more nodes in the future. If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.

All the code used must be stored in the `examples/server/bench` folder.
### GitHub self-hosted runner security

> Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.
By design, we will be using just-in-time runners:
- with ggml-ci in a docker container, a loop looks for a new workflow job waiting for the host GPU series label, then:
  - creates a just-in-time runner configuration with this label
  - starts a rootless docker container with the NVIDIA docker runtime and the JIT configuration token
  - starts the GitHub runner within the container
  - waits for the container to exit
  - restarts the loop
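A minimal sketch of that loop, assuming the GitHub REST `generate-jitconfig` endpoint and the upstream `actions-runner` image; the job-polling step is omitted and the labels, image and repository are assumptions, not the exact ggml-ci implementation:

```python
"""Minimal sketch of the just-in-time runner loop described above."""
import subprocess
import time

import requests

REPO = "ggerganov/llama.cpp"      # target repository (assumption)
LABEL = "Standard_NC4as_T4_v3"    # host GPU series label (assumption)
TOKEN = "<fine-grained PAT with Actions permissions>"

def generate_jit_config() -> str:
    """Ask GitHub for a single-use (just-in-time) runner configuration."""
    r = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/runners/generate-jitconfig",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"name": f"t4-jit-{int(time.time())}",
              "runner_group_id": 1,
              "labels": ["self-hosted", LABEL]},
    )
    r.raise_for_status()
    return r.json()["encoded_jit_config"]

while True:
    jit_config = generate_jit_config()
    # rootless docker + NVIDIA runtime; a JIT runner exits after one job
    subprocess.run([
        "docker", "run", "--rm", "--gpus", "all",
        "ghcr.io/actions/actions-runner:latest",   # image name is an assumption
        "./run.sh", "--jitconfig", jit_config,
    ], check=False)
    time.sleep(5)  # then register a fresh single-use runner for the next job
```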
As GitHub checks can only be triggered by collaborators and the job runs in a non-root docker container, I think we are safe.
### Server scenario parameters matrix

| scenario | duration | users | hf-repo | hf-file | model-alias | model-size | model-type | ngl | parallel | ctx-size | batch-size | ubatch-size | n-predict | grp-attn-n | grp-attn-w | embeddings | CUDA_VISIBLE_DEVICES | SERVER_BENCH_N_PROMPTS | SERVER_BENCH_MAX_PROMPT_TOKENS | SERVER_BENCH_MAX_CONTEXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| completions | 10m | 8 | TODO | | phi2 | 3B | F16 | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| completions | 10m | 8 | ggml-org/models | phi-2/ggml-model-q4_0.gguf | phi2 | 3B | MOSTLY_Q4_K_M | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| embeddings | 5m | 8 | ggml-org/models | bert-bge-large/ggml-model-f16.gguf | bert-bge-large | ? | F16 | TODO | 8 | 16384 | 4096 | 4096 | NA | NA | NA | true | 0 | 1000 | 4096 | NA |
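For illustration, here is how one row of the matrix (the second completions row) could map to server arguments. The flag names are assumed from the common llama.cpp CLI and should be checked against `server --help`; this is not the final bench code.

```python
# Illustrative only: map one matrix row to llama.cpp server arguments.
row = {
    "hf-repo": "ggml-org/models", "hf-file": "phi-2/ggml-model-q4_0.gguf",
    "ngl": 33, "parallel": 8, "ctx-size": 16384,
    "batch-size": 2048, "ubatch-size": 256,
    "grp-attn-n": 1, "grp-attn-w": 512, "embeddings": False,
}

args = [
    "--hf-repo", row["hf-repo"], "--hf-file", row["hf-file"],
    "--n-gpu-layers", str(row["ngl"]),
    "--parallel", str(row["parallel"]),
    "--ctx-size", str(row["ctx-size"]),
    "--batch-size", str(row["batch-size"]),
    "--ubatch-size", str(row["ubatch-size"]),
    "--grp-attn-n", str(row["grp-attn-n"]),
    "--grp-attn-w", str(row["grp-attn-w"]),
]
if row["embeddings"]:
    args.append("--embeddings")  # may be spelled --embedding depending on version
print(" ".join(args))
```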
In addition, the following parameters will be used:
- `--log-disable`: no need to have a log file
- `--metrics`: to allow Prometheus metrics scraping
- `--cont-batching`: probably needs to be enabled by default, see server: enable --cont-batching by default #6229
- `--threads 1`: we will test only with all layers offloaded to GPU
- `--threads-batch 1`: we will test only with all layers offloaded to GPU
- `--model ggml-model.gguf`: as we can now download anything from HF
- `--defrag-thold 0.1`
Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.
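For reference, this is the kind of request the completions scenario will issue (the actual load is generated by k6): an OAI-compatible chat completion with `"stream": true`, consumed as server-sent events. The port, model alias and prompt below are placeholders.

```python
# Sketch of a streaming OAI chat completion request against the server.
import json

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "phi2",            # model alias from the matrix
        "stream": True,             # streaming / SSE enabled
        "max_tokens": 2048,         # n-predict from the matrix
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain continuous batching briefly."},
        ],
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":        # end of the SSE stream
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```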
### Dataset consideration

- the dataset must contain system, assistant and user prompts (in order to test chat template overhead, if any)
- randomness must not be used to select prompts: running the test twice must output almost the same metrics
- it must be possible to select prompts so that they fit (or not) in the KV cache, using the parameters listed in bench/README.md (a filtering sketch follows this list):
  - `SERVER_BENCH_N_PROMPTS`: total number of prompts to select for the benchmark
  - `SERVER_BENCH_MAX_PROMPT_TOKENS`: maximum prompt tokens; longer prompts are filtered out of the dataset
  - `SERVER_BENCH_MAX_CONTEXT`: maximum context size of the completion requests (prompt + predicted tokens); larger ones are filtered out of the dataset
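A rough sketch of how these constraints could be applied to a ShareGPT-style dataset; the file name, field names and the token estimate are assumptions, and the real selection logic lives in `examples/server/bench`:

```python
# Deterministic prompt selection: no randomness, dataset order is preserved,
# so two runs select the same prompts and produce comparable metrics.
import json
import os

N_PROMPTS = int(os.environ.get("SERVER_BENCH_N_PROMPTS", 1000))
MAX_PROMPT_TOKENS = int(os.environ.get("SERVER_BENCH_MAX_PROMPT_TOKENS", 1024))
MAX_CONTEXT = int(os.environ.get("SERVER_BENCH_MAX_CONTEXT", 1024))

def approx_tokens(text: str) -> int:
    # crude word-based estimate, good enough to bound prompt length here
    return len(text.split())

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:  # assumed file name
    dataset = json.load(f)

selected = []
for conv in dataset:
    turns = conv.get("conversations", [])
    if len(turns) < 2:
        continue
    prompt, answer = turns[0]["value"], turns[1]["value"]
    n_prompt, n_predict = approx_tokens(prompt), approx_tokens(answer)
    if n_prompt > MAX_PROMPT_TOKENS:
        continue   # SERVER_BENCH_MAX_PROMPT_TOKENS
    if n_prompt + n_predict > MAX_CONTEXT:
        continue   # SERVER_BENCH_MAX_CONTEXT: prompt + predicted tokens
    selected.append(prompt)
    if len(selected) == N_PROMPTS:
        break      # SERVER_BENCH_N_PROMPTS
print(f"selected {len(selected)} prompts")
```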
Selected datasets:

| scenario | dataset | comment |
|---|---|---|
| completions | ShareGPT_Vicuna_unfiltered | taken from VLLM to have a baseline |
| embeddings | IMDB Data | suggested by @ngxson, looks good for embeddings |
### Tasks
- Have a dedicated GPU node (T4), thanks to @aigrant for ggml
- Install drivers on the GPU nodes (was not so easy actually):
  - as noted there: do not install the NVIDIA third-party repo before installing the Ubuntu signed shipped drivers
  - need to install `alsa-utils` in order to prevent `could not open aplay -l` during installation
- Select the right datasets
- Add `install-docker.sh` in ggml-ci: ci: add install-docker.sh ci#1
- Setup github-runners-manager: JIT GitHub docker runner ci#2
- Support curl in docker images: support LLAMA_USE_CURL in docker images #6291, server: add cURL support to server Dockerfiles #6474
- Write a simple GitHub workflow with k6: server: continuous performance monitoring and PR comment #6283
- Comment the `--ubatch-size` option in the README: server: docs: `--threads` and `--threads-batch`, `--ubatch-size`, `--log-disable` #6254
- server: comment --threads option behavior #6230
- server: doc: document the `--defrag-thold` option #6293
- Rewrite the bench scenario to support streaming/SSE: sse: support Server Sent Event grafana/k6#3639, https://github.com/phymbert/xk6-sse, ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response #6495
- Write the embeddings scenario
- Add some models in ggml-models HF repo #6292
- Write a python script to wrap the bench step: start the server, run k6, collect metrics
- Add MOE model after receiving feedback about the current approach
- After enough commit history, make a performance history dashboard