## Description

### Motivation
llama.cpp is under active development: new LLM papers are implemented quickly (which is a good thing) and backend/device optimizations are continuously added.

All these factors have an impact on server performance, especially on the following metrics:
- latency: pp (prompt processing) + tg (token generation) per request
- server latency: total pp+tg per second across all requests with continuous batching
- concurrency: how many concurrent requests/users the server can handle in parallel
- VRAM usage
- RAM usage
- GPU usage
- CPU usage
It is important to monitor and control the impact of codebase evolution on these metrics.

Since #5941, we have a server bench framework; we can now trigger it based on different events:
- scheduled on the master branch
- on PR pushes
The approach should be reproducible: use the same hardware architecture, the same model sizes and quants.

It would be nice to follow performance changes on a time-series graph, as is done in Apache Lucene.
### Proposed approach

The bench will run on a T4 GPU node in Azure Cloud:
- Standard_NC4as_T4_v3
- Ubuntu 20.04.1
- 4 vCPU
- 28GB RAM
- 1 NVIDIA Tesla T4
- 16GB VRAM
- /dev/sdb, 256GB standard SSD, mounted at /
- /dev/sda, 1TB premium SSD, mounted at /mnt
It will be registered as a GitHub self-hosted runner, with Prometheus installed.
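Prometheus will scrape the server's `/metrics` endpoint (enabled with `--metrics`, see the parameters below). As an illustration only, here is a minimal sketch of a scrape configuration generated from Python; the host, port, scrape interval and file path are assumptions, not the actual CI setup.

```python
# Minimal sketch (not the actual CI config): generate a Prometheus scrape
# configuration for the llama.cpp server /metrics endpoint.
# Assumptions: the server listens on localhost:8080 and was started with
# --metrics; the 5s interval and output path are illustrative only.
import yaml  # pip install pyyaml

scrape_config = {
    "global": {"scrape_interval": "5s"},
    "scrape_configs": [
        {
            "job_name": "llama-server-bench",
            "static_configs": [{"targets": ["localhost:8080"]}],
            # the server exposes Prometheus-compatible metrics on /metrics
            # when launched with --metrics
            "metrics_path": "/metrics",
        }
    ],
}

with open("prometheus.yml", "w") as f:
    yaml.safe_dump(scrape_config, f, sort_keys=False)
```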
A GitHub workflow will:
- build the server target using cmake, with the `Release` build type and `LLAMA_CUDA` with the `native` CUDA architecture
- for each set of bench parameters (a rough sketch of the wrapper driving these steps follows this list):
  - start the server
  - configure Prometheus scraping of the server instance
  - wait for the server to be ready
  - build the relevant dataset for the test
  - start the performance test scenario using the right dataset
  - export the results to JSON
  - download the Prometheus metrics graph
  - plot the results into time-series images
  - add a comment on the PR with the metrics result images
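As a rough sketch of that wrapper (also listed in the tasks below: start the server, run k6, collect metrics), assuming the server listens on port 8080 and the k6 scenario lives in `script.js`; the real implementation belongs in `examples/server/bench`:

```python
#!/usr/bin/env python3
"""Rough sketch of the bench wrapper: start the server, wait for it to be
ready, run the k6 scenario, and export the results. Paths, ports and flag
values are assumptions, not the final implementation."""
import subprocess
import time

import requests

SERVER_BIN = "./build/bin/server"   # assumed cmake Release build output
K6_SCRIPT = "script.js"             # assumed k6 scenario file
BASE_URL = "http://localhost:8080"

def wait_for_server(timeout_s: int = 60) -> None:
    """Poll the /health endpoint until the model is loaded."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(1)
    raise TimeoutError("server did not become ready in time")

# 1. start the server with the bench parameters
server = subprocess.Popen([
    SERVER_BIN, "--host", "0.0.0.0", "--port", "8080",
    "--metrics", "--cont-batching", "--log-disable",
])
try:
    wait_for_server()
    # 2. run the k6 scenario and export the aggregated results to JSON
    subprocess.run(["k6", "run", "--summary-export", "k6-summary.json",
                    K6_SCRIPT], check=True)
finally:
    server.terminate()
    server.wait()
```

k6's `--summary-export` writes the end-of-test summary to a JSON file, which the workflow can then plot and attach to the PR comment.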
### Technical considerations

One important aspect of this configuration is to make it easy to add more nodes in the future. If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.

All the code used must be stored in the `examples/server/bench` folder.
### GitHub self-hosted runner security

> Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.
By design, we will be using just-in-time runners:
- with ggml-ci in a docker container, a loop looks for a new workflow job waiting for the host GPU series label, then:
  - creates a just-in-time runner configuration with this label
  - starts a rootless docker container with the NVIDIA docker runtime and the JIT configuration token
  - starts the GitHub runner within the container
  - waits for the container to exit
  - restarts the loop
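A minimal sketch of that loop, assuming the GitHub REST `generate-jitconfig` endpoint and the upstream `actions-runner` image; the job-polling step is omitted and the labels, image and repository are assumptions, not the exact ggml-ci implementation:

```python
"""Minimal sketch of the just-in-time runner loop described above."""
import subprocess
import time

import requests

REPO = "ggerganov/llama.cpp"      # target repository (assumption)
LABEL = "Standard_NC4as_T4_v3"    # host GPU series label (assumption)
TOKEN = "<fine-grained PAT with Actions permissions>"

def generate_jit_config() -> str:
    """Ask GitHub for a single-use (just-in-time) runner configuration."""
    r = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/runners/generate-jitconfig",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"name": f"t4-jit-{int(time.time())}",
              "runner_group_id": 1,
              "labels": ["self-hosted", LABEL]},
    )
    r.raise_for_status()
    return r.json()["encoded_jit_config"]

while True:
    jit_config = generate_jit_config()
    # rootless docker + NVIDIA runtime; a JIT runner exits after one job
    subprocess.run([
        "docker", "run", "--rm", "--gpus", "all",
        "ghcr.io/actions/actions-runner:latest",   # image name is an assumption
        "./run.sh", "--jitconfig", jit_config,
    ], check=False)
    time.sleep(5)  # then register a fresh single-use runner for the next job
```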
As GitHub checks can only be triggered by collaborators and the job runs in a non-root docker container, I think we are safe.
### Server scenario parameters matrix

| scenario | duration | users | hf-repo | hf-file | model-alias | model-size | model-type | ngl | parallel | ctx-size | batch-size | ubatch-size | n-predict | grp-attn-n | grp-attn-w | embeddings | CUDA_VISIBLE_DEVICES | SERVER_BENCH_N_PROMPTS | SERVER_BENCH_MAX_PROMPT_TOKENS | SERVER_BENCH_MAX_CONTEXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| completions | 10m | 8 | TODO | | phi2 | 3B | F16 | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| completions | 10m | 8 | ggml-org/models | phi-2/ggml-model-q4_0.gguf | phi2 | 3B | MOSTLY_Q4_K_M | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| embeddings | 5m | 8 | ggml-org/models | bert-bge-large/ggml-model-f16.gguf | bert-bge-large | ? | F16 | TODO | 8 | 16384 | 4096 | 4096 | NA | NA | NA | true | 0 | 1000 | 4096 | NA |
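For illustration, here is how one row of the matrix (the second completions row) could map to server arguments. The flag names are assumed from the common llama.cpp CLI and should be checked against `server --help`; this is not the final bench code.

```python
# Illustrative only: map one matrix row to llama.cpp server arguments.
row = {
    "hf-repo": "ggml-org/models", "hf-file": "phi-2/ggml-model-q4_0.gguf",
    "ngl": 33, "parallel": 8, "ctx-size": 16384,
    "batch-size": 2048, "ubatch-size": 256,
    "grp-attn-n": 1, "grp-attn-w": 512, "embeddings": False,
}

args = [
    "--hf-repo", row["hf-repo"], "--hf-file", row["hf-file"],
    "--n-gpu-layers", str(row["ngl"]),
    "--parallel", str(row["parallel"]),
    "--ctx-size", str(row["ctx-size"]),
    "--batch-size", str(row["batch-size"]),
    "--ubatch-size", str(row["ubatch-size"]),
    "--grp-attn-n", str(row["grp-attn-n"]),
    "--grp-attn-w", str(row["grp-attn-w"]),
]
if row["embeddings"]:
    args.append("--embeddings")  # may be spelled --embedding depending on version
print(" ".join(args))
```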
In addition, the following parameters will be used:
- `--log-disable`: no need to have a log file
- `--metrics`: to allow Prometheus metrics scraping
- `--cont-batching`: probably needs to be enabled by default, see server: enable --cont-batching by default #6229
- `--threads 1`: we will test only with all layers offloaded to GPU
- `--threads-batch 1`: we will test only with all layers offloaded to GPU
- `--model ggml-model.gguf`: as we can now download anything from HF
- `--defrag-thold 0.1`
Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.
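For reference, this is the kind of request the completions scenario will issue (the actual load is generated by k6): an OAI-compatible chat completion with `"stream": true`, consumed as server-sent events. The port, model alias and prompt below are placeholders.

```python
# Sketch of a streaming OAI chat completion request against the server.
import json

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "phi2",            # model alias from the matrix
        "stream": True,             # streaming / SSE enabled
        "max_tokens": 2048,         # n-predict from the matrix
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain continuous batching briefly."},
        ],
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":        # end of the SSE stream
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```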
### Dataset consideration

- the dataset must contain system, assistant and user prompts (in order to test chat template overhead, if any)
- randomness must not be used to select prompts: running the test twice must output almost the same metrics
- it must be possible to select prompts so that they fit (or not) in the KV cache, using the parameters listed in bench/README.md (a filtering sketch follows this list):
  - `SERVER_BENCH_N_PROMPTS`: total number of prompts to select for the benchmark
  - `SERVER_BENCH_MAX_PROMPT_TOKENS`: maximum prompt tokens; longer prompts are filtered out of the dataset
  - `SERVER_BENCH_MAX_CONTEXT`: maximum context size of the completion requests (prompt + predicted tokens); larger ones are filtered out of the dataset
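A rough sketch of how these constraints could be applied to a ShareGPT-style dataset; the file name, field names and the token estimate are assumptions, and the real selection logic lives in `examples/server/bench`:

```python
# Deterministic prompt selection: no randomness, dataset order is preserved,
# so two runs select the same prompts and produce comparable metrics.
import json
import os

N_PROMPTS = int(os.environ.get("SERVER_BENCH_N_PROMPTS", 1000))
MAX_PROMPT_TOKENS = int(os.environ.get("SERVER_BENCH_MAX_PROMPT_TOKENS", 1024))
MAX_CONTEXT = int(os.environ.get("SERVER_BENCH_MAX_CONTEXT", 1024))

def approx_tokens(text: str) -> int:
    # crude word-based estimate, good enough to bound prompt length here
    return len(text.split())

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:  # assumed file name
    dataset = json.load(f)

selected = []
for conv in dataset:
    turns = conv.get("conversations", [])
    if len(turns) < 2:
        continue
    prompt, answer = turns[0]["value"], turns[1]["value"]
    n_prompt, n_predict = approx_tokens(prompt), approx_tokens(answer)
    if n_prompt > MAX_PROMPT_TOKENS:
        continue   # SERVER_BENCH_MAX_PROMPT_TOKENS
    if n_prompt + n_predict > MAX_CONTEXT:
        continue   # SERVER_BENCH_MAX_CONTEXT: prompt + predicted tokens
    selected.append(prompt)
    if len(selected) == N_PROMPTS:
        break      # SERVER_BENCH_N_PROMPTS
print(f"selected {len(selected)} prompts")
```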
Selected datasets:

| scenario | dataset | comment |
|---|---|---|
| completions | ShareGPT_Vicuna_unfiltered | taken from VLLM to have a baseline |
| embeddings | IMDB Data | suggested by @ngxson, looks good for embeddings |
### Tasks
- Have a dedicated GPU node (T4), thanks to @aigrant for ggml
- Install drivers on the GPU nodes (was not so easy actually):
  - as noted there: do not install the NVIDIA third-party repo before installing the Ubuntu signed shipped drivers
  - need to install `alsa-utils` in order to prevent `could not open aplay -l` during installation
- Select the right datasets
- Add `install-docker.sh` in ggml-ci: ci: add install-docker.sh ci#1
- Setup github-runners-manager: JIT GitHub docker runner ci#2
- Support curl in docker images: support LLAMA_USE_CURL in docker images #6291, server: add cURL support to server Dockerfiles #6474
- Write a simple GitHub workflow with k6: server: continuous performance monitoring and PR comment #6283
- Comment the `--ubatch-size` option in the README: server: docs: `--threads` and `--threads-batch`, `--ubatch-size`, `--log-disable` #6254
- server: comment --threads option behavior #6230
- server: doc: document the `--defrag-thold` option #6293
- Rewrite the bench scenario to support streaming/SSE: sse: support Server Sent Event grafana/k6#3639, https://github.com/phymbert/xk6-sse, ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response #6495
- Write the embeddings scenario
- Add some models in ggml-models HF repo #6292
- Write a python script to wrap the bench step: start the server, run k6, collect metrics
- Add MOE model after receiving feedback about the current approach
- After enough commit history, make a performance history dashboard