
server: bench: continuous performance testing #6233

Closed

Description

@phymbert

Motivation

llama.cpp is under active development: new papers on LLMs are implemented quickly (for the better) and backend device
optimizations are continuously added.

All these factors have an impact on server performance, especially on the following metrics:

  1. latency: pp (prompt processing) + tg (token generation) per request
  2. server latency: total pp+tg per second across all requests with continuous batching
  3. concurrency: how many concurrent requests/users the server can handle in parallel
  4. VRAM usage
  5. RAM usage
  6. GPU usage
  7. CPU usage

It is important to monitor and control the impact of codebase evolution on these metrics, for example:

[prompt_tokens_seconds time series graph]

Since #5941, we have a server bench framework; we can now trigger it on different events:

  1. scheduled runs on the master branch
  2. on PR pushes

The approach should be reproducible: use the same hardware architecture and the same model sizes and quants.

It would be nice to follow performance changes on a time series graph like it is done in Apache Lucene.

Proposed approach

The bench will run on a T4 GPU node in Azure Cloud:

  • Standard_NC4as_T4_v3
  • Ubuntu 20.04.1
  • 4 vCPUs
  • 28GB RAM
  • 1 NVIDIA Tesla T4
  • 16GB VRAM
  • /dev/sdb, 256GB standard SSD, mounted at /
  • /dev/sda, 1TB premium SSD, mounted at /mnt

The node is configured as a GitHub self-hosted runner with Prometheus installed.
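
As an illustration only, a minimal Prometheus scrape configuration pointing at the server's /metrics endpoint could be generated like the sketch below; the job name, port and scrape interval are assumptions, not the final setup.

```python
# Sketch: write a minimal Prometheus scrape configuration for the llama.cpp
# server /metrics endpoint (job name, port and interval are illustrative).
import yaml  # PyYAML, assumed to be available on the runner

scrape_config = {
    "scrape_configs": [
        {
            "job_name": "llama-server-bench",   # illustrative job name
            "scrape_interval": "5s",
            "static_configs": [{"targets": ["localhost:8080"]}],  # default server port
        }
    ]
}

with open("prometheus.yml", "w") as f:
    yaml.safe_dump(scrape_config, f)
```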

A GitHub workflow will:

  1. build the server target using the cmake Release build type and LLAMA_CUDA with the native CUDA architecture
  2. for each set of bench parameters (a rough sketch of steps 3 to 8 follows this list):
  3. start the server
  4. configure prometheus scraping of the server instance
  5. wait for the server to start
  6. build the relevant dataset for the test
  7. start the performance test scenario using the right dataset
  8. export the results to json
  9. download the prometheus metrics graph
  10. plot the results into time series images
  11. add a comment in the PR with the metrics results images
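
A rough Python sketch of what steps 3 to 8 could look like for a single parameter set; the server binary path, the config keys, the port and the results.json file name are assumptions, not the final implementation.

```python
# Hypothetical orchestration sketch: start the server with one parameter set,
# wait for readiness, run the scenario, and load the exported results.
import json
import subprocess
import time

import requests  # assumed to be available on the runner


def wait_for_server(base_url: str, timeout_s: int = 600) -> None:
    """Poll the server /health endpoint until it reports ready."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError("server did not become ready in time")


def run_bench(config: dict) -> dict:
    """Run one bench configuration and return the exported results."""
    server = subprocess.Popen(["./server", *config["server_args"]])  # illustrative binary path
    try:
        wait_for_server("http://localhost:8080")
        # run the load test scenario (k6, Python, ...) against the selected dataset
        subprocess.run(config["bench_cmd"], check=True)
        with open("results.json") as f:  # hypothetical export path
            return json.load(f)
    finally:
        server.terminate()
        server.wait()
```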

Technical considerations

One important aspect of this configuration is to make it easy to add more nodes in the future.
If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.
All the code used must be stored in the examples/server/bench folder.

GitHub Self-Hosted runner security

Self-hosted runner security:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your
public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request
that executes the code in a workflow.

By design, we will be using just-in-time runners (a rough sketch of this loop follows the list):

  1. with ggml-ci in a docker container, loop looking for new workflow jobs waiting for the host GPU series type label
  2. create a configuration for a just-in-time runner with this label
  3. start a rootless docker container with the nvidia docker runtime and the JIT configuration token
  4. start the GitHub runner within the container
  5. wait for the container to exit
  6. restart the loop
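
A rough sketch of the host-side loop, assuming the GitHub REST endpoint for generating a just-in-time runner configuration and a Docker image that already contains the GitHub runner; the image name, labels and runner group are placeholders.

```python
# Illustrative host loop: register a just-in-time (one-shot) runner, run it
# inside a container with the NVIDIA runtime, then start over.
import subprocess
import time

import requests

# REST endpoint for just-in-time runner configuration on the target repository
GITHUB_API = "https://api.github.com/repos/ggerganov/llama.cpp/actions/runners/generate-jitconfig"
RUNNER_IMAGE = "ghcr.io/example/llama-bench-runner:latest"  # placeholder image


def create_jit_config(token: str) -> str:
    """Ask GitHub for a one-shot runner configuration carrying the GPU label."""
    resp = requests.post(
        GITHUB_API,
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        json={"name": "t4-bench-runner",
              "runner_group_id": 1,                              # placeholder group
              "labels": ["self-hosted", "Standard_NC4as_T4_v3"]},  # placeholder labels
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["encoded_jit_config"]


def run_once(token: str) -> None:
    jit_config = create_jit_config(token)
    # the runner exits after a single job because the configuration is just-in-time;
    # the docker daemon is assumed to run rootless with the NVIDIA runtime available
    subprocess.run(
        ["docker", "run", "--rm", "--runtime=nvidia", "--gpus", "all",
         RUNNER_IMAGE, "./run.sh", "--jitconfig", jit_config],
        check=True,
    )


def main(token: str) -> None:
    while True:
        # optionally: poll the Actions API first and only register a runner
        # when a queued job carries the matching GPU label
        run_once(token)
        time.sleep(5)  # small pause before registering the next one-shot runner
```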

As the GitHub checks can only be run by collaborators and the job runs in a non-root docker container, I think we are safe.

Server scenario parameters matrix

| scenario | duration | users | hf-repo | hf-file | model-alias | model-size | model-type | ngl | parallel | ctx-size | batch-size | ubatch-size | n-predict | grp-attn-n | grp-attn-w | embeddings | CUDA_VISIBLE_DEVICES | SERVER_BENCH_N_PROMPTS | SERVER_BENCH_MAX_PROMPT_TOKENS | SERVER_BENCH_MAX_CONTEXT |
|----------|----------|-------|---------|---------|-------------|------------|------------|-----|----------|----------|------------|-------------|-----------|------------|------------|------------|----------------------|------------------------|--------------------------------|--------------------------|
| completions | 10m | 8 | TODO | TODO | phi2 | 3B | F16 | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| completions | 10m | 8 | ggml-org/models | phi-2/ggml-model-q4_0.gguf | phi2 | 3B | MOSTLY_Q4_K_M | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| embeddings | 5m | 8 | ggml-org/models | bert-bge-large/ggml-model-f16.gguf | bert-bge-large | ? | F16 | TODO | 8 | 16384 | 4096 | 4096 | NA | NA | NA | true | 0 | 1000 | 4096 | NA |

In addition, the following parameters will be used (a sketch of the resulting command line follows the list):

  • --log-disable: no need to have a log file
  • --metrics: to allow prometheus metrics scraping
  • --cont-batching: probably needs to be enabled by default (server: enable --cont-batching by default #6229)
  • --threads 1: we will test only with all layers offloaded to the GPU
  • --threads-batch 1: we will test only with all layers offloaded to the GPU
  • --model ggml-model.gguf: as we can now download anything from HF
  • --defrag-thold 0.1
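
As an illustration, a sketch of how the fixed flags above could be combined with one row of the parameter matrix into the server command line; the flag-to-column mapping, the dict keys and the model path are assumptions, not the final bench script.

```python
# Sketch: compose the server command line from one row of the parameter
# matrix plus the fixed flags listed above (paths and keys are illustrative).
def server_args(row: dict) -> list[str]:
    return [
        "--model", "ggml-model.gguf",            # downloaded from the hf-repo/hf-file columns
        "--n-gpu-layers", str(row["ngl"]),
        "--parallel", str(row["parallel"]),
        "--ctx-size", str(row["ctx_size"]),
        "--batch-size", str(row["batch_size"]),
        "--ubatch-size", str(row["ubatch_size"]),
        "--threads", "1",
        "--threads-batch", "1",
        "--defrag-thold", "0.1",
        "--metrics",                             # expose /metrics for prometheus scraping
        "--cont-batching",
        "--log-disable",
    ] + (["--embeddings"] if row["embeddings"] else [])


# example: the Q4_K_M phi-2 completions row from the matrix above
phi2_q4 = {"ngl": 33, "parallel": 8, "ctx_size": 16384, "batch_size": 2048,
           "ubatch_size": 256, "embeddings": False}
print(server_args(phi2_q4))
```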

Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.
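
A minimal sketch of how one streamed request could be timed against the OAI-compatible chat completions endpoint, assuming the server listens on port 8080; treating each streamed chunk as one predicted token is an approximation, and the model alias and prompt handling are illustrative.

```python
# Sketch: time a streaming chat completion and derive per-request metrics
# (time to first token ~ prompt processing, chunks per second ~ tg speed).
import json
import time

import requests


def timed_chat_completion(prompt: str, base_url: str = "http://localhost:8080") -> dict:
    start = time.time()
    first_token_at = None
    n_chunks = 0
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": "phi2", "stream": True, "max_tokens": 512,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True,
        timeout=300,
    )
    for line in resp.iter_lines():
        # server-sent events: skip empty keep-alive lines and the final [DONE] marker
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        json.loads(line[len(b"data: "):])  # one streamed delta chunk
        if first_token_at is None:
            first_token_at = time.time()
        n_chunks += 1
    end = time.time()
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "predicted_per_second": n_chunks / max(end - (first_token_at or start), 1e-9),
    }
```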

Dataset considerations

  1. the dataset must contain system, assistant and user prompts (in order to test chat template overhead, if any)
  2. prompts must not be selected at random; running the test twice must output almost the same metrics
  3. it must be possible to select prompts so that they fit in the KV cache (or not), using the parameters listed
    in bench/README.md (a selection sketch follows this list):
    • SERVER_BENCH_N_PROMPTS: total number of prompts to select for the benchmark
    • SERVER_BENCH_MAX_PROMPT_TOKENS: maximum prompt tokens; longer prompts are filtered out of the dataset
    • SERVER_BENCH_MAX_CONTEXT: maximum context size of the completion request (prompt + predicted tokens); larger
      requests are filtered out of the dataset
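
An illustrative sketch of a deterministic prompt selection honoring the three variables above; the ShareGPT-style field names, the whitespace-based token count and the default values are assumptions (the real bench should use the model tokenizer).

```python
# Sketch: deterministic prompt selection driven by the SERVER_BENCH_* variables.
import json
import os


def select_prompts(dataset_path: str) -> list[dict]:
    n_prompts = int(os.environ.get("SERVER_BENCH_N_PROMPTS", "1000"))
    max_prompt_tokens = int(os.environ.get("SERVER_BENCH_MAX_PROMPT_TOKENS", "1024"))
    max_context = int(os.environ.get("SERVER_BENCH_MAX_CONTEXT", "1024"))
    n_predict = 512  # must match the scenario's n-predict column

    with open(dataset_path) as f:
        dataset = json.load(f)

    selected = []
    for conv in dataset:  # iteration follows the file order: no randomness
        prompt = conv["conversations"][0]["value"]  # assumed ShareGPT-style fields
        n_tokens = len(prompt.split())              # rough approximation of the token count
        if n_tokens <= max_prompt_tokens and n_tokens + n_predict <= max_context:
            selected.append({"prompt": prompt})
        if len(selected) == n_prompts:
            break
    return selected
```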

Selected datasets:

| scenario    | dataset                    | comment                                         |
|-------------|----------------------------|-------------------------------------------------|
| completions | ShareGPT_Vicuna_unfiltered | taken from vLLM to have a baseline              |
| embeddings  | IMDB Data                  | suggested by @ngxson, looks good for embeddings |
