docker run -p 8080:8080 --gpus all vllm/vllm-openai \
  --model Nitral-AI/Captain-Eris_Violet-V0.420-12B \
  --max-model-len 10000 --swap-space 4 --dtype auto \
  --enable-chunked-prefill --disable-log-requests --enable-prefix-caching \
  --port 8080 --root-path /api \
  --served-model-name Nitral-AI/Captain-Eris_Violet-V0.420-12B \
  --max-num-seqs 72 --quantization fp8 \
  --max-num-batched-tokens 1024 --kv-cache-dtype fp8
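Once the container is up, you can sanity-check the OpenAI-compatible endpoint before running any benchmark. A minimal sketch using the openai Python client (the base URL follows from the --root-path /api and --port 8080 flags above; adjust it if you change them):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above; vLLM accepts any
# API key string when no key is configured.
client = OpenAI(base_url="http://localhost:8080/api/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Nitral-AI/Captain-Eris_Violet-V0.420-12B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```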
This benchmark gives you the option to use a fixed source prompt at the start of each request and append random text after it, which controls the prompt (prefix) cache hit percentage (a rough sketch of the idea follows the example run below).
git clone https://github.com/FlowGPT/llm-inference-benchmarking
cd llm-inference-benchmarking
pip install -r requirements.txt
python run.py --rounds 1 -q 0.5 --api-base http://localhost:8080/api/v1 --model Nitral-AI/Captain-Eris_Violet-V0.420-12B --max-tokens=250 --prompt-file prompt-1k.txt --random-tokens 3000 --use-chat
This benchmark runs at 0.5 RPS against the 12B model above with an input of about 4.5k tokens and an output of 250 tokens, at a prefix cache rate of roughly 20%.
source: https://github.com/leptonai/leptonai/blob/main/misc/benchmark/run.py
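As a rough illustration of how --prompt-file plus --random-tokens controls the cache rate (a hypothetical sketch, not the actual run.py implementation): every request reuses the same leading prompt, which vLLM can serve from its prefix cache, and appends a freshly generated random suffix that has to be recomputed.

```python
import random
import string

def build_prompt(shared_prefix: str, random_tokens: int) -> str:
    # The shared prefix is identical across requests, so vLLM's prefix cache
    # can reuse its KV cache; the random suffix forces fresh prefill work.
    # Rough approximation: one short "word" per token; the real script would
    # count tokens with the model's tokenizer.
    suffix = " ".join(
        "".join(random.choices(string.ascii_lowercase, k=5))
        for _ in range(random_tokens)
    )
    return shared_prefix + "\n" + suffix

shared = open("prompt-1k.txt").read()   # the fixed source prompt
prompt = build_prompt(shared, random_tokens=3000)
```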
You can also replay online data to reproduce production traffic patterns.
The online_replay.py script allows you to replay requests from a log file in two different modes:
- Timestamp-based replay (maintains original request timing):
python online_replay.py --input replay-logs-origin.log --replay-mode timestamp --sample-range 0.0 0.1 \
--api-base http://localhost:8080/api/v1 --model Nitral-AI/Captain-Eris_Violet-V0.420-12B --round-duration 60
- QPS-based replay (controls request rate):
python online_replay.py --input replay-logs-origin.log --replay-mode qps --target-qps 5 --sample-range 0.0 0.1 \
--api-base http://localhost:8080/api/v1 --model Nitral-AI/Captain-Eris_Violet-V0.420-12B --round-duration 60
How to choose between these two modes?
In multi-instance scenarios, the timestamp mode is recommended. In single-instance scenarios, the qps mode is sufficient. This is because in production environments, requests are routed among multiple instances, making the request arrival pattern appear uniform for each individual instance.
Despite similar final QPS, why does timestamp mode show higher latency?
This is caused by non-uniform request arrival patterns. According to queueing theory, when requests arrive non-uniformly (e.g., a Poisson process, whose inter-arrival times are highly variable), bursts of requests lead to temporary queue buildup, significantly increasing the average queuing time.
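A quick way to see the effect is to feed a single server the same average rate with uniform versus Poisson (exponential inter-arrival) traffic: the bursty trace produces far higher mean waiting time even though the throughput is identical. A minimal simulation sketch (illustrative only, not part of the repository):

```python
import random

def mean_wait(inter_arrivals, service_time=0.8):
    """Single-server FIFO queue: return the mean time spent waiting in queue."""
    t, server_free, total_wait = 0.0, 0.0, 0.0
    for gap in inter_arrivals:
        t += gap                              # arrival time of this request
        start = max(t, server_free)           # wait until the server is free
        total_wait += start - t
        server_free = start + service_time    # fixed service time per request
    return total_wait / len(inter_arrivals)

random.seed(0)
n, rate = 10_000, 1.0                          # 1 request/second on average
uniform = [1.0 / rate] * n                     # perfectly even arrivals
poisson = [random.expovariate(rate) for _ in range(n)]  # bursty arrivals

print("uniform :", mean_wait(uniform))         # ~0 queueing delay
print("poisson :", mean_wait(poisson))         # substantially higher
```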
What is the detailed logging feature?
The --detailed-logs parameter enables real-time tracking of each request's performance metrics. Each request is assigned a unique ID, and detailed information including send time, TTFT, completion time, token counts, and processing times is recorded in a CSV file. This data is written in real time to a log/detailed_results_[timestamp].csv file, allowing for detailed analysis of request performance patterns.
Key parameters:
- --input: Input log file path
- --replay-mode: Replay mode (timestamp/qps)
- --sample-range: Sampling range [START, END) to control the percentage of requests to send (e.g., 0.0 0.2)
- --round-duration: Performance statistics collection period (seconds)
- --max-rounds: Maximum number of rounds to run
- --api-base: API service endpoint
- --model: Model name
- --max-token: Maximum token output of the model
- --use-chat: Whether to use the chat interface
- --json-output: Output performance metrics in JSON format
- --verbose: Enable detailed logging output (default: False, only show statistics)
- --detailed-logs: Enable detailed per-request logging with unique IDs; takes a directory path where the CSV file is saved
- --e2e-slo: End-to-end latency SLO target in seconds (float). Example: --e2e-slo 5.0. When set, the report includes "E2E SLO Attainment", the percentage of total requests that are successful and have latency ≤ the SLO (see the sketch after this list).
- --ttft-slo: TTFT SLO in milliseconds (int). When set, the report includes "TTFT SLO Attainment", the percentage of total requests that are successful and have TTFT ≤ the threshold.
- --tpot-slo: TPOT SLO in milliseconds (int). When set, the report includes "TPOT SLO Attainment", the percentage of requests whose time per output token (ms/token) is ≤ the threshold.
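For reference, the attainment numbers are easy to reproduce from per-request results. A hypothetical sketch of the computation described above (the result field names are assumptions, not the script's internals):

```python
def slo_attainment(results, e2e_slo_s=None, ttft_slo_ms=None):
    """results: list of dicts with 'success', 'latency_s', 'ttft_s' per request.
    Attainment = % of *all* requests that succeeded and met the threshold."""
    total = len(results)
    report = {}
    if e2e_slo_s is not None:
        ok = sum(1 for r in results if r["success"] and r["latency_s"] <= e2e_slo_s)
        report["e2e_slo_attainment"] = 100.0 * ok / total
    if ttft_slo_ms is not None:
        ok = sum(1 for r in results if r["success"] and r["ttft_s"] * 1000 <= ttft_slo_ms)
        report["ttft_slo_attainment"] = 100.0 * ok / total
    return report

# Example: 3 requests, with --e2e-slo 5.0 and --ttft-slo 500
sample = [
    {"success": True,  "latency_s": 3.2, "ttft_s": 0.31},
    {"success": True,  "latency_s": 6.8, "ttft_s": 0.62},
    {"success": False, "latency_s": 9.9, "ttft_s": 1.50},
]
print(slo_attainment(sample, e2e_slo_s=5.0, ttft_slo_ms=500))
# both attainments are 1/3 of total requests, i.e. ~33.3%
```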
Performance metrics:
- Latency statistics
- Throughput
- TTFT (Time To First Token)
- TPOT (Time Per Output Token)
- Input/Output Tokens per Minute
- Success Rate
- E2E SLO Attainment (if --e2e-slo is provided)
- TTFT SLO Attainment (if --ttft-slo is provided)
- TPOT SLO Attainment (if --tpot-slo is provided)
Notes on display:
- When any SLO thresholds are set, a second line is appended to the table title to summarize the SLO attainment values (E2E/TTFT/TPOT) for quick inspection.
If you want to use online_replay.py to replay at QPS > 10, it is better to split the load across multiple terminals and run separate processes. By modifying the --sample-range parameter, you can assign a different sampling range to each process. This approach helps avoid client-side bottlenecks caused by high concurrency. You can refer to run_client_split.sh for implementation details.
For example, to achieve a total QPS of 20, you can:
- Run the first process with --target-qps 10 --sample-range 0.0 0.5 in one terminal
- Run the second process with --target-qps 10 --sample-range 0.5 1.0 in another terminal
This distributed approach ensures better stability and more accurate benchmarking results.
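If you prefer to automate the split rather than opening terminals by hand, a small launcher can compute disjoint sample ranges and spawn one online_replay.py process per slice. A hypothetical sketch in the spirit of run_client_split.sh (not a copy of that script):

```python
import subprocess

def launch_split(num_procs: int, total_qps: float, extra_args: list[str]):
    """Split [0, 1) into equal --sample-range slices and launch one replay
    process per slice, each running at total_qps / num_procs."""
    per_proc_qps = total_qps / num_procs
    procs = []
    for i in range(num_procs):
        start, end = i / num_procs, (i + 1) / num_procs
        cmd = [
            "python", "online_replay.py",
            "--replay-mode", "qps",
            "--target-qps", str(per_proc_qps),
            "--sample-range", f"{start:.2f}", f"{end:.2f}",
            *extra_args,
        ]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()

launch_split(
    num_procs=2,
    total_qps=20,
    extra_args=[
        "--input", "replay-logs-origin.log",
        "--api-base", "http://localhost:8080/api/v1",
        "--model", "Nitral-AI/Captain-Eris_Violet-V0.420-12B",
    ],
)
```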
To stop the processes, open a new terminal and run pkill -f "online_replay.py".
When using the --detailed-logs parameter, the script generates a CSV file with the following columns:
- request_id: Unique identifier for each request (UUID)
- conversation_id: Original conversation ID from the log file
- send_time: Timestamp when the request was sent
- ttft_time: Timestamp when the first token was received
- total_time: Timestamp when the response was completed
- tokens_in: Number of input tokens
- tokens_out: Number of output tokens
- ttft: Time to first token (seconds)
- tpot: Time per output token (seconds)
JSON output extra fields when SLO flags are provided:
- ttft_slo_ms, ttft_slo_attainment
- tpot_slo_ms, tpot_slo_attainment
This data can be analyzed using any CSV-compatible tool such as pandas, Excel, or data visualization software to identify performance patterns, bottlenecks, or unusual behavior in your LLM serving system.
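For example, a few lines of pandas are enough to get percentile views of TTFT and TPOT from a detailed log (a minimal sketch; the file name below is an assumed example following the log/detailed_results_[timestamp].csv pattern):

```python
import pandas as pd

# Load one detailed-results file produced with --detailed-logs.
df = pd.read_csv("log/detailed_results_20240101_120000.csv")

# Percentile view of per-request latencies (seconds).
print(df[["ttft", "tpot"]].describe(percentiles=[0.5, 0.9, 0.99]))

# Requests breaching an example 500 ms TTFT target.
slow = df[df["ttft"] > 0.5]
print(f"{len(slow)} / {len(df)} requests exceeded 500 ms TTFT")
```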