llmnop is a command-line tool for benchmarking the performance of Large Language Model (LLM) inference endpoints that are compatible with the OpenAI API. It measures key performance metrics such as time to first token (TTFT), inter-token latency, and overall throughput under concurrent load.
- Realistic Workload Simulation: Generates prompts with variable input and output token lengths sampled from a normal distribution.
- Concurrent Benchmarking: Sends multiple requests in parallel to simulate real-world load.
- Detailed Performance Metrics (see the sketch after this list):
  - Time To First Token (TTFT)
  - Inter-Token Latency (average time between subsequent tokens)
  - Throughput (tokens/second)
  - End-to-End Request Latency
- Detailed JSON Output: Saves detailed per-request data and a final summary report.
- Tokenizer-Aware: Uses Hugging Face `tokenizers` to count tokens for prompt generation and metric calculation.
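
To make the workload-simulation and metrics bullets concrete, here is a minimal sketch of normal-distribution length sampling and of how TTFT, inter-token latency, and throughput can be derived from per-token arrival timestamps. This is an illustration only, not llmnop's actual implementation; the helper names and the use of the `rand`/`rand_distr` crates are assumptions:

```rust
use std::time::{Duration, Instant};

use rand_distr::{Distribution, Normal};

// Sample a token length from N(mean, stddev), clamped to at least 1 token.
// (Hypothetical helper; llmnop's real sampling logic may differ.)
fn sample_token_len(mean: f64, stddev: f64) -> usize {
    let normal = Normal::new(mean, stddev).expect("stddev must be positive");
    normal.sample(&mut rand::thread_rng()).round().max(1.0) as usize
}

// Derive per-request metrics from the request start time and the
// arrival instant of each streamed token.
fn request_metrics(start: Instant, arrivals: &[Instant]) -> (Duration, Duration, f64) {
    let ttft = arrivals[0] - start;              // Time To First Token
    let e2e = *arrivals.last().unwrap() - start; // end-to-end request latency
    // Average gap between consecutive tokens.
    let inter_token = if arrivals.len() > 1 {
        (e2e - ttft) / (arrivals.len() as u32 - 1)
    } else {
        Duration::ZERO
    };
    let throughput = arrivals.len() as f64 / e2e.as_secs_f64(); // tokens/second
    (ttft, inter_token, throughput)
}

fn main() {
    let len = sample_token_len(550.0, 150.0);
    println!("sampled prompt length: {len} tokens");
}
```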
- Quickstart (recommended)

  ```sh
  curl -sSfL https://github.com/jpreagan/llmnop/releases/latest/download/llmnop-installer.sh | sh
  ```

  The installer places `llmnop` in `$XDG_BIN_HOME` or `~/.local/bin`. Ensure that directory is on your `PATH` before running `llmnop`.

- Homebrew

  ```sh
  brew install jpreagan/tap/llmnop
  ```

- Build from source

  ```sh
  git clone https://github.com/jpreagan/llmnop.git
  cd llmnop
  cargo build --release
  ```

  The compiled binary is written to `target/release/llmnop`.
```
llmnop [OPTIONS] --model <MODEL>
```

Options:

```
  -m, --model <MODEL>
      --tokenizer <TOKENIZER>
      --max-num-completed-requests <MAX_NUM_COMPLETED_REQUESTS>  [default: 2]
      --num-concurrent-requests <NUM_CONCURRENT_REQUESTS>        [default: 1]
      --mean-input-tokens <MEAN_INPUT_TOKENS>                    [default: 550]
      --stddev-input-tokens <STDDEV_INPUT_TOKENS>                [default: 150]
      --mean-output-tokens <MEAN_OUTPUT_TOKENS>                  [default: 150]
      --stddev-output-tokens <STDDEV_OUTPUT_TOKENS>              [default: 10]
      --results-dir <RESULTS_DIR>                                [default: result_outputs]
      --timeout <TIMEOUT>                                        [default: 600]
      --no-progress     Disable the progress bar (useful for non-interactive environments)
  -h, --help            Print help
  -V, --version         Print version
```
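
For intuition about how `--num-concurrent-requests` and `--max-num-completed-requests` interact, here is a minimal sketch of the usual bounded-concurrency pattern. It is not llmnop's actual code; the `tokio` dependency (with the `rt-multi-thread` and `macros` features) and the commented-out `send_request` step are assumptions:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let num_concurrent_requests = 4;     // --num-concurrent-requests
    let max_num_completed_requests = 16; // --max-num-completed-requests

    // At most `num_concurrent_requests` permits exist, so at most that
    // many benchmark requests are ever in flight at the same time.
    let permits = Arc::new(Semaphore::new(num_concurrent_requests));
    let mut handles = Vec::new();

    for id in 0..max_num_completed_requests {
        let permit = permits.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // A real benchmark would stream one completion here,
            // e.g. send_request(id).await (hypothetical).
            let _ = id;
            drop(permit); // free the slot for the next queued request
        }));
    }

    // Wait until all `max_num_completed_requests` requests have finished.
    for handle in handles {
        let _ = handle.await;
    }
}
```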
By default, llmnop uses the model name as the tokenizer for token counting.
Use --tokenizer when the served model name doesn't match a Hugging Face tokenizer name, or when you want a different tokenizer for counting.
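
For reference, counting tokens with a named Hugging Face tokenizer looks roughly like this. This is a sketch under assumptions, not llmnop's code; it presumes the `tokenizers` crate with its Hub-download (`http`) feature enabled:

```rust
use tokenizers::Tokenizer;

// Load a tokenizer by its Hugging Face Hub name and count the tokens
// in a piece of text, as needed for prompt generation and metrics.
fn count_tokens(tokenizer_name: &str, text: &str) -> usize {
    let tokenizer =
        Tokenizer::from_pretrained(tokenizer_name, None).expect("failed to load tokenizer");
    let encoding = tokenizer.encode(text, false).expect("failed to encode text");
    encoding.get_ids().len()
}
```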
Examples:

```sh
# Served name differs from HF tokenizer name
llmnop --model gpt-oss:20b --tokenizer openai/gpt-oss-20b

# Force a common tokenizer for consistent counting
llmnop --model openai/gpt-oss-20b --tokenizer hf-internal-testing/llama-tokenizer
```

Point llmnop at an OpenAI-compatible endpoint, then run a benchmark:

```sh
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=token-abc123
llmnop \
--model "Qwen/Qwen3-4B" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"
```

This project is licensed under the Apache License 2.0.
