This repository contains AI/LLM benchmarks and benchmarking data compiled by Jeff Geerling, using a combination of Ollama and llama.cpp.
Benchmarking AI models can be a bit daunting, because you have to deal with hardware issues, OS issues, driver issues, stability issues... and that's all before deciding on:
- What models to benchmark (which quantization, what particular gguf, etc.?)
- How to benchmark the models (what context size, with or without features like flash attention, etc.?)
- What results to worry about (prompt processing speed, generated tokens per second, etc.?)
Most of the time I rely on llama.cpp, as it is more broadly compatible, works with more models on more systems, and picks up hardware-acceleration features more quickly than Ollama. For example, llama.cpp supported Vulkan for years before Ollama did. Vulkan enables many AMD and Intel GPUs (as well as other Vulkan-compatible iGPUs) to work for LLM inference.
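If you want to build llama.cpp with Vulkan support yourself, the steps are roughly the following (a sketch assuming the Vulkan driver/SDK and CMake are already installed; see llama.cpp's build documentation for your platform):

```
# Clone llama.cpp and build it with the Vulkan backend enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```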
Right now I don't have a particular script to assist with my llama.cpp benchmarks. I just pull a model manually, then use llama.cpp's built-in llama-bench utility:
```
# Download a model (gguf)
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && cd ..

# Run a benchmark
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
```
You can change various llama-bench options to test different prompt and context sizes, enable or disable features like mmap and Flash Attention, etc.
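For instance, a run that enables Flash Attention and disables mmap might look like the following (exact flag names and accepted values vary by llama.cpp version, so check `llama-bench --help` on your build):

```
# Same model, with Flash Attention on and mmap off,
# printing the results as a markdown table.
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p 512,4096 -n 128 -ngl 99 -r 2 \
  -fa 1 -mmp 0 -o md
```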
I generally start with Llama 3.2:3B Q4_K_M because it's a small model (only 2GB), and it doesn't crash even on smaller systems like SBCs.
The first and simplest benchmark I often run (at least on systems where Ollama is supported and runs well) is my obench.sh script. It can run a predefined benchmark on Ollama one or more times and generate an average score.
For a quick installation of Ollama, try:
```
curl -fsSL https://ollama.com/install.sh | sh
```
If you're not running Linux, download Ollama from the official site.
Verify you can run `ollama` with a given model:

```
ollama run llama3.2:3b
```
Then run this benchmark script:
```
./obench.sh
```
Uninstall Ollama following the official uninstall instructions.
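For reference, the uninstall steps for a Linux systemd install look roughly like the following (adapted from Ollama's Linux documentation; paths and service names may differ on your system, so prefer the official instructions):

```
# Stop and disable the Ollama service, then remove the binary,
# downloaded models, and the user/group created by the installer.
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
```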
For the benchmarks I save in this project, I usually run the following benchmark command, which generates an average from three runs and prints it in markdown:
```
./obench.sh -m llama3.2:3b -c 3 --markdown
```
```
Usage: ./obench.sh [OPTIONS]

Options:
  -h, --help      Display this help message
  -d, --default   Run a benchmark using some default small models
  -m, --model     Specify a model to use
  -c, --count     Number of times to run the benchmark
  --ollama-bin    Point to ollama executable or command (e.g. if using Docker)
  --markdown      Format output as markdown
```
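As one example, if Ollama runs inside a Docker container (assuming a container named `ollama`; adjust for your setup), you can point the script at it with `--ollama-bin`:

```
# Run the same three-run benchmark against an Ollama instance
# running in a Docker container named 'ollama'.
./obench.sh --ollama-bin "docker exec ollama ollama" -m llama3.2:3b -c 3 --markdown
```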
System | CPU/GPU | Eval Rate | Power (Peak) |
---|---|---|---|
Pi 5 - 16GB | CPU | 1.20 Tokens/s | 13.0 W |
Pi 5 - 16GB (AMD Pro W7700¹) | GPU | 19.90 Tokens/s | 164 W |
GMKtek G3 Plus (Intel N150) - 16GB | CPU | 2.13 Tokens/s | 30.3 W |
Radxa Orion O6 - 16GB | CPU | 4.33 Tokens/s | 34.7 W |
Radxa Orion O6 - 16GB (Nvidia RTX 3080 Ti) | GPU | 64.58 Tokens/s | 465 W |
M1 Ultra (48 GPU Core) 64GB | GPU | 35.89 Tokens/s | N/A |
Framework Mainboard (128GB) | CPU | 11.37 Tokens/s | 140 W |
System | CPU/GPU | Eval Rate | Power (Peak) |
---|---|---|---|
AmpereOne A192-32X - 512GB | CPU | 4.18 Tokens/s | 477 W |
System | CPU/GPU | Eval Rate | Power (Peak) |
---|---|---|---|
Pi 400 - 4GB | CPU | 1.60 Tokens/s | 6 W |
Pi 5 - 8GB | CPU | 4.61 Tokens/s | 13.9 W |
Pi 5 - 16GB | CPU | 4.88 Tokens/s | 11.9 W |
Pi 500+ - 16GB | CPU | 5.55 Tokens/s | 13 W |
GMKtec G3 Plus (Intel N150) - 16GB | CPU | 9.06 Tokens/s | 26.4 W |
Pi 5 - 8GB (AMD RX 6500 XT¹) | GPU | 39.82 Tokens/s | 88 W |
Pi 5 - 8GB (AMD RX 6700 XT¹) 12GB | GPU | 49.01 Tokens/s | 94 W |
Pi 5 - 8GB (AMD RX 7600¹) | GPU | 48.47 Tokens/s | 156 W |
Pi 5 - 8GB (AMD Pro W7700¹) | GPU | 56.14 Tokens/s | 145 W |
Pi 500+ - 16GB (Intel Arc Pro B50¹) | GPU | 29.80 Tokens/s | 78.5 W |
Pi 500+ - 16GB (Intel Arc B580¹) | GPU | 47.38 Tokens/s | 146 W |
Pi 500+ - 16GB (AMD RX 7900 XT¹) | GPU | 108.58 Tokens/s | 315 W |
Pi 500+ - 16GB (AMD RX 9070 XT¹) | GPU | 89.63 Tokens/s | 304 W |
M4 Mac mini (10 core - 32GB) | GPU | 41.31 Tokens/s | 30.1 W |
M1 Max Mac Studio (10 core - 64GB) | GPU | 59.38 Tokens/s | N/A |
M1 Ultra (48 GPU Core) 64GB | GPU | 108.67 Tokens/s | N/A |
HiFive Premier P550 (4-core RISC-V) | CPU | 0.24 Tokens/s | 13.5 W |
DC-ROMA Mainboard II (8-core RISC-V) | CPU | 0.31 Tokens/s | 30.6 W |
Ryzen 9 7900X (Nvidia 4090) | GPU | 237.05 Tokens/s | N/A |
Intel 13900K (Nvidia 5090) | GPU | 271.40 Tokens/s | N/A |
Intel 13900K (Nvidia 4090) | GPU | 216.48 Tokens/s | N/A |
Ryzen 9 9950X (AMD 7900 XT) | GPU | 131.2 Tokens/s | N/A |
Ryzen 9 7950X (Nvidia 4080) | GPU | 204.45 Tokens/s | N/A |
Ryzen 9 7950X (Nvidia 4070 Ti Super) | GPU | 198.95 Tokens/s | N/A |
Ryzen 9 5950X (Nvidia 4070) | GPU | 160.72 Tokens/s | N/A |
System76 Thelio Astra (Nvidia A400) | GPU | 35.51 Tokens/s | 167 W |
System76 Thelio Astra (Nvidia A4000) | GPU | 90.92 Tokens/s | 244 W |
System76 Thelio Astra (AMD Pro W7700¹) | GPU | 89.31 Tokens/s | 261 W |
AmpereOne A192-32X (512GB) | CPU | 23.52 Tokens/s | N/A |
Framework Mainboard (128GB) | GPU | 88.14 Tokens/s | 133 W |
System | CPU/GPU | Eval Rate | Power (Peak) |
---|---|---|---|
M1 Max Mac Studio (10 core - 64GB) | GPU | 7.25 Tokens/s | N/A |
Ryzen 9 7900X (Nvidia 4090) | GPU/CPU | 3.10 Tokens/s | N/A |
AmpereOne A192-32X (512GB) | CPU | 3.86 Tokens/s | N/A |
Framework Mainboard (128GB) | GPU | 4.47 Tokens/s | 139 W |
Raspberry Pi CM5 Cluster (10x 16GB) | CPU | 0.85 Tokens/s | 70 W |
¹ These GPUs were tested using llama.cpp with Vulkan support.
System | CPU/GPU | Eval Rate | Power (Peak) |
---|---|---|---|
AmpereOne A192-32X (512GB) | CPU | 0.90 Tokens/s | N/A |
Framework Mainboard Cluster (512GB) | GPU | 0.71 Tokens/s | N/A |
These benchmarks are in no way comprehensive, and I normally compare only one aspect of generative AI performance: inference tokens per second. There are many other aspects that are as important (or more important) that my benchmarking does not cover, though sometimes I get deeper into the weeds in individual issues.
See *All about Timing: A quick look at metrics for LLM serving* for a good overview of other metrics you may want to compare.
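For instance, if you want more than the averaged eval rate obench.sh reports, you can pull the raw timing fields from Ollama's generate API and compute rates yourself (a minimal sketch, assuming a local Ollama on the default port 11434 and the `eval_count`/`eval_duration` and `prompt_eval_*` fields that API returns; requires jq):

```
# Ask Ollama for a completion and compute prompt-processing and
# generation rates from the timing fields in the JSON response.
# Durations are reported in nanoseconds.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?", "stream": false}' \
  | jq '{
      prompt_tokens_per_s: (.prompt_eval_count / .prompt_eval_duration * 1e9),
      eval_tokens_per_s: (.eval_count / .eval_duration * 1e9)
    }'
```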
This benchmark was originally based on the upstream project tabletuser-blogspot/ollama-benchmark, and is maintained by Jeff Geerling.