# AI/LLM Benchmarks (Ollama and llama.cpp)


This repository contains AI/LLM benchmarks and benchmarking data compiled by Jeff Geerling, using a combination of Ollama and llama.cpp.

Benchmarking AI models can be a bit daunting, because you have to deal with hardware issues, OS issues, driver issues, stability issues... and that's all before deciding on:

  1. What models to benchmark (which quantization, what particular gguf, etc.)?
  2. How to benchmark the models (what context size, with or without features like flash attention, etc.)?
  3. What results to worry about (prompt processing speed, generated tokens per second, etc.)?

## Llama.cpp Benchmark

Most of the time I rely on llama.cpp, as it is more broadly compatible, works with more models on more systems, and picks up hardware-acceleration features more quickly than Ollama. For example, llama.cpp supported Vulkan for years before Ollama did; Vulkan lets many AMD and Intel GPUs (as well as other Vulkan-compatible iGPUs) be used for LLM inference.
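
As an illustration (a sketch, not part of this repository's scripts), building llama.cpp with the Vulkan backend is just a CMake flag away. The `GGML_VULKAN` flag applies to current llama.cpp versions, and the Vulkan SDK/headers are assumed to already be installed:

```sh
# Clone llama.cpp and build it with the Vulkan backend enabled.
# Assumes the Vulkan SDK/headers are already installed on the system.
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j "$(nproc)"
```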

Right now I don't have a particular script to assist with my llama.cpp benchmarks; I just pull a model manually, then use llama.cpp's built-in `llama-bench` utility:

```sh
# Download a model (gguf)
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && cd ..

# Run a benchmark
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
```

You can change various `llama-bench` options to test different prompt and context sizes, enable or disable features like memory mapping (mmap) and Flash Attention, and so on.
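
For example, a run with Flash Attention on and mmap off might look like this (a sketch; the `-fa` and `-mmp` flag names assume a recent `llama-bench` build, so check `llama-bench --help` on yours):

```sh
# Same model, with Flash Attention enabled and mmap disabled, sweeping two prompt sizes.
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p 512,2048 -n 128 -fa 1 -mmp 0 -ngl 99 -r 2
```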

I generally start with Llama 3.2:3B Q4_K_M because it's a small model (only 2GB), and it doesn't crash even on smaller systems like SBCs.

## Ollama Benchmark

The first and simplest benchmark I often run, at least on systems where Ollama is supported and runs well, is my `obench.sh` script. It can run a predefined benchmark on Ollama one or more times and generate an average score.

For a quick installation of Ollama, try:

```sh
curl -fsSL https://ollama.com/install.sh | sh
```

If you're not running Linux, download Ollama from the official site.

Verify you can run ollama with a given model:

```sh
ollama run llama3.2:3b
```

Then run this benchmark script:

```sh
./obench.sh
```

Uninstall Ollama following the official uninstall instructions.
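
For reference, on Linux the uninstall boils down to removing the systemd service, the binary, the downloaded models, and the `ollama` user the install script creates. This is a sketch based on the official docs; paths may differ on your system, so prefer the official instructions:

```sh
# Stop and remove the Ollama systemd service, binary, models, and user.
# Paths assume the standard install.sh layout on Linux.
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm "$(which ollama)"
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
```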

For the benchmarks I save in this project, I usually run the following benchmark command, which generates an average from three runs and prints it in markdown:

```sh
./obench.sh -m llama3.2:3b -c 3 --markdown
```

### Ollama benchmark CLI Options

```
Usage: ./obench.sh [OPTIONS]
Options:
  -h, --help      Display this help message
  -d, --default   Run a benchmark using some default small models
  -m, --model     Specify a model to use
  -c, --count     Number of times to run the benchmark
  --ollama-bin    Point to ollama executable or command (e.g. if using Docker)
  --markdown      Format output as markdown
```
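
For instance, if Ollama is running inside a container, you can point the script at it with `--ollama-bin`. The container name `ollama` below is just an example; substitute whatever your container is called:

```sh
# Benchmark an Ollama instance running inside a Docker container,
# averaging three runs and printing the result as markdown.
./obench.sh -m llama3.2:3b -c 3 --markdown \
  --ollama-bin "docker exec -i ollama ollama"
```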

## Findings

### DeepSeek R1 14b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Pi 5 - 16GB | CPU | 1.20 Tokens/s | 13.0 W |
| Pi 5 - 16GB (AMD Pro W7700¹) | GPU | 19.90 Tokens/s | 164 W |
| GMKtec G3 Plus (Intel N150) - 16GB | CPU | 2.13 Tokens/s | 30.3 W |
| Radxa Orion O6 - 16GB | CPU | 4.33 Tokens/s | 34.7 W |
| Radxa Orion O6 - 16GB (Nvidia RTX 3080 Ti) | GPU | 64.58 Tokens/s | 465 W |
| M1 Ultra (48 GPU Core) 64GB | GPU | 35.89 Tokens/s | N/A |
| Framework Mainboard (128GB) | CPU | 11.37 Tokens/s | 140 W |

### DeepSeek R1 671b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| AmpereOne A192-32X - 512GB | CPU | 4.18 Tokens/s | 477 W |

### Llama 3.2:3b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Pi 400 - 4GB | CPU | 1.60 Tokens/s | 6 W |
| Pi 5 - 8GB | CPU | 4.61 Tokens/s | 13.9 W |
| Pi 5 - 16GB | CPU | 4.88 Tokens/s | 11.9 W |
| Pi 500+ - 16GB | CPU | 5.55 Tokens/s | 13 W |
| GMKtec G3 Plus (Intel N150) - 16GB | CPU | 9.06 Tokens/s | 26.4 W |
| Pi 5 - 8GB (AMD RX 6500 XT¹) | GPU | 39.82 Tokens/s | 88 W |
| Pi 5 - 8GB (AMD RX 6700 XT¹) 12GB | GPU | 49.01 Tokens/s | 94 W |
| Pi 5 - 8GB (AMD RX 7600¹) | GPU | 48.47 Tokens/s | 156 W |
| Pi 5 - 8GB (AMD Pro W7700¹) | GPU | 56.14 Tokens/s | 145 W |
| Pi 500+ - 16GB (Intel Arc Pro B50¹) | GPU | 29.80 Tokens/s | 78.5 W |
| Pi 500+ - 16GB (Intel Arc B580¹) | GPU | 47.38 Tokens/s | 146 W |
| Pi 500+ - 16GB (AMD RX 7900 XT¹) | GPU | 108.58 Tokens/s | 315 W |
| Pi 500+ - 16GB (AMD RX 9070 XT¹) | GPU | 89.63 Tokens/s | 304 W |
| M4 Mac mini (10 core - 32GB) | GPU | 41.31 Tokens/s | 30.1 W |
| M1 Max Mac Studio (10 core - 64GB) | GPU | 59.38 Tokens/s | N/A |
| M1 Ultra (48 GPU Core) 64GB | GPU | 108.67 Tokens/s | N/A |
| HiFive Premier P550 (4-core RISC-V) | CPU | 0.24 Tokens/s | 13.5 W |
| DC-ROMA Mainboard II (8-core RISC-V) | CPU | 0.31 Tokens/s | 30.6 W |
| Ryzen 9 7900X (Nvidia 4090) | GPU | 237.05 Tokens/s | N/A |
| Intel 13900K (Nvidia 5090) | GPU | 271.40 Tokens/s | N/A |
| Intel 13900K (Nvidia 4090) | GPU | 216.48 Tokens/s | N/A |
| Ryzen 9 9950X (AMD 7900 XT) | GPU | 131.2 Tokens/s | N/A |
| Ryzen 9 7950X (Nvidia 4080) | GPU | 204.45 Tokens/s | N/A |
| Ryzen 9 7950X (Nvidia 4070 Ti Super) | GPU | 198.95 Tokens/s | N/A |
| Ryzen 9 5950X (Nvidia 4070) | GPU | 160.72 Tokens/s | N/A |
| System76 Thelio Astra (Nvidia A400) | GPU | 35.51 Tokens/s | 167 W |
| System76 Thelio Astra (Nvidia A4000) | GPU | 90.92 Tokens/s | 244 W |
| System76 Thelio Astra (AMD Pro W7700¹) | GPU | 89.31 Tokens/s | 261 W |
| AmpereOne A192-32X (512GB) | CPU | 23.52 Tokens/s | N/A |
| Framework Mainboard (128GB) | GPU | 88.14 Tokens/s | 133 W |

### Llama 3.1:70b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| M1 Max Mac Studio (10 core - 64GB) | GPU | 7.25 Tokens/s | N/A |
| Ryzen 9 7900X (Nvidia 4090) | GPU/CPU | 3.10 Tokens/s | N/A |
| AmpereOne A192-32X (512GB) | CPU | 3.86 Tokens/s | N/A |
| Framework Mainboard (128GB) | GPU | 4.47 Tokens/s | 139 W |
| Raspberry Pi CM5 Cluster (10x 16GB) | CPU | 0.85 Tokens/s | 70 W |

¹ These GPUs were tested using llama.cpp with Vulkan support.

### Llama 3.1:405b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| AmpereOne A192-32X (512GB) | CPU | 0.90 Tokens/s | N/A |
| Framework Mainboard Cluster (512GB) | GPU | 0.71 Tokens/s | N/A |

## Further Reading

These benchmarks are in no way comprehensive, and I normally only compare one aspect of generative AI performance: inference tokens per second. There are many other aspects that are just as important (or more so) that my benchmarking does not cover, though sometimes I get deeper into the weeds in individual issues.

See *All about Timing: A quick look at metrics for LLM serving* for a good overview of other metrics you may want to compare.

## Author

This benchmark was originally based on the upstream project tabletuser-blogspot/ollama-benchmark, and is maintained by Jeff Geerling.
