Closed
I am working on ollama/ollama#2458 and ran some benchmarks to test performance. I compiled at commit 3bdc4cd0. The build segfaults on master, as in #5469.
I used Mistral 7B int4 (Q4_0) on an M2 Air, an Intel 12400, and an Arc 770 16GB. I used llama-bench with the Mistral 7B model from here to measure prompt processing and text generation speed in tok/s. My llama-bench command is:
./build/bin/llama-bench -m models/mistral-7b-v0.1.Q4_0.gguf -p 128,256,512 -n 128,256,512
On M2 Air
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 128 | 144.47 ± 0.22 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 256 | 142.95 ± 1.17 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 512 | 141.36 ± 0.67 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 128 | 20.06 ± 0.66 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 256 | 20.26 ± 0.17 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 512 | 13.96 ± 1.62 |
On Intel 12400 (compiled with SYCL, but with the number of GPU layers (ngl) set to 0)
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 128 | 18.60 ± 3.07 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 256 | 20.82 ± 0.14 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 512 | 22.48 ± 0.16 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 128 | 10.78 ± 0.02 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 256 | 10.76 ± 0.02 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 512 | 10.69 ± 0.01 |
On Arc 770
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 128 | 407.14 ± 58.05 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 256 | 583.57 ± 78.24 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 512 | 757.99 ± 1.48 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 128 | 24.74 ± 0.27 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 256 | 24.65 ± 0.20 |
llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 512 | 21.46 ± 2.39 |
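To make the three tables easier to compare, here is a small sketch that computes per-test speedups between devices. The t/s values are the means copied from the tables above; the dictionary layout and the `speedup` helper are only illustrative names, not part of llama-bench.

```python
# Mean t/s values copied from the llama-bench tables above
# (pp = prompt processing, tg = text generation).
TESTS = ["pp 128", "pp 256", "pp 512", "tg 128", "tg 256", "tg 512"]

results = {
    "M2 Air (Metal)": dict(zip(TESTS, [144.47, 142.95, 141.36, 20.06, 20.26, 13.96])),
    "Intel 12400 (SYCL, ngl=0)": dict(zip(TESTS, [18.60, 20.82, 22.48, 10.78, 10.76, 10.69])),
    "Arc 770 (SYCL, ngl=99)": dict(zip(TESTS, [407.14, 583.57, 757.99, 24.74, 24.65, 21.46])),
}


def speedup(device: str, baseline: str, test: str) -> float:
    """Ratio of the device's throughput to the baseline's for one test."""
    return results[device][test] / results[baseline][test]


if __name__ == "__main__":
    arc = "Arc 770 (SYCL, ngl=99)"
    for baseline in ("M2 Air (Metal)", "Intel 12400 (SYCL, ngl=0)"):
        print(f"{arc} vs {baseline}")
        for test in TESTS:
            print(f"  {test}: {speedup(arc, baseline, test):.2f}x")
```

On these numbers the Arc 770 is roughly 5.4x the M2 Air at pp 512 but only about 1.2x at tg 128.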
The good news is that prompt processing speed is fairly high. The bad news is that text generation speed on Arc GPUs is very low.
Text generation is much slower than I expected, because the Arc 770 is significantly faster than both the M2 and the 12400 in raw compute and memory bandwidth. You can see the FLOPs and bandwidth benchmarks here: https://github.com/chsasank/device-benchmarks
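One rough way to quantify that expectation is a bandwidth-bound estimate: if generating each token requires reading all model weights once, then tok/s cannot exceed memory bandwidth divided by model size. The sketch below applies that to the 3.83 GiB model from the tables. The bandwidth figures are nominal spec-sheet values that I am assuming here (560 GB/s for the Arc 770 16GB, 100 GB/s for the M2); they are not the measured numbers from the device-benchmarks repo, and the function name is just illustrative.

```python
GIB = 1024 ** 3


def max_tg_tokens_per_s(bandwidth_gb_s: float, model_size_gib: float) -> float:
    """Upper bound on text generation tok/s, assuming every token
    reads all model weights from memory exactly once."""
    model_bytes = model_size_gib * GIB
    return bandwidth_gb_s * 1e9 / model_bytes


MODEL_SIZE_GIB = 3.83  # model size reported by llama-bench above

# ASSUMPTION: nominal spec-sheet bandwidths in GB/s, not measured values.
assumed_bandwidth_gb_s = {"Arc 770": 560.0, "M2 Air": 100.0}

# Measured tg 128 means copied from the tables above.
measured_tg128 = {"Arc 770": 24.74, "M2 Air": 20.06}

if __name__ == "__main__":
    for device, bandwidth in assumed_bandwidth_gb_s.items():
        bound = max_tg_tokens_per_s(bandwidth, MODEL_SIZE_GIB)
        fraction = measured_tg128[device] / bound
        print(f"{device}: bound {bound:.1f} tok/s, "
              f"measured {measured_tg128[device]:.2f} tok/s "
              f"({fraction:.0%} of bound)")
```

Under these assumed bandwidths, the M2 Air lands at roughly 80% of its bound while the Arc 770 reaches under 20%, which matches the impression that text generation on Arc is underperforming. This ignores KV cache reads and compute time, so treat it as a ceiling rather than a prediction.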