Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup:
build: 4da69d1 (4351)
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require an explicit Release build type.
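For reference, a minimal sketch of a release-mode configure on Linux (assuming a CMake build with the Vulkan backend; exact flags may differ depending on the tree you're on):
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release   # Release avoids the much slower debug binaries
cmake --build build -j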
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run every benchmark from a script and also build every target from scratch. Each run is wrapped like this, which pauses every other process for the duration of the benchmark:
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both GPU and CPU in real conditions.
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon PRO VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
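For anyone reproducing the layer vs. row split comparison, a hedged sketch of the llama-bench invocations (assuming the -sm/--split-mode flag and a model path of your choosing):
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer   # split whole layers across GPUs (default)
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row     # split tensors by rows; per the results above, not available on the Vulkan backend yet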
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).
-
I tried, but there was nothing after 1 hour... ok, maybe 40 minutes... Anyway, I ran llama_cli for a sample eval...
Meanwhile, OpenBLAS:
-
Integrated GPU of an Intel Core Ultra 7 165U, with the Intel GPU driver installed.
./llama.cpp/build-20250217-Vulkan/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
./llama.cpp/build-20250211-sycl/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
AMD external GPU (7600M XT) connected via Thunderbolt 4.0 (USB-C):
./llama.cpp/build-20250217-Vulkan/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
./llama.cpp/build-20250213-hip/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
For me, the Vulkan version is not my first choice, whether I use the external GPU or not.
-
For fun here's TinyLlama 1.1B on an Nvidia 920MX (2GB Maxwell chip with only 14.4 GB/s memory bandwidth 🤣). It's too old for our CUDA implementation but Vulkan manages to run fine.
-
On a Framework AMD 7840U with two 5600 MHz memory sticks:
ggml_vulkan: Found 1 Vulkan devices:
build: 70680c4 (4793)
-
There is a Phoronix article about @jeffbolznv's talk at Vulkanised 2025. Very interesting talk.
-
On a Radeon RX 6600:
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
build: 0fd7ca7 (1)
-
Another oddity: an AMD BC-250 mining blade, which is a cut-down PS5 APU. The 16GB of GDDR6 are left as they are on the PS5 as far as I can tell, but it has 6 CPU cores instead of 8, and 24 RDNA1 CUs instead of 36. It just recently got Mesa support and can now run the Vulkan llama.cpp backend with decent performance:
ggml_vulkan: Found 1 Vulkan devices:
build: cf2270e (4902)
build: cf2270e (4902)
-
ggml_vulkan: Found 1 Vulkan devices:
-
To get an idea of the achievable performance, I ran some more tests comparing backends. V1/V2 are a WIP test backend I created using HIP for the RDNA3 iGPU; it only computes the matmul (BF16 for now). The CPU results use BF16 too, while Vulkan uses FP16. Run on an up-to-date Fedora 41, on a Ryzen 9 7940HS (with Radeon 780M iGPU).
Llama-3.2-1B-Instruct/BF16.gguf
Llama-3.2-3B-Instruct
Meta-Llama-3.1-8B-Instruct
Mistral-Nemo-Instruct-2407
Mistral-Small-24B-Instruct-2501
As you can see, for now the Vulkan backend doesn't like big FP16 models (I need to make some OS changes to run the Mistral-Small bench on Vulkan...).
-
Radeon RX 9070 XT on Arch w/
build: d84635b (4920)
-
5700G, gfx90c, 8 CU, 2x32GB@3200
ggml_vulkan: Found 1 Vulkan devices:
build: d84635b (4920)
CPU results for reference:
build: d84635b (4920)
55% speedup for pp512 and lower power usage.
ROCm v5.7 results for reference:
build: 8ba95dc (4896)
-
Also since I have it around - a laptop, Ryzen 7 7730U w/ Vega 8 iGPU:
build: d84635b (4920)
-
5800H, 2x16GB@3200, STAPM limit 80, basically the laptop version of the 5700G. To skip the dGPU:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
The dGPU, an RTX 3060 laptop max-q 80W, 6GB VRAM, Driver Version: 550.120:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
18% below the desktop 3060 in pp512, and with flash attention it's somehow much slower [edit: later I realized that only the beta, not-yet-released driver v575 and upwards supports coopmat2, and I tested with v550, which is limited to KHR_coopmat]:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
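A quick way to check which cooperative matrix extensions a driver exposes (a sketch assuming vulkaninfo from vulkan-tools is installed):
vulkaninfo | grep -i cooperative_matrix   # look for VK_KHR_cooperative_matrix and, on new NVIDIA drivers, VK_NV_cooperative_matrix2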
-
AMD Ryzen 5 5600H
ggml_vulkan: Found 1 Vulkan devices:
build: 0bb2919 (4991)
-
All tested on Windows 11, A770 LE 16G, driver 32.0.101.6653.
.\source\repos\llama-cpp-vulkan> .\llama-bench.exe -m .\llama-2-7b.Q4_0.gguf -ngl 100
build: a8a1f33 (5010)
.\source\repos\llama-cpp-ipx> .\llama-bench.exe -m ..\llama-cpp-vulkan\llama-2-7b.Q4_0.gguf -ngl 100
build: 4cfa0b8 (1)
-
Cross-posted from the Mac thread: Mac Pro 2013 🗑️, 12-core Xeon E5-2697 v2, Dual FirePro D700, 64 GB RAM, macOS Monterey
ggml_vulkan: Found 2 Vulkan devices:
% ./build/bin/llama-bench -m ../llm-models/Llama-2-7b-chat-f16.gguf -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null
build: 3f9da22 (5036)
That 16-bit tg128 test was painful so I won't run it again here, but here are the 8- and 4-bit runs on CPU alone (using the -ngl 0 flag):
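As a sketch, the matching CPU-only invocation just sets -ngl 0 so no layers are offloaded (same model paths as above):
% ./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null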
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same.
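For anyone building from source, a minimal sketch of the usual steps (assuming a recent tree where the Vulkan backend is toggled with GGML_VULKAN; adjust the model path to wherever you saved the Q4_0 file):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release   # enable the Vulkan backend, release build
cmake --build build --config Release -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf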
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)
Currently FA only works properly with coopmat2.
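As a sketch, the two scoreboard configurations only differ in the flash attention flag of llama-bench (assuming the -fa option; 0 is the default):
llama-bench -m llama-2-7b.Q4_0.gguf -fa 0   # scoreboard without FA
llama-bench -m llama-2-7b.Q4_0.gguf -fa 1   # scoreboard with FA; per the note above, needs coopmat2 to perform properly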