Description
What happened?
I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit very slowly (token generation is even slower than with a single CPU thread).
However, it only works when at least all repeating layers (every layer except the last, non-repeating one) are offloaded to the GPU. Anything else (even -ngl 0) crashes with:
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
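For reference, the runs below correspond to llama-bench invocations roughly like these (the model filename is a placeholder, and -p 512 -n 512 simply matches the pp512/tg512 tests in the tables):

```
llama-bench -m tinyllama-1.1b-iq4_xs.gguf -ngl 22 -t 6 -b 32 -p 512 -n 512   # works: all repeating layers offloaded
llama-bench -m tinyllama-1.1b-iq4_xs.gguf -ngl 21 -t 6 -b 32 -p 512 -n 512   # hits the GGML_ASSERT above
```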
Example llama-bench outputs:
Vulkan (Q6_K):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
model | size | params | backend | ngl | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 512.52 ± 0.18 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 159.35 ± 0.32 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 498.63 ± 0.26 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 141.69 ± 0.38 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | pp512 | 462.52 ± 0.19 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | tg512 | 127.42 ± 0.55 |
build: ba68309d (3163)
Vulkan (IQ4_XS):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
model | size | params | backend | ngl | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 98.00 ± 0.20 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 12.60 ± 0.03 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 94.57 ± 1.02 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 12.43 ± 0.15 |
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
CPU (IQ4_XS):
model | size | params | backend | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | pp512 | 185.04 ± 4.81 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | tg512 | 57.17 ± 1.08 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | pp512 | 127.78 ± 2.52 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | tg512 | 61.14 ± 1.07 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | pp512 | 24.71 ± 0.05 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | tg512 | 21.14 ± 0.05 |
build: ba68309d (3163)
Additional info
Vulkan backend built using: cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64
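followed by the standard CMake build step for a multi-config generator (shown only for completeness; nothing unusual here):

```
cmake --build . --config Release
```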
The output with I-quants doesn't look broken when it works; it's just far too slow compared to legacy quants or K-quants.
(The build SHA doesn't match any upstream commit because of some unrelated local changes on my end, rebased on top of 21be9ca; don't mind it.)
Name and Version
version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03