Bug: Vulkan, I-quants partially working since PR #6210 (very slow, only with all repeating layers offloaded) #7976

Closed
@stduhpf

Description

What happened?

I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit very slowly: token generation is even slower than on a single CPU thread (12.60 t/s vs 21.14 t/s tg512 in the tables below).

But it only works if at least all layers except the last one (i.e. every "repeating" layer) are offloaded to the GPU. Anything else (even -ngl 0) crashes with GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
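For readers unfamiliar with that assert, here is a minimal sketch of the invariant it encodes; the vk_buffer struct and buffer_fits helper are hypothetical stand-ins, only the size comparison itself comes from the assert message:

```cpp
#include <cstddef>

// Hypothetical stand-in for the device buffer descriptor used in
// ggml-vulkan.cpp; only the size field matters for this check.
struct vk_buffer { size_t size; };

// The invariant the failing GGML_ASSERT encodes: the device buffer d_X must
// hold one 2D slice of x_sz bytes for each of the ne02 * ne03 batch entries.
// With -ngl below the repeating-layer count the buffer is apparently sized
// smaller than that product, so the check fires.
bool buffer_fits(const vk_buffer * d_X, size_t x_sz, size_t ne02, size_t ne03) {
    return d_X->size >= x_sz * ne02 * ne03;
}
```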

Example llama-bench outputs:

Vulkan (Q6_K):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model         |       size | params | backend | ngl | threads | n_batch | test  |           t/s |
| ------------- | ---------: | -----: | ------- | --: | ------: | ------: | ----- | ------------: |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  23 |       6 |      32 | pp512 | 512.52 ± 0.18 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  23 |       6 |      32 | tg512 | 159.35 ± 0.32 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  22 |       6 |      32 | pp512 | 498.63 ± 0.26 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  22 |       6 |      32 | tg512 | 141.69 ± 0.38 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  21 |       6 |      32 | pp512 | 462.52 ± 0.19 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan  |  21 |       6 |      32 | tg512 | 127.42 ± 0.55 |

build: ba68309d (3163)

Vulkan (IQ4_XS):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model                      |       size | params | backend | ngl | threads | n_batch | test  |          t/s |
| -------------------------- | ---------: | -----: | ------- | --: | ------: | ------: | ----- | -----------: |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan  |  23 |       6 |      32 | pp512 | 98.00 ± 0.20 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan  |  23 |       6 |      32 | tg512 | 12.60 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan  |  22 |       6 |      32 | pp512 | 94.57 ± 1.02 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan  |  22 |       6 |      32 | tg512 | 12.43 ± 0.15 |

(the next run, at ngl 21, aborts:)

GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03

CPU (IQ4_XS):

| model                      |       size | params | backend | threads | n_batch | test  |           t/s |
| -------------------------- | ---------: | -----: | ------- | ------: | ------: | ----- | ------------: |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |      12 |      32 | pp512 | 185.04 ± 4.81 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |      12 |      32 | tg512 |  57.17 ± 1.08 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |       6 |      32 | pp512 | 127.78 ± 2.52 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |       6 |      32 | tg512 |  61.14 ± 1.07 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |       1 |      32 | pp512 |  24.71 ± 0.05 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU     |       1 |      32 | tg512 |  21.14 ± 0.05 |

build: ba68309d (3163)
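For reference, tables like the ones above can be produced with llama-bench invocations along these lines; the model paths are placeholder assumptions, and the flags (-m, -ngl, -t, -b, -p, -n, with comma-separated sweeps) are standard llama-bench options:

```sh
# Placeholder model paths; -ngl sweeps the offload counts shown above.
llama-bench -m tinyllama-1.1b.Q6_K.gguf   -ngl 23,22,21 -t 6 -b 32 -p 512 -n 512
llama-bench -m tinyllama-1.1b.IQ4_XS.gguf -ngl 23,22,21 -t 6 -b 32 -p 512 -n 512
# CPU table: a CPU-only build is assumed, sweeping threads instead of ngl.
llama-bench -m tinyllama-1.1b.IQ4_XS.gguf -t 12,6,1 -b 32 -p 512 -n 512
```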

Additional info

Vulkan backend built using: cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64
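(For completeness, the full sequence under that generator would be the standard two CMake steps; the --build invocation is an assumption, not quoted from the report:)

```sh
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release
```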

The output with I-quants doesn't look broken when it is working; it's just far too slow compared to legacy or K-quants.

(The build SHA doesn't match any upstream commit because of some unrelated local changes on my end, rebased on top of 21be9ca; don't mind it.)

Name and Version

version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03

Labels: bug-unconfirmed, low severity