Description
What happened?
I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit very slowly (token generation is even slower than with a single CPU thread).
However, it only works when at least all repeating layers (every layer except the last, non-repeating one) are offloaded to the GPU. Anything else (even -ngl 0) crashes with:
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
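For reference, the runs below correspond to llama-bench invocations roughly like these (the model filename is a placeholder, and -p 512 -n 512 simply matches the pp512/tg512 tests in the tables):

```
llama-bench -m tinyllama-1.1b-iq4_xs.gguf -ngl 22 -t 6 -b 32 -p 512 -n 512   # works: all repeating layers offloaded
llama-bench -m tinyllama-1.1b-iq4_xs.gguf -ngl 21 -t 6 -b 32 -p 512 -n 512   # hits the GGML_ASSERT above
```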
Example llama-bench outputs:
Vulkan (Q6_K):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
model | size | params | backend | ngl | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 512.52 ± 0.18 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 159.35 ± 0.32 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 498.63 ± 0.26 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 141.69 ± 0.38 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | pp512 | 462.52 ± 0.19 |
llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | tg512 | 127.42 ± 0.55 |
build: ba68309d (3163)
Vulkan (IQ4_XS):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
model | size | params | backend | ngl | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 98.00 ± 0.20 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 12.60 ± 0.03 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 94.57 ± 1.02 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 12.43 ± 0.15 |
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
CPU (IQ4_XS):
model | size | params | backend | threads | n_batch | test | t/s |
---|---|---|---|---|---|---|---|
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | pp512 | 185.04 ± 4.81 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | tg512 | 57.17 ± 1.08 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | pp512 | 127.78 ± 2.52 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | tg512 | 61.14 ± 1.07 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | pp512 | 24.71 ± 0.05 |
llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | tg512 | 21.14 ± 0.05 |
build: ba68309d (3163)
Additional info
Vulkan backend built using: cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64
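followed by the standard CMake build step for a multi-config generator (shown only for completeness; nothing unusual here):

```
cmake --build . --config Release
```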
The output with I-quants doesn't look broken when it works; it's just far too slow compared to legacy quants or K-quants.
(The build SHA doesn't match any upstream commit because of some unrelated local changes on my end, rebased on top of 21be9ca; don't mind it.)
Name and Version
version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03