
Eval bug: ~~Q2_K and Q3_K~~ Q8_0 not working on Vulkan anymore on RX 5700XT #10710

Closed
@stduhpf

Description


Name and Version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
version: 4820 (1a24c46)
built with MSVC 19.42.34435.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X + RX 5700 XT

Models

Any model that has Q8_0 tensors in it.

Problem description & steps to reproduce

Complete gibberish/noise output.

I noticed this issue with stable-diffusion.cpp at first, but I can reproduce it here.

To reproduce, start inference with any Q8_0 model with -ngl set to anything other than 0.
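
If no Q8_0 model is at hand, one can be produced from an f16 GGUF with the llama-quantize tool from the same build (the file names here are placeholders):

.\build\bin\Release\llama-quantize.exe .\models\gemma-2b-f16.gguf .\models\gemma-2b-Q8_0.gguf Q8_0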

First Bad Commit

fbeda90 (#12015)

Relevant log output

Example command:

.\build\bin\Release\llama-cli.exe -m .\models\gemma-2b-Q8_0.gguf -no-cnv -ngl 19 -t 6 -tb 12 -p "The meaning of life is"

Output:

 The meaning of life is increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa
llama_perf_sampler_print:    sampling time =      13.57 ms /    87 runs   (    0.16 ms per token,  6410.26 tokens per second)
llama_perf_context_print:        load time =    2080.08 ms
llama_perf_context_print: prompt eval time =      23.16 ms /     6 tokens (    3.86 ms per token,   259.09 tokens per second)
llama_perf_context_print:        eval time =     879.56 ms /    80 runs   (   10.99 ms per token,    90.95 tokens per second)
llama_perf_context_print:       total time =     936.59 ms /    86 tokens
Interrupted by user

Reverting fbeda90 fixes it.
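
To confirm the regression sits in the Vulkan Q8_0 kernels rather than anywhere model-specific, the backend op tests can be filtered down to the matrix multiplications; a sketch, assuming the same Release build layout and that the device registers as Vulkan0:

.\build\bin\Release\test-backend-ops.exe test -b Vulkan0 -o MUL_MAT

Any Q8_0 case reported as FAILED there would isolate the problem to the shader path changed in fbeda90.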

Older Q2_K/Q3_K-related issue (fixed by adc5dd9, #11081)

Name and Version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
version: 4277 (c5ede38)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X + RX 5700 XT

Models

Any model that has Q3_K or Q2_K tensors in it.

Problem description & steps to reproduce

Complete gibberish/noise output.

I noticed this issue with stable-diffusion.cpp at first, but I can reproduce it here.

To reproduce, start inference with any Q3_K or Q2_K variant model with -ngl set to anything other than 0.

First Bad Commit

4a57d36 (#10459)

Relevant log output

Example command:

.\build\bin\Release\llama-cli.exe -m .\models\Mistral-7B-v0.2-hf-Q3_K_L.gguf -ngl 24 -t 6 -tb 12 -p "The meaning of life is"

Output:

The meaning of life is to- kur m jel ul tawa Computkow Ydorfico oobeckagles “anga ACenzei Roose Asto__(ingle Phillieraspace TheFAILEDello securózannieloilloemente GabrielóniałrivatemulticolManocaluckangle>@‑inghamulle pagina Steinentoadyodenzes Armindowtexlä v Ronald incre bioExitocyniadelphiaumper globutescison sear lifestyle proto Kotiek po cadutes Eng randCl byaginganziagedrafla cad- extern met  externward Kyere collectenteryenta‎ divisionsExternaleryy Aubore2� Yale randomirkFBimanneman hyd BrowFB Maj Majalaky audanning Ex ternal -neylitter Intentanningky amaperlDsek  Britats unit andraportyo am… Egyptian portionandraandeentob – indirectibaentoicigeb associate1田 ##icijays Lyiana auditentoawPy import Girapy TheMky X Himery  departmentyyyiba1iba indirect n #isterschaftciProrico Industrial #aniric Palm indirectBici patPyy –hetriky ### AtlantaidleBazialaaran Mediterranean matter sl m South experekylie------ofsy Meyainsottoannedento- corporBOestic /******/entopythonats eternainsalian Gir expery # Sar‟eloalfentaahaelfonomPal rigidento bon bon Pdas palanda P Muhammadentoít SubPy ###GAentoeterenta Palm Kabâ Cecenta8entonuoltyBotaueraperendlento Ec pyento externâ accentburgaper Klaly
llama_perf_sampler_print:    sampling time =      12.93 ms /   319 runs   (    0.04 ms per token, 24665.58 tokens per second)
llama_perf_context_print:        load time =    3158.38 ms
llama_perf_context_print: prompt eval time =     262.98 ms /     6 tokens (   43.83 ms per token,    22.82 tokens per second)
llama_perf_context_print:        eval time =   13525.70 ms /   312 runs   (   43.35 ms per token,    23.07 tokens per second)
llama_perf_context_print:       total time =   13823.14 ms /   318 tokens
Interrupted by user

Reverting 4a57d36 fixes it.
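
For reference, first bad commits like the two above can be pinpointed with a plain git bisect, rebuilding and rerunning the repro command at each step (<known-good-rev> is a placeholder for the last revision that still produced coherent output):

git bisect start
git bisect bad master
git bisect good <known-good-rev>
# rebuild, rerun llama-cli with -ngl > 0, then mark the checkout:
git bisect good    # or: git bisect bad
git bisect reset   # return to the original branch when done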
