Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
version: 4820 (1a24c46)
built with MSVC 19.42.34435.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 5900X + RX 5700 XT
Models
Any model that has Q8_0 tensors in it.
Problem description & steps to reproduce
Complete gibberish/noise output.
I noticed this issue with stable-diffusion.cpp at first, but I can reproduce it here.
To reproduce, simply start inference with any Q8_0 model with -ngl set to anything other than 0.
First Bad Commit
fbeda90
Relevant log output
Example command:
.\build\bin\Release\llama-cli.exe -m .\models\gemma-2b-Q8_0.gguf -no-cnv -ngl 19 -t 6 -tb 12 -p "The meaning of life is"
Output:
The meaning of life is increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa
llama_perf_sampler_print: sampling time = 13.57 ms / 87 runs ( 0.16 ms per token, 6410.26 tokens per second)
llama_perf_context_print: load time = 2080.08 ms
llama_perf_context_print: prompt eval time = 23.16 ms / 6 tokens ( 3.86 ms per token, 259.09 tokens per second)
llama_perf_context_print: eval time = 879.56 ms / 80 runs ( 10.99 ms per token, 90.95 tokens per second)
llama_perf_context_print: total time = 936.59 ms / 86 tokens
Interrupted by user
Reverting fbeda90 fixes it.
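For reference, this is the Q8_0 block layout the backend has to decode, together with the scalar dequantization it should reproduce (a minimal CPU-side sketch mirroring the declarations in ggml-common.h; ggml_fp16_to_fp32 is ggml's public fp16 conversion helper):

#include <stdint.h>
#include "ggml.h"   // ggml_fp16_t, ggml_fp16_to_fp32

#define QK8_0 32

typedef struct {
    ggml_fp16_t d;         // per-block scale (fp16)
    int8_t      qs[QK8_0]; // 32 signed 8-bit quants
} block_q8_0;

// Reference dequantization of one block: x[i] = d * qs[i].
// Output that differs from this (e.g. the noise above) means the
// shader is misreading either the scale or the quants.
static void dequantize_block_q8_0(const block_q8_0 * b, float * y) {
    const float d = ggml_fp16_to_fp32(b->d);
    for (int i = 0; i < QK8_0; ++i) {
        y[i] = d * b->qs[i];
    }
}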
Older, related Q2_K/Q3_K issue (fixed by adc5dd9, #11081):
Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
version: 4277 (c5ede38)
built with MSVC 19.41.34120.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 5900X + RX 5700 XT
Models
Any model that has Q3_K or Q2_K tensors in it.
Problem description & steps to reproduce
Complete gibberish/noise output.
I noticed this issue with stable-diffusion.cpp at first, but I can reproduce it here.
To reproduce, simply start inference with any Q3_K or Q2_K quantized model with -ngl set to anything other than 0.
First Bad Commit
4a57d36
Relevant log output
Example command:
.\build\bin\Release\llama-cli.exe -m .\models\Mistral-7B-v0.2-hf-Q3_K_L.gguf -ngl 24 -t 6 -tb 12 -p "The meaning of life is"
Output:
The meaning of life is to- kur m jel ul tawa Computkow Ydorfico oobeckagles “anga ACenzei Roose Asto__(ingle Phillieraspace TheFAILEDello securózannieloilloemente GabrielóniałrivatemulticolManocaluckangle>@‑inghamulle pagina Steinentoadyodenzes Armindowtexlä v Ronald incre bioExitocyniadelphiaumper globutescison sear lifestyle proto Kotiek po cadutes Eng randCl byaginganziagedrafla cad- extern met externward Kyere collectenteryenta divisionsExternaleryy Aubore2� Yale randomirkFBimanneman hyd BrowFB Maj Majalaky audanning Ex ternal -neylitter Intentanningky amaperlDsek Britats unit andraportyo am… Egyptian portionandraandeentob – indirectibaentoicigeb associate1田 ##icijays Lyiana auditentoawPy import Girapy TheMky X Himery departmentyyyiba1iba indirect n #isterschaftciProrico Industrial #aniric Palm indirectBici patPyy –hetriky ### AtlantaidleBazialaaran Mediterranean matter sl m South experekylie------ofsy Meyainsottoannedento- corporBOestic /******/entopythonats eternainsalian Gir expery # Sar‟eloalfentaahaelfonomPal rigidento bon bon Pdas palanda P Muhammadentoít SubPy ###GAentoeterenta Palm Kabâ Cecenta8entonuoltyBotaueraperendlento Ec pyento externâ accentburgaper Klaly
llama_perf_sampler_print: sampling time = 12.93 ms / 319 runs ( 0.04 ms per token, 24665.58 tokens per second)
llama_perf_context_print: load time = 3158.38 ms
llama_perf_context_print: prompt eval time = 262.98 ms / 6 tokens ( 43.83 ms per token, 22.82 tokens per second)
llama_perf_context_print: eval time = 13525.70 ms / 312 runs ( 43.35 ms per token, 23.07 tokens per second)
llama_perf_context_print: total time = 13823.14 ms / 318 tokens
Interrupted by user
Reverting 4a57d36 fixes it.
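For context, these are the K-quant super-block layouts involved, in a simplified view of what ggml-common.h declares (field comments are mine; the Vulkan shaders must decode these bit-exactly):

#include <stdint.h>
#include "ggml.h"   // ggml_fp16_t

#define QK_K 256    // K-quant super-blocks cover 256 weights

typedef struct {
    uint8_t scales[QK_K/16]; // 4-bit scale/min pairs, one per 16 weights
    uint8_t qs[QK_K/4];      // 2-bit quants
    ggml_fp16_t d;           // super-block scale for the scales
    ggml_fp16_t dmin;        // super-block scale for the mins
} block_q2_K;                // weight = d*sc*q - dmin*m

typedef struct {
    uint8_t hmask[QK_K/8];   // high bit of each 3-bit quant
    uint8_t qs[QK_K/4];      // low 2 bits of each quant
    uint8_t scales[12];      // 6-bit scales, packed
    ggml_fp16_t d;           // super-block scale
} block_q3_K;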