Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600 (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | matrix cores: none
version: 4932 (9ffcc9e)
built with MSVC 19.43.34808.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
I use the following .bat to start the server, then interact with it only through llama-server's built-in web UI:
echo Running Mistral Small 3.1 2503 24B, 12288 context
llama-server.exe ^
--model "D:\LLMs\mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf" ^
--gpu-layers 14 ^
--ctx-size 12288 ^
--temp 0.2
Problem description & steps to reproduce
Going from release b4916 (llama-b4916-bin-win-vulkan-x64) to b4932 (llama-b4932-bin-win-vulkan-x64), I noticed that models (not just this Mistral) still load partially into VRAM exactly as before, but inference is significantly slower and the CPU is being hammered. This never happened with any of the many previous releases.
Checking VRAM usage in Task Manager shows the model appears to be loaded correctly (exactly like before), yet performance is almost as if the GPU were not being used at all. It was rock solid before and is now terrible.
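To quantify the regression, the same workload can be timed with llama-bench from each release archive. This is just a sketch using llama-bench's standard flags (-m, -ngl, -p, -n) with my model path and offload settings; run it once with the b4916 binaries and once with b4932:

rem same model, same 14 offloaded layers; compare reported t/s between releases
llama-bench.exe -m "D:\LLMs\mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf" -ngl 14 -p 512 -n 128

The prompt-processing and generation t/s it prints should make the slowdown concrete rather than anecdotal.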
First Bad Commit
No response