Name and Version
load_backend: loaded RPC backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) 8060S Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-cpu-zen4.dll
version: b7974(1e8924f)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server.exe --no-mmap -kvu --parallel 4 -b 4096 -ub 1024 --jinja --slots --metrics --verbose-prompt --ctx-size 196608 -m "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" --mmproj "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct.mmproj-f16.gguf"
Problem description & steps to reproduce
When running llama-server with -kvu and --parallel 4, any inactive slots ("is_processing": false) slow down token generation.
First Bad Commit
I am not sure when it first appeared. I ran into it a couple of months ago as well, but did not investigate closely at the time.
To reproduce:
1. Start llama-server with -kvu and --parallel 4.
2. Send a single request with a longer prompt. Everything works as expected.
3. Send another request with the start of the prompt slightly modified to avoid caching. This should put the request in a different slot, and the tg throughput should be slightly lower. If you look at the /slots endpoint while it is processing, you'll see that one slot is active ("is_processing": true), one slot is inactive (full JSON populated, "is_processing": false) and two slots are blank/uninitialized (no "id_task" or "params" in the JSON structure).
4. Send a 3rd request and, after it finishes, a 4th. Each one results in a further tg slowdown, and during the 4th request there should be 3 inactive slots and 1 active slot.

From here on, all slots have the full JSON structure (with "id_task" and "params" populated), and any further requests have slowed-down tg until llama-server is restarted.
It seems that when -kvu is used, the context held by inactive slots ("is_processing": false, with "id_task" and "params" populated) slows down the tg speed of the active slot. pp is not affected.
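To make the slot states easier to check while reproducing, here is a minimal Python sketch that classifies the /slots output into active, inactive and blank slots using only the fields mentioned above ("is_processing", "id_task", "params"). The function names and the base URL (http://127.0.0.1:8080) are my own assumptions for illustration, and the server must be started with --slots for the endpoint to respond.

```python
import json
import urllib.request


def classify_slot(slot: dict) -> str:
    """Classify one entry of the /slots JSON array."""
    if slot.get("is_processing"):
        return "active"
    # A slot that has served a request keeps "id_task" / "params" populated.
    if "id_task" in slot or "params" in slot:
        return "inactive"
    # Never used since server start: neither field is present.
    return "blank"


def summarize_slots(slots: list) -> dict:
    """Count how many slots are in each state."""
    counts = {"active": 0, "inactive": 0, "blank": 0}
    for slot in slots:
        counts[classify_slot(slot)] += 1
    return counts


def fetch_slots(base_url: str = "http://127.0.0.1:8080") -> list:
    """Fetch the slot list; base_url is an assumption, adjust to your setup."""
    with urllib.request.urlopen(base_url + "/slots") as resp:
        return json.load(resp)


# Usage while a request is being processed:
#   print(summarize_slots(fetch_slots()))
```

During the second request in the steps above, I would expect this to report 1 active, 1 inactive and 2 blank slots.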
The behavior is the same on CUDA and Vulkan (AMD), both on Windows.
Attached is a screenshot from llama-swap that shows 6 sequential requests right after a llama-server restart, with tg slowing down over the first 4 requests as the uninitialized/blank slots get populated. From there, tg stays slow until llama-server is restarted.
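For anyone who wants numbers instead of a screenshot, the sequence can be scripted roughly as below. This is only a sketch under assumptions: it posts to the /completion endpoint and reads tg from a "timings" -> "predicted_per_second" field in the response, which should be verified against the build in use. The helper names, base URL and n_predict value are hypothetical.

```python
import json
import urllib.request


def make_prompt(run_index: int, body: str) -> str:
    """Change the very start of the prompt so the cached prefix is not reused."""
    return f"[run {run_index}] {body}"


def extract_tg(response: dict) -> float:
    """Read generation speed (tokens/s); the field name is an assumption."""
    return float(response["timings"]["predicted_per_second"])


def relative_tg(tg_values: list) -> list:
    """Express each run's tg as a fraction of the first run's tg."""
    first = tg_values[0]
    return [tg / first for tg in tg_values]


def run_once(run_index: int, body: str,
             base_url: str = "http://127.0.0.1:8080",
             n_predict: int = 256) -> float:
    """Send one request and return its tg; passing data makes this a POST."""
    payload = json.dumps({
        "prompt": make_prompt(run_index, body),
        "n_predict": n_predict,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_tg(json.load(resp))


# Usage right after a server restart, with long_text holding a longer prompt:
#   tgs = [run_once(i, long_text) for i in range(6)]
#   print(relative_tg(tgs))
```

Running the requests sequentially matters here, because the slowdown is observed with only one active slot at a time.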
Relevant log output
Logs