Misc. bug: llama-server with -kvu and --parallel 4 slows down tg with more inactive slots

### Name and Version

load_backend: loaded RPC backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) 8060S Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-cpu-zen4.dll
version: b7974(1e8924fd6)
built with Clang 19.1.5 for Windows x86_64

### Operating systems

Windows

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
llama-server.exe --no-mmap -kvu --parallel 4 -b 4096 -ub 1024 --jinja --slots --metrics --verbose-prompt --ctx-size 196608 -m "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" --mmproj "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct.mmproj-f16.gguf"
```

### Problem description & steps to reproduce

When running llama-server with -kvu and --parallel 4, any inactive slots ("is_processing": false) slow down token generation (tg).

The exact steps are listed under "To reproduce" below.

### First Bad Commit

I am not sure when it first appeared. I ran into it a couple of months ago as well, but did not look into it very deeply then.

### To reproduce:

Start llama-server with -kvu and --parallel 4.

Send a single request with a long prompt. Everything works as expected.

Send a second request with the start of the prompt slightly modified so that the prompt cache is not reused. This should put the request in a different slot. The tg throughput will be slightly lower than for the first request.
If you look at the /slots endpoint while the request is processing, you'll see that one slot is active ("is_processing": true), one slot is inactive (full JSON populated, "is_processing": false), and two slots are blank/uninitialized (no "id_task" or "params" in the JSON structure).
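The three slot states described above can be counted with a small script. This is a minimal sketch, not part of llama-server: the base URL `http://127.0.0.1:8080` and the helper names are assumptions for illustration, and the classification only relies on the fields mentioned in this report ("is_processing", "id_task", "params"). The server must be started with --slots for the endpoint to respond.

```python
import json
import urllib.request


def classify_slot(slot: dict) -> str:
    """Return 'active', 'inactive', or 'blank' for one entry of the /slots response."""
    if slot.get("is_processing"):
        return "active"
    # A slot that has served a request carries "id_task" and/or "params".
    if "id_task" in slot or "params" in slot:
        return "inactive"
    return "blank"


def summarize_slots(slots: list) -> dict:
    """Count how many slots fall into each state."""
    counts = {"active": 0, "inactive": 0, "blank": 0}
    for slot in slots:
        counts[classify_slot(slot)] += 1
    return counts


def fetch_slots(base_url: str = "http://127.0.0.1:8080") -> list:
    """Fetch the slot list. The base URL is an assumption; adjust host and port."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(summarize_slots(fetch_slots()))
```

Running it while the second request is generating should report one active, one inactive, and two blank slots if the behavior matches the description.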

Sending a 3rd request and then, after it finishes, a 4th results in further tg slowdown. During the 4th request you should have 3 inactive slots and 1 active slot.

From here on, all slots have the full JSON structure (with "id_task" and "params" populated), and every further request has slowed-down tg until llama-server is restarted.
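The sequence above can be scripted to make the slowdown easier to compare across runs. The following is a rough sketch under stated assumptions: it posts to the `/completion` endpoint and reads `timings.predicted_per_second` from the response, which is my understanding of the llama-server HTTP API and should be checked against the build in use. The base URL, prompt text, `n_predict` value, and all helper names are hypothetical. Changing the first line of each prompt is meant to avoid prompt cache reuse. Whether that actually lands each request in a new slot depends on the server's slot selection.

```python
import json
import urllib.request


def make_prompt(i: int, body: str) -> str:
    """Give each request a different first line so cached prefixes do not match."""
    return f"Request variant {i}.\n{body}"


def tg_speed(response: dict) -> float:
    """Extract generation speed in tokens/s (assumed field: timings.predicted_per_second)."""
    return float(response["timings"]["predicted_per_second"])


def relative_speeds(speeds: list) -> list:
    """Express each measurement as a fraction of the first request's speed."""
    first = speeds[0]
    return [s / first for s in speeds]


def complete(prompt: str, base_url: str = "http://127.0.0.1:8080", n_predict: int = 128) -> dict:
    """Send one blocking completion request. URL and n_predict are assumptions."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder long prompt; any sufficiently long text will do.
    body = "Summarize the following text.\n" + "lorem ipsum dolor sit amet " * 2000
    speeds = []
    for i in range(6):  # sequential requests, one at a time
        speeds.append(tg_speed(complete(make_prompt(i, body))))
        print(f"request {i + 1}: {speeds[-1]:.2f} tokens/s")
    print("relative to first request:", [round(r, 2) for r in relative_speeds(speeds)])
```

If the report holds, the relative values should drop over the first four requests and then stay low until the server is restarted.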

It seems that when -kvu is used, the context of 'inactive' slots ("is_processing": false, with "id_task" and "params" populated) slows down the tg speed of the 'active' slot. Prompt processing (pp) is not affected.

The behavior is the same on CUDA and Vulkan (AMD), both on Windows.

Attached is a screenshot from llama-swap that shows 6 sequential requests right after a llama-server restart, with tg slowing down over the first 4 requests as the uninitialized/blank slots get populated. From there, tg stays slow until llama-server is restarted.

<img width="1360" height="515" alt="Image" src="https://github.com/user-attachments/assets/771cb1fc-7768-4594-b849-a7a4e43d1fd8" />

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>

Misc. bug: llama-server with -kvu and --parallel 4 slows down tg with more inactive slots #19523
