Misc. bug: llama-server with -kvu and --parallel 4 slows down tg with more inactive slots #19523

@smitranic

Name and Version

load_backend: loaded RPC backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) 8060S Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\llm\Apps\llama-b7974-bin-win-vulkan-x64\ggml-cpu-zen4.dll
version: b7974(1e8924f)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe --no-mmap -kvu --parallel 4 -b 4096 -ub 1024 --jinja --slots --metrics --verbose-prompt --ctx-size 196608 -m "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" --mmproj "C:\Users\llm\Downloads\AI Models\Qwen3-VL-30B-A3B-Instruct.mmproj-f16.gguf"

Problem description & steps to reproduce

When running llama-server with -kvu and --parallel 4, any inactive slots ("is_processing": false) slow down token generation.

First Bad Commit

I am not 100% sure when it first happened. I ran into it a couple of months ago as well, but didn't look into it very deeply then.

To reproduce:

1. Start llama-server with -kvu and --parallel 4.

2. Send a single request with a longer prompt. Everything works as expected.

3. Send another request, slightly modifying the start of the prompt to avoid a cache hit. This should put the request in a different slot. The tg throughput should be slightly lower. If you look at the /slots endpoint while it is processing, you'll see one active slot ("is_processing": true), one inactive slot (full JSON populated, "is_processing": false), and two blank/uninitialized slots (no "id_task" or "params" in the JSON structure).

4. Send a 3rd request and then, after it finishes, a 4th. Each results in further tg slowdown, and during the 4th request you should have 3 inactive slots and 1 active.

5. From here on, all slots have the full JSON structure (with "id_task" and "params" populated), and any further requests have slowed tg until llama-server is restarted.
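
For anyone trying to confirm the slot states described in these steps, here is a minimal Python sketch that polls /slots and counts active, inactive, and blank slots. It assumes the server listens on http://127.0.0.1:8080 and that /slots returns a JSON list of slot objects with the "is_processing", "id_task", and "params" keys mentioned above; the helper names (`fetch_slots`, `classify_slot`, `summarize_slots`) are made up for illustration.

```python
import json
import urllib.request


def fetch_slots(base_url="http://127.0.0.1:8080"):
    """GET /slots from a llama-server started with --slots (URL is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=10) as resp:
        return json.load(resp)


def classify_slot(slot):
    """Map one slot object to 'active', 'inactive', or 'blank'.

    active   -> "is_processing" is true
    inactive -> not processing, but "id_task" or "params" is present
    blank    -> not processing and neither key is present
    """
    if slot.get("is_processing"):
        return "active"
    if "id_task" in slot or "params" in slot:
        return "inactive"
    return "blank"


def summarize_slots(slots):
    """Count how many slots fall into each state."""
    counts = {"active": 0, "inactive": 0, "blank": 0}
    for slot in slots:
        counts[classify_slot(slot)] += 1
    return counts
```

Calling `summarize_slots(fetch_slots())` while the second request is generating should, if the description above holds, report one active, one inactive, and two blank slots.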

It seems that when -kvu is used, the context of 'inactive' slots ("is_processing": false, with "id_task" and "params" populated) slows down the tg speed of 'active' slots. Prompt processing (pp) is not affected.

The behavior is the same on CUDA and Vulkan (AMD), both on Windows.

Attached is a screenshot from llama-swap that shows 6 sequential requests right after a llama-server restart, with tg slowing down over the first 4 requests as the uninitialized/blank slots get populated. From there, tg stays slow until llama-server is restarted.

Image
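
The same measurement can be taken without llama-swap. Below is a rough Python sketch, under the assumption that the server's /completion endpoint accepts "prompt" and "n_predict" and returns a "timings" object containing "predicted_per_second". The function names and the random-prefix trick for defeating the prompt cache are my own illustration, not part of llama.cpp.

```python
import json
import urllib.request
import uuid


def make_prompt(index, body):
    """Prepend a unique prefix so the start of the prompt never matches a cached one."""
    return f"[run {index} {uuid.uuid4().hex}] {body}"


def post_completion(base_url, prompt, n_predict=256):
    """POST one request to /completion and return the parsed JSON response."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


def tg_speed(response):
    """Token generation speed in tokens/s (assumes the 'timings' field is present)."""
    return response["timings"]["predicted_per_second"]


def slowdown_vs_first(speeds):
    """Percentage slowdown of each request relative to the first one."""
    first = speeds[0]
    return [round((1.0 - s / first) * 100.0, 1) for s in speeds]
```

Sending six requests one after another with `post_completion(base_url, make_prompt(i, long_text))`, collecting `tg_speed` for each, and passing the list to `slowdown_vs_first` should show the slowdown growing over the first four requests if the issue reproduces.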

Relevant log output

Logs
