KV cache swapping behaviour? #17283

dark-penguin · 2025-11-15T08:39:17Z

I can't find any decent information that would shed light on what's going on. Could someone please point me in the right direction or explain how this works?

I run a model with one slot and a context size chosen to fill the VRAM left free after loading the model, so that weights, KV cache and compute buffers all fit in VRAM (-ngl 1 -c 77000 -np 1). I get 60 tokens per second (tps).

Then I increase the number of slots to 2 and double the context size, changing nothing else (-ngl 1 -c 154000 -np 2).
My understanding is:

As long as I only use the first slot, it's all in VRAM.
If I use the second slot, its KV cache gets swapped into VRAM, and the first slot's cache gets evicted into RAM, so performance stays the same as long as I'm only using one slot at a time.

Nope - I get 10 tps, even with a very short prompt (10 tokens).

Then how does KV cache offloading work?

Same thing, different situation: I have one slot with a context size of 154000, but only enough free VRAM for 77000. I send a prompt of 10000 tokens. Expectation: this fits in VRAM as long as the actual context in use does not grow beyond 77000. Reality: 10 tps from the beginning. Why?
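One possible explanation, offered as an assumption rather than confirmed behaviour: the KV cache is sized from -c when the server starts, not from the number of tokens currently in use. Under that assumption, a rough sketch of the arithmetic for the two configurations looks like this. The model dimensions below are hypothetical, and the formula ignores quantized cache types and any architecture-specific layout.

```python
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    """Context available to each slot when -c is split evenly across -np slots."""
    return n_ctx // n_parallel


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for a standard transformer.

    K and V each hold n_ctx * n_kv_heads * head_dim elements per layer.
    bytes_per_elem=2 corresponds to an f16 cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model dimensions, for illustration only.
MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

one_slot = kv_cache_bytes(77000, **MODEL)    # -c 77000  -np 1
two_slots = kv_cache_bytes(154000, **MODEL)  # -c 154000 -np 2

print("context per slot:", ctx_per_slot(154000, 2))
print(f"-c 77000:  {one_slot / 2**30:.1f} GiB")
print(f"-c 154000: {two_slots / 2**30:.1f} GiB")
```

If the whole buffer is reserved up front, the second configuration needs twice the KV memory of the first even when only 10 tokens are in use, which would match the slowdown appearing immediately.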

Answered by yychyo

Mar 25, 2026

This is addressed by #20993: idle slots' KV cache is cleared from VRAM when LLAMA_KV_KEEP_ONLY_ACTIVE=1 is set.

manfred-lindmark · 2026-01-09T13:29:08Z

I'm also confused by the caching behavior. You're right that the context per slot is halved with -np 2, so doubling the server context gives each slot the same context as before. But what I think happens is that the whole cache for all slots is kept in VRAM at all times. This may be something that only happens when you use slots.

My workaround, since I want to switch between many separate agents, is to use the functions that save the KV cache to disk and reuse a single slot. This is still much faster than recomputing long message histories, but obviously not ideal. There should really be a way to make it work as you describe, by offloading to RAM.
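A minimal sketch of that workaround, assuming "the functions to save kv cache to disk" refers to llama-server's slot save and restore endpoints (POST /slots/{id}?action=save and ?action=restore with a JSON "filename" field). The base URL, agent names and file naming scheme are hypothetical; `switch_agent` is an illustrative helper, not part of any library.

```python
import json
import urllib.request


def slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST request for a slot save/restore action."""
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def switch_agent(base_url, slot_id, outgoing, incoming,
                 send=urllib.request.urlopen):
    """Save the outgoing agent's cache to disk, then load the incoming agent's.

    `send` is injectable so the request sequence can be inspected without a
    running server; by default the requests go out over HTTP.
    """
    for action, agent in (("save", outgoing), ("restore", incoming)):
        send(slot_request(base_url, slot_id, action, f"{agent}.bin"))


# Example (requires a running server, so left commented out):
# switch_agent("http://localhost:8080", 0, "agent_a", "agent_b")
```

For these endpoints to work, the server has to be started with --slot-save-path pointing at a writable directory, and the restore step only succeeds if a file for the incoming agent was saved earlier.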

yychyo · 2026-03-25T14:41:40Z

This is addressed by #20993: idle slots' KV cache is cleared from VRAM when LLAMA_KV_KEEP_ONLY_ACTIVE=1 is set.
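As a sketch of how that setting might be applied to the two-slot configuration from the question: the snippet below only builds the command line and environment, and does not start anything. The variable name is taken from this reply; whether it has any effect depends on running a build that includes #20993, which is not verified here. The model path is a placeholder and `server_launch` is a hypothetical helper.

```python
import os


def server_launch(model_path, n_ctx, n_parallel):
    """Return (cmd, env) for starting llama-server with the variable set."""
    # Copy the current environment and add the variable named in the reply.
    env = dict(os.environ, LLAMA_KV_KEEP_ONLY_ACTIVE="1")
    cmd = [
        "llama-server",
        "-m", model_path,
        "-c", str(n_ctx),
        "-np", str(n_parallel),
    ]
    return cmd, env


cmd, env = server_launch("model.gguf", 154000, 2)  # placeholder model path
print(" ".join(cmd))

# To actually start the server:
# import subprocess
# subprocess.Popen(cmd, env=env)
```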
