KV cache swapping behaviour? #17283
I can't find any decent information that sheds light on what's going on. Could someone please point me in the right direction or explain how this works? I run a model with one slot and a context size matched to the VRAM left free after loading the model, i.e. weights, KV cache and compute buffers all fit in VRAM (-ngl 1 -c 77000 -np 1). I get 60 tps. Then I increase the number of slots to 2 and double the context size, changing nothing else (-ngl 1 -c 154000 -np 2). Expectation: with only one slot in use, the speed stays the same.
Nope: I get 10 tps, even with a very short prompt (10 tokens). So how does KV cache offloading work? Same thing, different situation: I have one slot with a 154000 context, but only enough free VRAM for 77000. I send a 10000-token prompt. Expectation: this fits in VRAM as long as my actual prompt does not grow beyond 77000. Reality: 10 tps from the start. Why?
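To put numbers on the two configurations, here is a rough sketch of the arithmetic. It assumes the KV cache is allocated up front for the full -c value rather than growing with the prompt, and it uses the standard per-token KV size formula for a transformer. The model shape (32 layers, 8 KV heads, head dimension 128, f16 cache) is hypothetical and only there for illustration; substitute the values of the model actually in use.

```python
# Rough KV cache sizing sketch. The model shape below is a hypothetical
# example, not taken from the discussion.

def per_slot_context(n_ctx: int, n_parallel: int) -> int:
    """Context available to each slot when -c is split evenly over -np slots."""
    return n_ctx // n_parallel


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    # Factor 2 covers the K tensor and the V tensor.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape (assumption for illustration only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for n_ctx, n_parallel in [(77000, 1), (154000, 2)]:
    total = kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"-c {n_ctx} -np {n_parallel}: "
          f"{per_slot_context(n_ctx, n_parallel)} tokens per slot, "
          f"{total / 2**30:.2f} GiB KV cache in total")
```

Under these assumptions both configurations give each slot 77000 tokens, but the second one reserves twice as much memory for the KV cache no matter how short the prompt is. That would be consistent with the slowdown appearing from the very first token.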
Replies: 2 comments
I'm also confused by the caching behavior. You're right that the context per slot is halved when setting -np 2, so by doubling the server context you get the same context per slot. But what I think happens is that the whole cache for all slots is kept in VRAM at all times. This could be something that only happens when you use slots. Since I want to switch between many separate agents, my workaround is to use the functions that save the KV cache to disk and then reuse a slot. This is still a lot faster than recomputing long message histories, but obviously not ideal. There should really be a way to make it work as you say, by offloading to RAM.
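For anyone wanting to try the save-to-disk workaround, a minimal sketch of what the requests could look like follows. It assumes the llama.cpp server was started with --slot-save-path and exposes POST /slots/{id}?action=save and action=restore with a JSON body containing "filename"; check the server README of your build, since this may differ between versions. The helper names, base URL and filenames are hypothetical.

```python
import json
import urllib.request


def build_slot_request(base_url: str, slot_id: int, action: str,
                       filename: str) -> urllib.request.Request:
    """Build (but do not send) a slot save/restore request.

    Assumes the server accepts POST /slots/{id}?action=save|restore
    with a JSON body {"filename": ...}.
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example: park agent A's cache on disk, then load agent B into the same slot.
# send(build_slot_request("http://localhost:8080", 0, "save", "agent_a.bin"))
# send(build_slot_request("http://localhost:8080", 0, "restore", "agent_b.bin"))
```

This keeps a single slot, and therefore a single slot's worth of KV cache in VRAM, at the cost of a disk round trip on every agent switch.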
This is addressed by #20993: idle slots' KV is cleared from VRAM when LLAMA_KV_KEEP_ONLY_ACTIVE=1 is set.