-
Not sure if this is what you mean, but after #13194 is merged you may be able to do something like this by forcing SWA to be enabled, but that may require modifying the …
-
Would there be value in exploring an override flag, similar to --override-tensor, that allows overriding the size of the KV cache per layer? E.g. a model with 21 layers (one of them the embedding layer) and a trained context of 32k would likely have a multi-GB KV cache without any override; but with one, 19 layers could have their KV cache capped at, say, 1024 entries each, while one layer still attends to the full 32k context.
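To make that concrete, here's a minimal sketch of the data shape such a flag could resolve to internally. This is entirely hypothetical: llama.cpp has no per-layer KV-cache size override today, and every name below is invented for illustration.

```cpp
// Hypothetical sketch only: llama.cpp has no per-layer KV-cache size override,
// and every name here is invented. It just shows the per-layer sizing a flag
// like the one described above could map to.
#include <cstdint>
#include <vector>

// Number of KV entries each attention layer is allowed to keep.
using kv_layer_sizes = std::vector<uint32_t>;

// 20 attention layers, a 32k trained context: cap 19 layers at a small
// rolling window and let one late layer keep the full context.
kv_layer_sizes make_kv_overrides(uint32_t n_layers,    // attention layers, e.g. 20
                                 uint32_t n_ctx_full,  // trained context, e.g. 32768
                                 uint32_t n_ctx_small, // window size, e.g. 1024
                                 uint32_t full_layer)  // which layer keeps the full context
{
    kv_layer_sizes sizes(n_layers, n_ctx_small);
    sizes[full_layer] = n_ctx_full;
    return sizes;
}
```

With those numbers the cache would hold 19 × 1024 + 32768 = 52,224 entries instead of 20 × 32768 = 655,360, i.e. roughly 8% of the default allocation.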
Conceptually it's similar to Memorizing Transformers, where the models had a trained context of 512 to 1024 tokens, and for long-range dependencies one layer (a later layer, somewhere in the last quarter of the model's layers) was modified to do a k-nearest-neighbour lookup into a separate, non-differentiable KV store that could hold up to 262k tokens.
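For reference, the retrieval step in that design is just a k-nearest-neighbour search over stored key vectors. A brute-force sketch of it (again hypothetical, not llama.cpp code; the paper uses approximate kNN so the store can scale to 262k entries):

```cpp
// Brute-force kNN lookup over an external (key, value) store, in the spirit
// of Memorizing Transformers: one late layer retrieves the top-k entries and
// attends over their values alongside its local context window.
#include <algorithm>
#include <cstddef>
#include <vector>

struct kv_entry {
    std::vector<float> key;   // key vector from a past token
    std::vector<float> value; // matching value vector
};

static float dot(const std::vector<float> & a, const std::vector<float> & b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        s += a[i]*b[i];
    }
    return s;
}

// Return the indices of the k stored keys with the highest dot-product
// similarity to the query.
std::vector<size_t> knn_lookup(const std::vector<kv_entry> & store,
                               const std::vector<float> & query, size_t k) {
    std::vector<float> score(store.size());
    for (size_t i = 0; i < store.size(); ++i) {
        score[i] = dot(store[i].key, query);
    }
    std::vector<size_t> idx(store.size());
    for (size_t i = 0; i < idx.size(); ++i) {
        idx[i] = i;
    }
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](size_t a, size_t b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx;
}
```

The retrieved values would then be attended over together with the layer's local window, with the store itself kept outside the gradient path.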
I have no expectation that this would yield good perplexity with a typical model out of the box, since it might take finetuning before the model performs adequately, but I think there'd be value in exploring it, or at the least in someone with more experience and knowledge pondering the idea.