-
Not sure if this is what you mean, but after #13194 is merged you may be able to do something like this by forcing SWA to be enabled, but that may require modifying the …
-
Would there be value in exploring an override flag, similar to --override-tensor, that allows overriding the size of the KV cache per layer? E.g. a model with 21 layers (one of them the embedding layer) and a trained context of 32k would likely have a multi-GB KV cache without any override; but with one, 19 layers could have their KV cache capped at, say, 1024 entries each, while one layer still attends to the full 32k context.
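To make that concrete, here's a minimal sketch of the data shape such a flag could resolve to internally. This is entirely hypothetical: llama.cpp has no per-layer KV-cache size override today, and every name below is invented for illustration.

```cpp
// Hypothetical sketch only: llama.cpp has no per-layer KV-cache size override,
// and every name here is invented. It just shows the per-layer sizing a flag
// like the one described above could map to.
#include <cstdint>
#include <vector>

// Number of KV entries each attention layer is allowed to keep.
using kv_layer_sizes = std::vector<uint32_t>;

// 20 attention layers, a 32k trained context: cap 19 layers at a small
// rolling window and let one late layer keep the full context.
kv_layer_sizes make_kv_overrides(uint32_t n_layers,    // attention layers, e.g. 20
                                 uint32_t n_ctx_full,  // trained context, e.g. 32768
                                 uint32_t n_ctx_small, // window size, e.g. 1024
                                 uint32_t full_layer)  // which layer keeps the full context
{
    kv_layer_sizes sizes(n_layers, n_ctx_small);
    sizes[full_layer] = n_ctx_full;
    return sizes;
}
```

With those numbers the cache would hold 19 × 1024 + 32768 = 52,224 entries instead of 20 × 32768 = 655,360, i.e. roughly 8% of the default allocation.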
Conceptually it's similar to Memorizing Transformers, where the models had a trained context of 512 to 1024 tokens, and for long-range dependencies one layer (a later layer, somewhere in the last quarter of the model's layers) was modified to do a k-nearest-neighbour lookup into a separate, non-differentiable KV store that could hold up to 262k tokens.
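For reference, the retrieval step in that design is just a k-nearest-neighbour search over stored key vectors. A brute-force sketch of it (again hypothetical, not llama.cpp code; the paper uses approximate kNN so the store can scale to 262k entries):

```cpp
// Brute-force kNN lookup over an external (key, value) store, in the spirit
// of Memorizing Transformers: one late layer retrieves the top-k entries and
// attends over their values alongside its local context window.
#include <algorithm>
#include <cstddef>
#include <vector>

struct kv_entry {
    std::vector<float> key;   // key vector from a past token
    std::vector<float> value; // matching value vector
};

static float dot(const std::vector<float> & a, const std::vector<float> & b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        s += a[i]*b[i];
    }
    return s;
}

// Return the indices of the k stored keys with the highest dot-product
// similarity to the query.
std::vector<size_t> knn_lookup(const std::vector<kv_entry> & store,
                               const std::vector<float> & query, size_t k) {
    std::vector<float> score(store.size());
    for (size_t i = 0; i < store.size(); ++i) {
        score[i] = dot(store[i].key, query);
    }
    std::vector<size_t> idx(store.size());
    for (size_t i = 0; i < idx.size(); ++i) {
        idx[i] = i;
    }
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](size_t a, size_t b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx;
}
```

The retrieved values would then be attended over together with the layer's local window, with the store itself kept outside the gradient path.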
I have no expectation that this would yield good perplexity with a typical model out of the box, since it might take finetuning before the model performs adequately, but I think there'd be value in exploring it, or at the least in someone with more experience and knowledge pondering the idea.