Description
Feature request
Paged attention has been adopted by many serving engines, e.g., vLLM and TensorRT-LLM.
Motivation
The KV cache reduces computation in the decoder layers, but it also introduces memory overhead. For example, with beam search the KV cache has to be reordered according to the latest beam indices, and the current key/value tensors have to be concatenated with the KV cache inside the attention layer so that scaled dot-product attention can see the entire context. When the sequence is very long, this memory overhead becomes a performance bottleneck.
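To make the overhead concrete, below is a minimal, illustrative sketch (not a proposed implementation) contrasting a contiguous KV cache under beam search with a paged, block-based layout. All names here (`BLOCK_SIZE`, `block_pool`, `block_tables`, `append_token`) are hypothetical and only meant to show the idea; a real implementation would also need copy-on-write / reference counting when beams share blocks.

```python
import torch

num_beams, num_heads, head_dim = 4, 8, 64

# --- Contiguous cache: every decode step re-concatenates and reorders the whole history ---
kv_cache = torch.zeros(num_beams, num_heads, 0, head_dim)   # [beams, heads, seq, dim]
for step in range(3):
    new_kv = torch.randn(num_beams, num_heads, 1, head_dim)
    kv_cache = torch.cat([kv_cache, new_kv], dim=2)          # copies the entire history
    beam_idx = torch.randint(0, num_beams, (num_beams,))     # latest beam reordering
    kv_cache = kv_cache.index_select(0, beam_idx)            # another full copy

# --- Paged layout: fixed-size blocks addressed through small per-beam block tables ---
BLOCK_SIZE, NUM_BLOCKS = 16, 64
block_pool = torch.zeros(NUM_BLOCKS, num_heads, BLOCK_SIZE, head_dim)
block_tables = [[] for _ in range(num_beams)]  # per-beam list of block ids
free_blocks = list(range(NUM_BLOCKS))
seq_lens = [0] * num_beams

def append_token(beam: int, kv: torch.Tensor) -> None:
    """Write one token's key/value ([heads, dim]) into the beam's current block."""
    if seq_lens[beam] % BLOCK_SIZE == 0:                 # current block is full, grab a new one
        block_tables[beam].append(free_blocks.pop())
    blk = block_tables[beam][-1]
    block_pool[blk, :, seq_lens[beam] % BLOCK_SIZE] = kv
    seq_lens[beam] += 1

for beam in range(num_beams):
    append_token(beam, torch.randn(num_heads, head_dim))

# Beam reordering now only remaps the small block tables instead of copying KV tensors
# (real engines such as vLLM share blocks and copy them lazily on write).
```

With the contiguous layout, both the `torch.cat` and the `index_select` copy tensors whose size grows linearly with sequence length, which is exactly the bottleneck described above; with the paged layout, the per-step work is bounded by the block size and the block-table bookkeeping.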
Your contribution
No PR yet