Description
Feature request
Paged attention has been adopted by many serving engines, e.g., vLLM and TensorRT-LLM.
Motivation
The KV cache reduces computation in the decoder layers, but it also introduces memory overhead. For example, with beam search the KV cache has to be reordered according to the latest beam indices, and the current key/value tensors have to be concatenated with the KV cache inside the attention layer so that scaled dot-product attention can see the entire context. When the sequence is very long, this memory overhead becomes a performance bottleneck.
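To make the overhead concrete, below is a minimal, illustrative sketch (not a proposed implementation) contrasting a contiguous KV cache under beam search with a paged, block-based layout. All names here (`BLOCK_SIZE`, `block_pool`, `block_tables`, `append_token`) are hypothetical and only meant to show the idea; a real implementation would also need copy-on-write / reference counting when beams share blocks.

```python
import torch

num_beams, num_heads, head_dim = 4, 8, 64

# --- Contiguous cache: every decode step re-concatenates and reorders the whole history ---
kv_cache = torch.zeros(num_beams, num_heads, 0, head_dim)   # [beams, heads, seq, dim]
for step in range(3):
    new_kv = torch.randn(num_beams, num_heads, 1, head_dim)
    kv_cache = torch.cat([kv_cache, new_kv], dim=2)          # copies the entire history
    beam_idx = torch.randint(0, num_beams, (num_beams,))     # latest beam reordering
    kv_cache = kv_cache.index_select(0, beam_idx)            # another full copy

# --- Paged layout: fixed-size blocks addressed through small per-beam block tables ---
BLOCK_SIZE, NUM_BLOCKS = 16, 64
block_pool = torch.zeros(NUM_BLOCKS, num_heads, BLOCK_SIZE, head_dim)
block_tables = [[] for _ in range(num_beams)]  # per-beam list of block ids
free_blocks = list(range(NUM_BLOCKS))
seq_lens = [0] * num_beams

def append_token(beam: int, kv: torch.Tensor) -> None:
    """Write one token's key/value ([heads, dim]) into the beam's current block."""
    if seq_lens[beam] % BLOCK_SIZE == 0:                 # current block is full, grab a new one
        block_tables[beam].append(free_blocks.pop())
    blk = block_tables[beam][-1]
    block_pool[blk, :, seq_lens[beam] % BLOCK_SIZE] = kv
    seq_lens[beam] += 1

for beam in range(num_beams):
    append_token(beam, torch.randn(num_heads, head_dim))

# Beam reordering now only remaps the small block tables instead of copying KV tensors
# (real engines such as vLLM share blocks and copy them lazily on write).
```

With the contiguous layout, both the `torch.cat` and the `index_select` copy tensors whose size grows linearly with sequence length, which is exactly the bottleneck described above; with the paged layout, the per-step work is bounded by the block size and the block-table bookkeeping.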
Your contribution
No PR yet