Description
Motivation.
There are more and more use cases where we need to transfer KV caches between vLLM instances or store KV caches for future use. Some concrete use cases:
- Disaggregated prefilling. In this case, the KV cache needs to be transferred from the prefilling instances to the decoding instances.
- The user wants to query a fixed set of long documents (examples: software manuals, internal documents, etc.). In this case, GPU memory plus CPU memory may not be enough to hold the KV cache of all documents, so we may want to store the KV cache of these documents and move it to the GPU on demand.
Proposed Change.
My current thought is to introduce two new abstractions: a communicator and a KV database. The workflow will be
vllm <--> communicator <--> KV database
where
- The communicator transfers data from `src` to `dst`, where both `src` and `dst` can be a KV block in vLLM or an entry in the database.
- The KV database is a database that uses the hash (generated by automatic prefix caching) as the key and the corresponding KV cache tensor as the value (a rough interface sketch is given below).
This will be a huge framework, with a wide range of challenging (but fun!) questions inside, including but not limited to:
- How to leverage infrastructures like NVLink to transfer KV cache faster?
- How to properly pipeline the KV cache transfer? (a rough sketch follows this list)
- How to make sure the blocks are not swapped out when the communicator is working?
- Compress KV cache during transfer or not? If so, which compression algorithm? Who compresses the cache?
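As one way to make the pipelining question concrete, here is a minimal sketch of overlapping device-to-host KV block copies with ongoing model execution, by issuing the copies into pinned host buffers on a side CUDA stream. This is illustrative only; the function name and the way blocks are passed in are assumptions, not vLLM code.

```python
# Hypothetical sketch of pipelined KV cache offload: device-to-host copies
# are issued on a side CUDA stream so they can overlap with model execution.
from typing import List

import torch


def stage_blocks_to_host(kv_blocks: List[torch.Tensor],
                         side_stream: torch.cuda.Stream) -> List[torch.Tensor]:
    """Asynchronously copy GPU KV blocks into pinned CPU buffers."""
    host_buffers = []
    with torch.cuda.stream(side_stream):
        for block in kv_blocks:
            # A pinned destination lets the D2H copy run asynchronously.
            host = torch.empty(block.shape, dtype=block.dtype,
                               device="cpu", pin_memory=True)
            host.copy_(block, non_blocking=True)
            host_buffers.append(host)
    return host_buffers


# Usage: issue the copies, keep running the model on the default stream,
# then synchronize before handing the buffers to the communicator / database.
# side_stream = torch.cuda.Stream()
# bufs = stage_blocks_to_host(blocks, side_stream)
# ... continue decoding ...
# side_stream.synchronize()
```

This also ties into the swap-out question above: the source blocks must stay resident (not swapped out or reused) until `side_stream` has been synchronized.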
Feel free to post any thoughts on the design! Is it good? Is this abstraction able to achieve optimal performance in your use cases?
Feedback Period.
Several weeks
CC List.
@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU
Any Other Things.
No response