
[RFC]: Implement disaggregated prefilling via KV cache transfer #5557

Open
KuntaiDu opened this issue Jun 14, 2024 · 16 comments

@KuntaiDu
Collaborator

Motivation.

There are more and more use cases where we need to transfer KV caches between vLLM instances or store KV caches for future use. Some concrete use cases:

  • Disaggregated prefilling. In this case, the KV cache needs to be transferred from the prefilling instances to the decoding instances.
  • The user wants to query a fixed set of long documents (examples: software manuals, internal documents, etc.). In this case, GPU memory + CPU memory may not be enough to store the KV cache of all documents, and we may want to store the KV cache of these documents and move it to the GPU on demand.

Proposed Change.

My current thought is to introduce two new abstractions: a communicator and a KV database. The workflow will be

vllm <--> communicator <--> KV database

where

  • The communicator transfers data from src to dst, where both src and dst can be a KV block in vLLM or an entry in the database.
  • The KV database is keyed by the block hash (generated by automatic prefix caching), with the corresponding KV cache tensor as the value. (A rough interface sketch is given below.)
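
To make the abstraction concrete, here is a minimal interface sketch. All names (KVDatabase, Communicator, put/get, send_block/recv_block) are hypothetical illustrations, not existing vLLM APIs; the block hash is assumed to come from automatic prefix caching.

# Hypothetical interface sketch -- none of these classes exist in vLLM today.
from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVDatabase(ABC):
    """Stores KV cache tensors keyed by the prefix-caching block hash."""

    @abstractmethod
    def put(self, block_hash: int, kv_block: torch.Tensor) -> None:
        """Persist one KV block (e.g. to CPU memory, disk, or a remote store)."""

    @abstractmethod
    def get(self, block_hash: int) -> Optional[torch.Tensor]:
        """Return the KV block if present, else None."""


class Communicator(ABC):
    """Moves KV blocks between a vLLM instance and a database (or another instance)."""

    @abstractmethod
    def send_block(self, block_hash: int, kv_block: torch.Tensor) -> None:
        """Transfer one block from this vLLM instance to the destination."""

    @abstractmethod
    def recv_block(self, block_hash: int) -> torch.Tensor:
        """Fetch one block from the source into this instance's GPU memory."""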

This will be a large framework with a wide range of challenging (but fun!) questions, including but not limited to:

  • How to leverage infrastructures like NVLink to transfer KV cache faster?
  • How to properly pipeline the KV cache transfer?
  • How to make sure the blocks are not swapped out when the communicator is working?
  • Compress KV cache during transfer or not? If so, which compression algorithm? Who compresses the cache?

Feel free to post any thoughts on the design! Is it good? Can this abstraction achieve optimal performance in your use cases?

Feedback Period.

Several weeks

CC List.

@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU

Any Other Things.

No response

@KuntaiDu KuntaiDu added the RFC label Jun 14, 2024
@KuntaiDu
Collaborator Author

KuntaiDu commented Jun 15, 2024

After some discussion, it may be better for us to focus on disaggregated prefilling first; then it will be much easier to tell how we should make the high-level architecture changes.

For disaggregated prefilling, does the following workflow sound reasonable?

For an upcoming request:

  • We send the request to a vLLM instance to do prefilling (by setting the number of output tokens to 1).
  • Then we find a vLLM instance for decoding and reserve KV cache there (by sending the request to that decoding instance but keeping the request preempted).
  • When the prefilling vLLM instance finishes computing the KV block for one LLM layer, we transfer the KV block of this layer to the decoding vLLM instance.
  • After the prefilling instance finishes (we know it has finished when we receive the HTTP response from the prefilling instance), we stop preempting the request in the decoding instance.
  • The decoding vLLM instance then uses automatic prefix caching to retrieve the prefilled KV blocks and performs inference. (A rough sketch of the request routing is shown below.)
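
To make the routing concrete, a minimal proxy-level sketch, assuming two OpenAI-compatible vLLM servers on hypothetical ports 8100 (prefill) and 8200 (decode), a placeholder model name, and the KV transfer itself happening out of band between the two instances:

# Sketch only: shows just the request routing described above; the KV-cache
# streaming between the two instances is assumed to happen out of band.
import requests

PREFILL_URL = "http://localhost:8100/v1/completions"  # hypothetical prefill server
DECODE_URL = "http://localhost:8200/v1/completions"   # hypothetical decode server
MODEL = "meta-llama/Llama-2-7b-hf"                    # placeholder model name


def disaggregated_generate(prompt: str, max_tokens: int) -> str:
    # Step 1: run prefill only, by capping generation at a single token.
    prefill_req = {"model": MODEL, "prompt": prompt, "max_tokens": 1}
    requests.post(PREFILL_URL, json=prefill_req).raise_for_status()

    # Steps 2-4 (reserving blocks on the decode instance and streaming the KV
    # cache layer by layer) happen inside vLLM and are not shown here.

    # Step 5: send the same prompt to the decode instance, which reuses the
    # transferred KV blocks via automatic prefix caching.
    decode_req = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    resp = requests.post(DECODE_URL, json=decode_req)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]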

@leiwen83
Contributor

leiwen83 commented Jun 17, 2024

Sounds very interesting!

For the second use case, I have a question:

The user wants to query a fixed set of long documents (examples: software manuals, internal documents, etc.). In this case, GPU memory + CPU memory may not be enough to store the KV cache of all documents, and we may want to store the KV cache of these documents and move it to the GPU on demand.

It seems to leverage the prefix caching mechanism, which requires the document to be at the top of the prompt, with only the query part differing at the bottom, right? That way it could handle the case of long document pieces combined with many different queries, with the KV cache of the shared top part stored in CPU memory?

And it would be better to also take into consideration GPUs without NVLink, like the 4090...

For KV compression, I think KV cache quantization to 4/2 bits would make this whole subsystem more valuable.

@richardliaw
Collaborator

Would it make sense to first get some simple design on abstractions for handling the KV cache, before designing the transport?

For example, having something like:

input_state = engine.prefill(input)
save(input_state, file)
----
input_state = read(file)
engine = engine.insert_state(input_state)
engine.generate(...)

Would be a nice starting point.

Then later maybe it can be async/lazy so that we would pipeline the state automatically

@cadedaniel
Collaborator

I gave a comment offline, pasting it here:

The concept makes sense in vLLM but I am concerned we are starting with the infra first instead of the impactful feature or performance optimization. What usually happens is because the infra is built without a narrow use-case in mind, it is very difficult to prioritize design choices and infra features. Can we flip this on its head and instead build one of the user-impacting features/performance improvements, and work backwards from that to the infra features necessary?
My thoughts are that prefill disagg has really tight performance constraints for KV transfer. It would be a big waste if the eventual implementation couldn’t use this work because the performance requirements weren’t known ahead of time.

@AnikinNN

I have found one more use case for storing the KV cache somewhere. I suppose it would be nice to have this feature when working with agents, such as chain-of-thought pipelines. These have repeatable phases of generation followed by appending tool outputs. As of now, every time generation stops due to a tool invocation and the tool's output is appended to the prompt, the LLM is called again. We have a growing leading part of the prompt that stays the same within one chain-of-thought call.

@KuntaiDu
Collaborator Author

Sounds very interesting!

For the second use case, I have a question:

The user wants to query a fixed set of long documents (examples: software manuals, internal documents, etc.). In this case, GPU memory + CPU memory may not be enough to store the KV cache of all documents, and we may want to store the KV cache of these documents and move it to the GPU on demand.

It seems to leverage the prefix caching mechanism, which requires the document to be at the top of the prompt, with only the query part differing at the bottom, right? That way it could handle the case of long document pieces combined with many different queries, with the KV cache of the shared top part stored in CPU memory?

And it would be better to also take into consideration GPUs without NVLink, like the 4090...

For KV compression, I think KV cache quantization to 4/2 bits would make this whole subsystem more valuable.

In the long-document reuse case, sure, the CPU can be used as a layer of cache. But there are two scenarios where using the CPU as a KV cache is NOT efficient:

  • When CPU memory is not enough to store the KV caches of all documents; in that case we may need to load the KV cache from an external device. For example, the KV cache of a 200-page document is roughly 30 GB for a 7B model. With 1000 documents, the KV cache can be 30 TB and definitely will not fit into CPU memory. (A back-of-envelope calculation is sketched below the list.)
  • When requests with the SAME prefix are forwarded to DIFFERENT vLLM instances, which can be a common case when load balancing across multiple vLLM instances.
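
A back-of-envelope calculation behind the 30 GB figure, assuming a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dim 128, FP16) and roughly 300 tokens per page; exact numbers depend on the model config and dtype:

# Rough estimate of KV cache size for a long document.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_elem = 2  # FP16

# K and V per token, summed across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
# = 2 * 32 * 32 * 128 * 2 = 524,288 bytes (~0.5 MB per token)

tokens_per_doc = 200 * 300  # ~200 pages, ~300 tokens per page (assumption)
doc_kv_gb = kv_bytes_per_token * tokens_per_doc / 1e9   # ~31 GB per document
corpus_kv_tb = doc_kv_gb * 1000 / 1e3                   # ~31 TB for 1000 documents
print(f"{doc_kv_gb:.0f} GB per document, {corpus_kv_tb:.0f} TB for 1000 documents")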

For devices without NVLink, I agree with you, it would be nice if we can support them. But let's focus on making the KV transfer REALLY fast using NVLink first (which is a cool feature that TRT/TGI/LMDeploy do not have), so that we can gauge more interest from other developers.

For KV compression, there is a line of research that explores alternative opportunities besides simple quantization. Some pointers:
https://arxiv.org/abs/2306.14048 (token filtering)
https://arxiv.org/pdf/2310.07240 (leveraging similarity between consecutive tokens for compression)
So there are a lot of exciting opportunities beyond simple quantization.

@KuntaiDu
Collaborator Author

KuntaiDu commented Jun 19, 2024

Would it make sense to first get some simple design on abstractions for handling the KV cache, before designing the transport?

For example, having something like:

input_state = engine.prefill(input)
save(input_state, file)
----
input_state = read(file)
engine = engine.insert_state(input_state)
engine.generate(...)

Would be a nice starting point.

Then later maybe it can be async/lazy so that we would pipeline the state automatically

Agree!!! A nuance here is what the granularity of KV cache read/write should be: per vLLM block or per query. My current preference is per vLLM block, as the time when we need to read/save the KV cache is typically tied to the decisions of the block manager (e.g., we may need to read the KV cache when the block manager allocates a new block, or write the KV cache to disk when a block is swapped out of CPU memory by the block manager), so it is better to align the granularity with the block manager. (A rough sketch of what such per-block hooks could look like is below.)
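
A sketch of what per-block granularity could look like, expressed as hypothetical callbacks invoked by the block manager; none of these hooks exist in vLLM today, and the toy store below is only an illustration of the granularity choice:

# Hypothetical block-manager hooks -- illustration of per-block granularity only.
from typing import Dict, Optional

import torch


class KVBlockStore:
    """Toy external store for KV blocks, keyed by block hash."""

    def __init__(self) -> None:
        self._blocks: Dict[int, torch.Tensor] = {}

    def on_block_allocated(self, block_hash: int) -> Optional[torch.Tensor]:
        # Called when the block manager allocates a new block: if we already
        # hold this block externally, return it so the block can be filled
        # without recomputation.
        return self._blocks.get(block_hash)

    def on_block_evicted(self, block_hash: int, kv_block: torch.Tensor) -> None:
        # Called when the block manager swaps a block out of CPU memory:
        # persist it so it can be restored later.
        self._blocks[block_hash] = kv_block.to("cpu")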

@Jeffwan
Contributor

Jeffwan commented Jun 25, 2024

Great to see the proposal! We are running experiments that offload reusable KV contents to an external cache store. Happy to discuss more details.

@KuntaiDu KuntaiDu changed the title [RFC]: Implement KV cache transferring mechanism in vLLM [RFC]: Implement disaggregated prefilling via KV cache transfer Jun 30, 2024
@KuntaiDu
Collaborator Author

My current plan is to focus on implementing disaggregated prefilling using cross-vLLM-instance KV cache transfer, for two reasons:

  • Disaggregated prefilling is already widely adopted by industry (so there will be more people willing to contribute).
  • It helps us build a good abstraction for KV cache transfer (including reading, writing, and streaming the KV cache), which is useful for future research.

@KuntaiDu
Collaborator Author

KuntaiDu commented Jun 30, 2024

Base implementation:
4 processes: prefilling instance, decoding instance
For a new incoming request:

  • Pad the request so that the number of tokens is divisible by block_size (a padding sketch is shown after this list)
  • Send the incoming request to the prefilling instance with max_tokens=1
  • Wait for the prefilling instance to finish
  • Stream the KV cache blocks from the prefilling instance to the decoding instance
  • Send the incoming request to the decoding instance
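
A minimal sketch of the padding step; the choice of padding token and whether to pad on the left or the right are open details of this proposal, so the version below (right-padding with a caller-supplied pad token) is just one possible reading:

# Sketch of step 1: pad the token IDs so their length is a multiple of block_size.
def pad_to_block_size(token_ids: list, block_size: int, pad_token_id: int) -> list:
    remainder = len(token_ids) % block_size
    if remainder == 0:
        return token_ids
    # Right-padding shown here; the padding side and pad token are design
    # choices that both instances must agree on.
    num_pad = block_size - remainder
    return token_ids + [pad_token_id] * num_pad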

Foreseeable overheads (compared to an implementation):

  • Padding may take some time.
    • Can be reduced by adding an extra parameter to the vLLM engine (e.g., padding=True) and letting vLLM pad the input tokens by itself.
  • A slightly larger prefilling time due to the added padding tokens.
    • This overhead should be marginal in the disaggregated prefilling use case (long input lengths), so there is likely no need to reduce it.
  • KV cache streaming time.
    • Can be reduced by pipelining the KV cache transfer (layer-by-layer or token-by-token transfer).
  • The decoding instance needs to call the prefilling function again (though it will be much faster using the transferred KV cache).
    • Can be reduced, but not easily. We need to measure it to show that it is much smaller than TTFT.

My very first step: measure the overhead of calling the prefilling function again with the KV cache. (A rough measurement sketch is below.)
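
A minimal sketch of that measurement using vLLM's offline API with prefix caching enabled (model name and prompt are placeholders). The second generate() call replays the prefill with cached KV blocks, so comparing the two timings approximates the "prefill again with the KV cache" overhead:

# Sketch: compare prefill time with and without cached KV blocks.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
params = SamplingParams(max_tokens=1)  # prefill only, one output token
prompt = "<a long document goes here> " * 100  # placeholder long prompt

t0 = time.perf_counter()
llm.generate([prompt], params)      # cold prefill: computes the KV cache
t1 = time.perf_counter()
llm.generate([prompt], params)      # warm prefill: reuses cached KV blocks
t2 = time.perf_counter()

print(f"cold prefill: {t1 - t0:.3f}s, prefill-again with cached KV: {t2 - t1:.3f}s")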

@TopIdiot

TopIdiot commented Jul 9, 2024

Sounds great!
And I think a scheduler is needed to decide which two instances each request should be scheduled to.

@leo6022

leo6022 commented Jul 22, 2024

How will the KV cache transfer be implemented, via NCCL or RDMA?

@Playerrrrr

Is this still going?

@dhandhalyabhavik

Is this still going?

@Wh1isper

I noticed one paper that seems to implement KV cache migration: https://arxiv.org/abs/2406.03243

Their project: https://github.com/AlibabaPAI/llumnix

Sorry, I'm just getting into vLLM and only now seeing this issue. I'm curious how they did it if vLLM doesn't have the relevant interface support. Or do we already have a way to do this without implementing the feature inside vLLM?

@5symx

5symx commented Nov 26, 2024

I noticed one paper that seems to implement KV cache migration: https://arxiv.org/abs/2406.03243

Their project: https://github.com/AlibabaPAI/llumnix

Sorry, I'm just getting into vLLM and only now seeing this issue. I'm curious how they did it if vLLM doesn't have the relevant interface support. Or do we already have a way to do this without implementing the feature inside vLLM?

Yes, I think their project implements KV cache migration. But they are doing it across continuous decoding steps, not between prefill and decode or for future reuse. This means the overlap between KV migration and decoding computation no longer exists, because the source will not generate new tokens while the KV transfer happens for disaggregated prefilling.

The current interface for moving the KV cache copies memory between GPU and CPU using cudaMemcpyDeviceToHost, and copies memory within the same device using cudaMemcpyDeviceToDevice.
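
For illustration, the same two copies expressed with PyTorch, which dispatches to cudaMemcpyDeviceToHost / cudaMemcpyDeviceToDevice under the hood; the block shape and dtype here are arbitrary placeholders:

# Illustration only: block shape and dtype are arbitrary placeholders.
import torch

kv_block = torch.randn(2, 16, 32, 128, dtype=torch.float16, device="cuda:0")

# GPU -> CPU swap (cudaMemcpyDeviceToHost under the hood).
cpu_block = kv_block.to("cpu", non_blocking=True)

# Copy within the same device (cudaMemcpyDeviceToDevice under the hood).
dst_block = torch.empty_like(kv_block)
dst_block.copy_(kv_block)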
