
[RFC]: Implement disaggregated prefilling via KV cache transfer #5557

Closed as not planned
@KuntaiDu


Motivation.

There are a growing number of use cases where we need to transfer KV caches between vLLM instances, or store KV caches for future use. Some concrete use cases:

  • Disaggregated prefilling. In this case, the KV cache needs to be transferred from the prefilling instances to the decoding instances.
  • The user wants to query a fixed set of long documents (examples: software manuals, internal documents, etc.). In this case, the combined GPU and CPU memory may not be enough to hold the KV cache of all documents, so we may want to store the KV cache of these documents elsewhere and move it to the GPU on demand.

Proposed Change.

My current thought is to introduce two new abstractions: a communicator and a KV database. The workflow will be:

vllm <--> communicator <--> KV database

where

  • The communicator transfers data from src to dst, where each of src and dst can be either a KV block in vLLM or an entry in the database.
  • The KV database is a key-value store that uses the block hash (generated by automatic prefix caching) as the key and the corresponding KV cache tensor as the value.
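To make the two abstractions concrete, here is a minimal in-process sketch. Everything in it is hypothetical: `block_hashes`, `GPUBlock`, `DBEntry`, `KVDatabase`, and `Communicator` are illustrative names, not vLLM APIs. KV tensors are stood in for by `bytes`, and the GPU block pool is simulated with a dict. The chained hash (each block's hash covers its tokens plus the previous block's hash) is written in the spirit of automatic prefix caching, but it is an assumption and not vLLM's actual hashing scheme.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for illustration)


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per *full* block of tokens.

    Each hash covers the block's tokens plus the previous block's hash, so two
    sequences get the same hash for block i only if their prefixes up to and
    including block i are identical.
    """
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass(frozen=True)
class GPUBlock:
    block_id: int  # index into a (simulated) vLLM KV block pool


@dataclass(frozen=True)
class DBEntry:
    key: str  # block hash used as the database key


Endpoint = Union[GPUBlock, DBEntry]


class KVDatabase:
    """Hash -> KV cache payload; a dict stands in for real storage."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)


class Communicator:
    """Moves KV data between any two endpoints (vLLM block or DB entry)."""

    def __init__(self, block_pool: Dict[int, bytes], db: KVDatabase) -> None:
        self.block_pool = block_pool
        self.db = db

    def _read(self, ep: Endpoint) -> bytes:
        if isinstance(ep, GPUBlock):
            return self.block_pool[ep.block_id]
        value = self.db.get(ep.key)
        if value is None:
            raise KeyError(ep.key)
        return value

    def _write(self, ep: Endpoint, value: bytes) -> None:
        if isinstance(ep, GPUBlock):
            self.block_pool[ep.block_id] = value
        else:
            self.db.put(ep.key, value)

    def transfer(self, src: Endpoint, dst: Endpoint) -> None:
        self._write(dst, self._read(src))
```

In the disaggregated-prefilling case, a prefill instance would call `transfer(GPUBlock(i), DBEntry(h))` for each computed block, and a decode instance would recompute the same hashes from the prompt and call `transfer(DBEntry(h), GPUBlock(j))` to load them. Keying on the hash is what lets the decode side find the blocks without any other coordination.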

This will be a large framework with a wide range of challenging (but fun!) open questions, including but not limited to:

  • How to leverage infrastructures like NVLink to transfer KV cache faster?
  • How to properly pipeline the KV cache transfer?
  • How to make sure the blocks are not swapped out when the communicator is working?
  • Should the KV cache be compressed during transfer? If so, with which compression algorithm, and who compresses it?
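As one possible starting point for the pipelining and swap-out questions (not a proposed answer), the sketch below combines two common techniques. First, it pins every block involved in a transfer with a reference count, so a hypothetical eviction policy can skip pinned blocks. Second, it overlaps the transfer of layer i's KV with the computation of layer i+1, using a bounded queue and a background sender thread. `PinnedBlockTracker`, `evictable`, and `pipelined_prefill` are hypothetical names, and `compute_layer` and `send_layer` are caller-supplied stand-ins for the model forward pass and the actual transport (for example, an NVLink-backed copy).

```python
import queue
import threading
from typing import Any, Callable, Dict, Iterable, List


class PinnedBlockTracker:
    """Reference counts for blocks that an in-flight transfer is using."""

    def __init__(self) -> None:
        self._refs: Dict[int, int] = {}
        self._lock = threading.Lock()

    def pin(self, block_ids: Iterable[int]) -> None:
        with self._lock:
            for b in block_ids:
                self._refs[b] = self._refs.get(b, 0) + 1

    def unpin(self, block_ids: Iterable[int]) -> None:
        with self._lock:
            for b in block_ids:
                self._refs[b] -= 1
                if self._refs[b] == 0:
                    del self._refs[b]

    def is_pinned(self, block_id: int) -> bool:
        with self._lock:
            return block_id in self._refs


def evictable(candidates: List[int], tracker: PinnedBlockTracker) -> List[int]:
    """Hypothetical eviction filter: never swap out a pinned block."""
    return [b for b in candidates if not tracker.is_pinned(b)]


def pipelined_prefill(
    num_layers: int,
    block_ids: List[int],
    compute_layer: Callable[[int], Any],
    send_layer: Callable[[int, Any], None],
    tracker: PinnedBlockTracker,
    max_in_flight: int = 2,
) -> None:
    """Compute layers in order while a background thread sends finished layers.

    The bounded queue makes computation stall if sending falls more than
    `max_in_flight` layers behind. Blocks stay pinned until every layer is sent.
    """
    q: "queue.Queue" = queue.Queue(maxsize=max_in_flight)
    errors: List[BaseException] = []

    def sender() -> None:
        while True:
            item = q.get()
            if item is None:  # sentinel: no more layers
                return
            layer, kv = item
            try:
                send_layer(layer, kv)
            except Exception as e:  # keep draining so the producer never blocks forever
                errors.append(e)

    tracker.pin(block_ids)
    t = threading.Thread(target=sender)
    t.start()
    try:
        for layer in range(num_layers):
            q.put((layer, compute_layer(layer)))
    finally:
        q.put(None)
        t.join()
        tracker.unpin(block_ids)
    if errors:
        raise errors[0]
```

Layer-wise granularity is only one choice. Per-block or per-chunk pipelining would trade more transfer calls for earlier overlap, and where compression (if any) happens would add another stage to this pipeline.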

Feel free to share any thoughts on the design! Does it look reasonable? Can this abstraction achieve optimal performance for your use cases?

Feedback Period.

Several weeks

CC List.

@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU

Any Other Things.

No response

Labels: RFC, stale (over 90 days of inactivity)
