[RFC]: Implement disaggregated prefilling via KV cache transfer #5557
Comments
After discussing, maybe it is better for us to focus on disaggregated prefilling first; then it will be much easier to tell how we should make the high-level architecture change. For disaggregated prefilling, does the following workflow sound good or not? For an upcoming request:
Sounds very interesting! For the second usage, I have a question:
It seems to leverage the prefix-caching mechanism, which requires the document to be at the top of the prompt and only the query part to differ at the bottom, right? So it could handle the case of long document pieces combined with many different queries, and the KV cache of the shared top part would be stored in CPU memory? It would also be good to take into account GPUs without NVLink, like the 4090. For KV compression, I think KV cache quantization to 4/2 bits would make this whole subsystem more valuable.
Would it make sense to first settle on some simple abstractions for handling the KV cache, before designing the transport? For example, having something like:
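(A minimal sketch of what such an abstraction could look like; the class and method names below are hypothetical, not an existing vLLM API.)

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVCacheStore(ABC):
    """Hypothetical interface for persisting and fetching KV cache outside the engine."""

    @abstractmethod
    def put(self, block_hash: int, kv: torch.Tensor) -> None:
        """Store the KV tensors of one block under a content hash."""

    @abstractmethod
    def get(self, block_hash: int) -> Optional[torch.Tensor]:
        """Return the stored KV tensors, or None on a miss."""


class CpuKVCacheStore(KVCacheStore):
    """Trivial CPU-memory backend, just to make the interface concrete."""

    def __init__(self) -> None:
        self._store: dict[int, torch.Tensor] = {}

    def put(self, block_hash: int, kv: torch.Tensor) -> None:
        self._store[block_hash] = kv.detach().to("cpu")

    def get(self, block_hash: int) -> Optional[torch.Tensor]:
        return self._store.get(block_hash)
```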
That would be a nice starting point. Then later maybe it can be made async/lazy, so that we can pipeline the state transfer automatically.
I gave a comment offline, pasting it here:
I have found one more use case for storing the KV cache somewhere. I suppose it would be nice to have this feature when working with agents, such as chain-of-thought pipelines, which have repeatable phases of generating text and appending tool outputs. As of now, every time generation stops due to a tool invocation, the tool's outputs are appended to the prompt and the LLM is called again. We end up with a growing leading part of the prompt that stays the same within one chain-of-thought call.
In the long-document reuse case, sure, the CPU can be used as a layer of cache. But there are two scenarios where using the CPU as a KV cache is NOT efficient:
For devices without NVLink, I agree with you, it would be nice if we could support them. But let's focus on making the KV transfer REALLY fast using NVLink first (which is a cool feature that TRT/TGI/LMDeploy do not have), so that we can gauge more interest from other developers. For KV compression, there is a line of research that explores alternative opportunities besides simple quantization. Some pointers:
Agree!!! A nuance here is what the granularity of KV cache reads/writes should be: per vLLM block or per query. My current preference is per vLLM block, as the moments when we need to read/save KV cache are typically tied to the block manager's decisions (e.g. we may need to read KV cache when the block manager allocates a new block, or write KV cache to disk when it is swapped out of CPU by the block manager), so it is better to align the granularity with the block manager.
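(A rough illustration of the per-vLLM-block granularity; the hook names and the store interface are hypothetical, not an existing vLLM API.)

```python
import torch


class BlockGranularKVConnector:
    """Hypothetical hooks invoked at the block manager's own decision points."""

    def __init__(self, store) -> None:
        self.store = store  # e.g. a KVCacheStore as sketched above

    def on_block_allocated(self, block_hash: int, kv_block: torch.Tensor) -> bool:
        """When the block manager allocates a block, try to fill it from the store."""
        cached = self.store.get(block_hash)
        if cached is None:
            return False  # miss: prefill has to compute this block
        kv_block.copy_(cached)
        return True

    def on_block_swapped_out(self, block_hash: int, kv_block: torch.Tensor) -> None:
        """When the block manager swaps a block out, persist it for later reuse."""
        self.store.put(block_hash, kv_block)
```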
Great to see the proposal! We are doing experiments to offload reusable KV contents to an external cache store. Happy to discuss more details.
My current plan is to focus on implementing disaggregated prefilling using cross-vLLM-instance KV cache transfer. Two reasons:
Base implementation:
Foreseeable overheads (compared to an implementation):
My very first step: measure the overhead of calling the prefill function again with the KV cache already present.
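(A rough way to get that number with the public vLLM API; the model name is a placeholder, and prefix caching is used here only as a stand-in for "the KV cache is already in place".)

```python
# Rough proxy for "run prefill again when the KV cache is already present":
# with prefix caching enabled, the second call on the same prompt reuses the
# cached blocks, so its latency approximates the leftover prefill overhead.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model
prompt = "Summarize the following document: " + "lorem ipsum " * 500
params = SamplingParams(max_tokens=1)  # one token, so timing is dominated by prefill

start = time.perf_counter()
llm.generate([prompt], params)
cold = time.perf_counter() - start  # full prefill

start = time.perf_counter()
llm.generate([prompt], params)      # KV blocks for the shared prefix are already cached
warm = time.perf_counter() - start  # prefill with cached KV

print(f"full prefill: {cold:.3f}s, prefill with cached KV: {warm:.3f}s")
```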
Sounds great!
How should the KV cache transfer be implemented: NCCL or RDMA?
Is this still going?
I noticed a paper that seems to implement KV cache migration: https://arxiv.org/abs/2406.03243. Their project: https://github.com/AlibabaPAI/llumnix. Sorry, I'm just getting into vLLM and only now seeing this issue. I'm curious how they did it if vLLM doesn't have the relevant interface support. Or do we already have a way to do this without implementing the feature inside vLLM?
Yes, I think their project implements KV cache migration, but they do it across continuous decoding steps, not between prefill and decode and not for future reuse. This means the overlap between KV migration and decoding computation no longer exists, because the source will not generate new tokens while the KV transfer happens for disaggregated prefilling. The current interface for moving KV cache copies memory between GPU and CPU using cudaMemcpyDeviceToHost, and copies memory within the same device using cudaMemcpyDeviceToDevice.
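(To make the NCCL option mentioned above concrete: an illustration only, not existing vLLM code, of what a point-to-point transfer of one KV block between a prefill process and a decode process could look like with torch.distributed; the tensor shape and rank assignment are assumptions.)

```python
# Illustration only: a NCCL point-to-point transfer of one KV block between a
# prefill process (rank 0) and a decode process (rank 1). Assumes both processes
# are launched with torchrun so the usual rendezvous environment variables are set.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Shape is illustrative: (2 for K/V, block_size, num_kv_heads, head_dim).
block = torch.empty(2, 16, 8, 128, dtype=torch.float16, device="cuda")

if rank == 0:        # prefill instance: the block already holds computed KV
    block.normal_()  # stand-in for real prefill output
    dist.send(block, dst=1)
else:                # decode instance: receive the block instead of recomputing it
    dist.recv(block, src=0)

dist.destroy_process_group()
```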
Motivation.
There are more and more use cases where we need to transfer KV caches between vLLM instances or store KV caches for future use. Some concrete use cases:
Proposed Change.
My current thought is to introduce two new abstractions: communicator and KV database. The workflow will be
where the communicator transfers KV caches from src to dst, and both src and dst can be a KV block in vLLM or an entry in the KV database.

This will be a huge framework, with a wide range of challenging (but fun!) questions inside, including but not limited to:
Feel free to post any thoughts on the design! Is it good? Is this abstraction able to achieve the optimal performance in your use cases?
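For concreteness, here is a minimal sketch of what these two abstractions could look like; all names are hypothetical rather than an existing vLLM interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Union

import torch


@dataclass
class KVBlockRef:
    """A KV block living inside a running vLLM instance."""
    instance_id: str
    block_id: int


@dataclass
class KVDatabaseEntry:
    """An entry in the external KV database, addressed by key."""
    key: str


# Either end of a transfer: a vLLM block or a database entry.
KVLocation = Union[KVBlockRef, KVDatabaseEntry]


class KVDatabase(ABC):
    """Stores KV caches for later reuse."""

    @abstractmethod
    def put(self, key: str, kv: torch.Tensor) -> None: ...

    @abstractmethod
    def get(self, key: str) -> torch.Tensor: ...


class Communicator(ABC):
    """Moves a KV cache from src to dst."""

    @abstractmethod
    def transfer(self, src: KVLocation, dst: KVLocation) -> None: ...
```

A NCCL- or RDMA-backed communicator and a CPU/disk-backed KV database would then just be concrete implementations of these same two interfaces.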
Feedback Period.
Several weeks
CC List.
@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU
Any Other Things.
No response