[P/D] Introduce Mooncake Transfer Engine as kv_connector #24718
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces the Mooncake Transfer Engine as a new `kv_connector`. The implementation provides a solid framework for KV cache transfer using Mooncake. My review focuses on improving synchronization, resource management, and configurability in the new `MooncakeConnectorWorker`.

I've identified a few high-severity issues:
- An inefficient busy-wait loop for synchronization that should be replaced with a condition variable.
- Inefficient ZMQ context management that can lead to performance degradation.
- Hardcoded configuration values that limit the connector's flexibility.
The `initialize` call for the `TransferEngine` uses the hardcoded values `"P2PHANDSHAKE"` and `"rdma"`. This reduces the flexibility of the connector and makes it difficult to adapt to different network configurations or Mooncake setups without modifying the code. These values should be made configurable, for instance by reading them from `vllm_config.kv_transfer_config.kv_connector_extra_config`.
Current code:

```python
ret_value = self.engine.initialize(self.hostname, "P2PHANDSHAKE", "rdma", "")
```

Suggested change:

```python
extra_config = self.vllm_config.kv_transfer_config.kv_connector_extra_config
mode = extra_config.get("mooncake_init_mode", "P2PHANDSHAKE")
protocol = extra_config.get("mooncake_protocol", "rdma")
device = extra_config.get("mooncake_device", "")
ret_value = self.engine.initialize(self.hostname, mode, protocol, device)
```
The `while` loop with `time.sleep(0.1)` is a form of busy-waiting for `start_load_kv()` to provide the `local_block_ids`. This is inefficient, can introduce unnecessary latency, and is not guaranteed to be correct under all timing conditions. A more robust and efficient approach is to use a `threading.Condition` for synchronization.

This would involve:
- Modifying `SendReqMeta` to use a `threading.Condition`.
- In `send_kv_to_decode`, waiting on the condition variable instead of sleeping.
- In `start_load_kv`, notifying waiters on the condition variable after populating the data (see the sketch below).
The `zmq_ctx` context manager creates a new `zmq.Context` on every call. This is inefficient, as ZMQ contexts are designed to be created once per process and shared among all sockets in that process. In `receive_kv`, a new thread is spawned for each request, and each thread calls `zmq_ctx`, leading to context creation and destruction for every KV transfer. This can be a significant performance bottleneck.

A better approach is to:
- Create a single `zmq.Context` in `MooncakeConnectorWorker.__init__`.
- Use this shared context to create sockets directly in `_mooncake_handshake_listener` and `receive_kv` (e.g., `with self.zmq_context.socket(zmq.REQ) as sock:`).
- Terminate the context in a `shutdown` method for the worker.
- Remove the now-unneeded `zmq_ctx` function.

A sketch of the shared-context structure follows.
Force-pushed 72775a2 to 8af784a (compare)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Force-pushed 8af784a to 9dcc73d (compare)
Hey @dtcccc, thanks a lot for contributing to vLLM!
The Mooncake engine was actually something I already had high on my todo list, so I'm happy to see it here.
My plan, though, was to use it as a Nixl backend plugin (https://github.com/ai-dynamo/nixl/blob/main/src/plugins/mooncake/README.md) so that we wouldn't have to re-implement separate connector logic for it.
Have you already considered this option?
Thanks for your reply @NickLucche. I am a beginner and not very familiar with Nixl, and I found it quite hard to use. I prefer simple deployment, so I install it with `pip install`. The new version (>0.3.0) gives me a UCX ERROR on startup, while the old version throws a `RuntimeError: BAD_STATUS` at the decode node during execution. I am still actively looking for a solution, but that topic is unrelated to this PR. On the other hand, I want to try using the Mooncake transfer engine directly to improve performance. I studied the NixlConnector code closely and referenced it, and I'm exploring ways to achieve better performance by removing one layer of encapsulation and eliminating one handshake (P sends KV caches directly to D). I agree that this PR introduces some redundancy into the existing code. From a maintainability perspective, would it be possible to abstract the common parts at the vLLM code level and merge them, while still retaining the ability to integrate directly with Mooncake?
Hi, would you mind submitting this PR to the Mooncake repo first? We could test it via an external connector for now.
Sure. Which branch and path should I submit it to? Could you provide a guideline?
Copy the Python file to https://github.com/kvcache-ai/Mooncake/tree/main/mooncake-wheel/mooncake, submit it to main, and add a basic README showing how to use it.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Introduce the Mooncake Transfer Engine as a new connector named `mooncake_connector` for vLLM v1. This is a basic framework and will support more features in the future.
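For illustration, here is a hedged sketch of how such a connector might be enabled through vLLM's `KVTransferConfig`. The connector name and the extra-config keys echo the review suggestion above and are assumptions, not values confirmed against the PR diff:

```python
from vllm.config import KVTransferConfig

# Hypothetical configuration; "MooncakeConnector" and the extra-config keys
# are assumed names mirroring the review discussion above.
kv_cfg = KVTransferConfig(
    kv_connector="MooncakeConnector",
    kv_role="kv_producer",  # use "kv_consumer" on the decode node
    kv_connector_extra_config={
        "mooncake_init_mode": "P2PHANDSHAKE",
        "mooncake_protocol": "rdma",
    },
)
```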
Test Plan
Running Qwen/Qwen2.5-7B-Instruct on 2 A10 servers with 1P1D, using `nixl_integration/toy_proxy_server.py` as the load balancer. Running a benchmark on the ShareGPT dataset with 100 num-prompts and default parameters, and comparing against P2pNcclConnector in PUT_ASYNC mode.

Note: I met some errors when running NixlConnector, so I only compared with P2pNcclConnector.
Test Result
Mooncake Connector: (benchmark output not reproduced here)
P2pNcclConnector: (benchmark output not reproduced here)
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update: `supported_models.md` and `examples` for a new model.