
Conversation


@dtcccc commented Sep 12, 2025

Purpose

Introduce the Mooncake Transfer Engine as a new KV connector, mooncake_connector, for vLLM v1.
This is a basic framework; more features will be supported in the future.

Test Plan

Run Qwen/Qwen2.5-7B-Instruct on two A10 servers in a 1P1D (one prefill, one decode) setup, using nixl_integration/toy_proxy_server.py as the load balancer.
Run the ShareGPT benchmark with 100 num-prompts and otherwise default parameters, and compare against P2pNcclConnector in PUT_ASYNC mode.
Note: I hit errors when running NixlConnector, so the comparison covers P2pNcclConnector only.

Test Result

Mooncake Connector

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  30.93     
Total input tokens:                      23260     
Total generated tokens:                  21350     
Request throughput (req/s):              3.23      
Output token throughput (tok/s):         690.19    
Total Token throughput (tok/s):          1442.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          2692.51   
Median TTFT (ms):                        2662.22   
P99 TTFT (ms):                           4799.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.64     
Median TPOT (ms):                        37.76     
P99 TPOT (ms):                           42.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.26     
Median ITL (ms):                         37.34     
P99 ITL (ms):                            43.69     
==================================================

P2pNcclConnector

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  31.97     
Total input tokens:                      23260     
Total generated tokens:                  21881     
Request throughput (req/s):              3.13      
Output token throughput (tok/s):         684.44    
Total Token throughput (tok/s):          1412.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          2934.43   
Median TTFT (ms):                        3148.52   
P99 TTFT (ms):                           5053.23   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.22     
Median TPOT (ms):                        38.10     
P99 TPOT (ms):                           43.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.47     
Median ITL (ms):                         37.59     
P99 ITL (ms):                            46.06     
==================================================


@dtcccc requested a review from NickLucche as a code owner, September 12, 2025 04:50

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist bot left a comment


Code Review

This pull request introduces the Mooncake Transfer Engine as a new kv_connector. The implementation provides a solid framework for KV cache transfer using Mooncake. My review focuses on improving synchronization, resource management, and configurability in the new MooncakeConnectorWorker.

I've identified a few high-severity issues:

  • An inefficient busy-wait loop for synchronization that should be replaced with a condition variable.
  • Inefficient ZMQ context management that can lead to performance degradation.
  • Hardcoded configuration values that limit the connector's flexibility.



Severity: high

The initialize call for the TransferEngine uses hardcoded values "P2PHANDSHAKE" and "rdma". This reduces the flexibility of the connector and makes it difficult to adapt to different network configurations or Mooncake setups without modifying the code. These values should be made configurable, for instance, by reading them from vllm_config.kv_transfer_config.kv_connector_extra_config.

Suggested change
- ret_value = self.engine.initialize(self.hostname, "P2PHANDSHAKE", "rdma", "")
+ extra_config = self.vllm_config.kv_transfer_config.kv_connector_extra_config
+ mode = extra_config.get("mooncake_init_mode", "P2PHANDSHAKE")
+ protocol = extra_config.get("mooncake_protocol", "rdma")
+ device = extra_config.get("mooncake_device", "")
+ ret_value = self.engine.initialize(self.hostname, mode, protocol, device)
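
Such keys could then be supplied through vLLM's KV-transfer configuration. A minimal sketch, assuming the hypothetical mooncake_* keys from the suggestion above; KVTransferConfig and its kv_connector_extra_config field are the ones this review already references:

from vllm.config import KVTransferConfig

# Hypothetical: route Mooncake knobs through kv_connector_extra_config.
# The mooncake_* key names come from the suggestion above and are not
# part of any released API.
kv_cfg = KVTransferConfig(
    kv_connector="MooncakeConnector",
    kv_role="kv_producer",
    kv_connector_extra_config={
        "mooncake_init_mode": "P2PHANDSHAKE",
        "mooncake_protocol": "rdma",
        "mooncake_device": "",
    },
)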

Comment on lines 427 to 439


Severity: high

The while loop with time.sleep(0.1) is a form of busy-waiting for start_load_kv() to provide the local_block_ids. This is inefficient, can introduce unnecessary latency, and is not guaranteed to be correct under all timing conditions. A more robust and efficient approach is to use a threading.Condition for synchronization.

This would involve (a minimal sketch follows the list):

  1. Modifying SendReqMeta to use a threading.Condition.
  2. In send_kv_to_decode, waiting on the condition variable instead of sleeping.
  3. In start_load_kv, notifying waiters on the condition variable after populating the data.
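
A minimal Python sketch of that pattern, assuming hypothetical names (SendReqMeta, send_kv_to_decode, start_load_kv, local_block_ids) taken from this comment rather than the exact code in the PR:

import threading

class SendReqMeta:
    def __init__(self):
        # The condition variable guards access to local_block_ids.
        self.cond = threading.Condition()
        self.local_block_ids = None  # populated by start_load_kv()

def send_kv_to_decode(meta: SendReqMeta):
    # Wait until start_load_kv() has published the ids; wait() releases
    # the underlying lock while blocked, so there is no busy-waiting.
    with meta.cond:
        while meta.local_block_ids is None:
            meta.cond.wait()
        block_ids = meta.local_block_ids
    # ... proceed with the transfer using block_ids ...

def start_load_kv(meta: SendReqMeta, block_ids):
    # Publish the ids, then wake any threads blocked in send_kv_to_decode().
    with meta.cond:
        meta.local_block_ids = block_ids
        meta.cond.notify_all()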

Comment on lines +614 to +643


Severity: high

The zmq_ctx context manager creates a new zmq.Context on every call. This is inefficient, as ZMQ contexts are designed to be created once per process and shared among all sockets in that process. In receive_kv, a new thread is spawned for each request, and each thread calls zmq_ctx, leading to context creation and destruction for every KV transfer. This can be a significant performance bottleneck.

A better approach (sketched after the list below) is to:

  1. Create a single zmq.Context in MooncakeConnectorWorker.__init__.
  2. Use this shared context to create sockets directly in _mooncake_handshake_listener and receive_kv (e.g., with self.zmq_context.socket(zmq.REQ) as sock:).
  3. Terminate the context in a shutdown method for the worker.
  4. The zmq_ctx function can then be removed.
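
A minimal sketch of that structure, assuming the method names from this comment (receive_kv, plus a shutdown hook) and pyzmq's standard API; the request payload is hypothetical:

import zmq

class MooncakeConnectorWorker:
    def __init__(self):
        # One ZMQ context per process, shared by all sockets.
        self.zmq_context = zmq.Context()

    def receive_kv(self, remote_addr: str) -> bytes:
        # Per-request threads create sockets from the shared context;
        # sockets are cheap to create, contexts are not.
        with self.zmq_context.socket(zmq.REQ) as sock:
            sock.connect(remote_addr)
            sock.send(b"kv_request")  # hypothetical handshake payload
            return sock.recv()

    def shutdown(self):
        # Terminate the shared context exactly once, on worker teardown.
        self.zmq_context.term()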

@dtcccc force-pushed the dtcccc/mooncake_connector branch from 72775a2 to 8af784a, September 12, 2025 06:37
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
@dtcccc force-pushed the dtcccc/mooncake_connector branch from 8af784a to 9dcc73d, September 12, 2025 07:00

@NickLucche left a comment


Hey @dtcccc , thanks a lot for contributing to vLLM!
Mooncake engine was actually something I already had high on the todo list, happy to see it here.
My plan, though, was to use it as a Nixl backend plugin (https://github.com/ai-dynamo/nixl/blob/main/src/plugins/mooncake/README.md) so that we wouldn't have to re-implement separate connector logic for it.

Have you already considered this option?


dtcccc commented Sep 12, 2025


Thanks for your reply @NickLucche

I am a beginner and not very familiar with nixl, and I found it quite hard to use. I prefer simple deployment, so I install via pip. The new version (>0.3.0) gives me a UCX ERROR at startup, while the old version throws a RuntimeError: BAD_STATUS on the decode node during execution. I am actively looking for a solution, but that issue is unrelated to this PR.

On the other hand, I want to try using the Mooncake Transfer Engine directly to improve performance. I have studied the NixlConnector code closely and used it as a reference, and I'm exploring whether removing one layer of encapsulation and eliminating one handshake (the prefill node sends KV caches directly to the decode node) yields better performance.

I agree that this PR duplicates some existing code. From a maintainability perspective, would it be possible to abstract the common parts at the vLLM level and merge them, while still retaining the ability to integrate with Mooncake directly?


xiaguan commented Sep 15, 2025

Hi, would you mind submitting this PR to the Mooncake repo first? We could test it via an external connector for now.


dtcccc commented Sep 16, 2025


Sure. Which branch and path should I submit to? Could you provide a guideline?


xiaguan commented Sep 16, 2025

Copy the Python file to https://github.com/kvcache-ai/Mooncake/tree/main/mooncake-wheel/mooncake, submit it to main, and add a basic README showing how to use it.


mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @dtcccc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
