
Conversation


@dtcccc commented Sep 12, 2025

Purpose

Introduce the Mooncake Transfer Engine as a new KV connector, mooncake_connector, for vLLM v1.
This is a basic framework; more features will be supported in the future.

Test Plan

Run Qwen/Qwen2.5-7B-Instruct on two A10 servers in a 1P1D (one prefill, one decode) setup, using nixl_integration/toy_proxy_server.py as the load balancer.
Run the ShareGPT benchmark with 100 num-prompts and otherwise default parameters, and compare against P2pNcclConnector in PUT_ASYNC mode.
Note: I hit errors when running NixlConnector, so the comparison covers P2pNcclConnector only.

Test Result

Mooncake Connector

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  30.93     
Total input tokens:                      23260     
Total generated tokens:                  21350     
Request throughput (req/s):              3.23      
Output token throughput (tok/s):         690.19    
Total Token throughput (tok/s):          1442.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          2692.51   
Median TTFT (ms):                        2662.22   
P99 TTFT (ms):                           4799.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.64     
Median TPOT (ms):                        37.76     
P99 TPOT (ms):                           42.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.26     
Median ITL (ms):                         37.34     
P99 ITL (ms):                            43.69     
==================================================

P2pNcclConnector

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  31.97     
Total input tokens:                      23260     
Total generated tokens:                  21881     
Request throughput (req/s):              3.13      
Output token throughput (tok/s):         684.44    
Total Token throughput (tok/s):          1412.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          2934.43   
Median TTFT (ms):                        3148.52   
P99 TTFT (ms):                           5053.23   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.22     
Median TPOT (ms):                        38.10     
P99 TPOT (ms):                           43.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.47     
Median ITL (ms):                         37.59     
P99 ITL (ms):                            46.06     
==================================================


@dtcccc requested a review from NickLucche as a code owner, September 12, 2025 04:50

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist bot left a comment


Code Review

This pull request introduces the Mooncake Transfer Engine as a new kv_connector. The implementation provides a solid framework for KV cache transfer using Mooncake. My review focuses on improving synchronization, resource management, and configurability in the new MooncakeConnectorWorker.

I've identified a few high-severity issues:

  • An inefficient busy-wait loop for synchronization that should be replaced with a condition variable.
  • Inefficient ZMQ context management that can lead to performance degradation.
  • Hardcoded configuration values that limit the connector's flexibility.



Severity: high

The initialize call for the TransferEngine uses hardcoded values "P2PHANDSHAKE" and "rdma". This reduces the flexibility of the connector and makes it difficult to adapt to different network configurations or Mooncake setups without modifying the code. These values should be made configurable, for instance, by reading them from vllm_config.kv_transfer_config.kv_connector_extra_config.

Suggested change
- ret_value = self.engine.initialize(self.hostname, "P2PHANDSHAKE", "rdma", "")
+ extra_config = self.vllm_config.kv_transfer_config.kv_connector_extra_config
+ mode = extra_config.get("mooncake_init_mode", "P2PHANDSHAKE")
+ protocol = extra_config.get("mooncake_protocol", "rdma")
+ device = extra_config.get("mooncake_device", "")
+ ret_value = self.engine.initialize(self.hostname, mode, protocol, device)
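
Such keys could then be supplied through vLLM's KV-transfer configuration. A minimal sketch, assuming the hypothetical mooncake_* keys from the suggestion above; KVTransferConfig and its kv_connector_extra_config field are the ones this review already references:

from vllm.config import KVTransferConfig

# Hypothetical: route Mooncake knobs through kv_connector_extra_config.
# The mooncake_* key names come from the suggestion above and are not
# part of any released API.
kv_cfg = KVTransferConfig(
    kv_connector="MooncakeConnector",
    kv_role="kv_producer",
    kv_connector_extra_config={
        "mooncake_init_mode": "P2PHANDSHAKE",
        "mooncake_protocol": "rdma",
        "mooncake_device": "",
    },
)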

Comment on lines 427 to 439


Severity: high

The while loop with time.sleep(0.1) is a form of busy-waiting for start_load_kv() to provide the local_block_ids. This is inefficient, can introduce unnecessary latency, and is not guaranteed to be correct under all timing conditions. A more robust and efficient approach is to use a threading.Condition for synchronization.

This would involve (a minimal sketch follows the list):

  1. Modifying SendReqMeta to use a threading.Condition.
  2. In send_kv_to_decode, waiting on the condition variable instead of sleeping.
  3. In start_load_kv, notifying waiters on the condition variable after populating the data.
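
A minimal Python sketch of that pattern, assuming hypothetical names (SendReqMeta, send_kv_to_decode, start_load_kv, local_block_ids) taken from this comment rather than the exact code in the PR:

import threading

class SendReqMeta:
    def __init__(self):
        # The condition variable guards access to local_block_ids.
        self.cond = threading.Condition()
        self.local_block_ids = None  # populated by start_load_kv()

def send_kv_to_decode(meta: SendReqMeta):
    # Wait until start_load_kv() has published the ids; wait() releases
    # the underlying lock while blocked, so there is no busy-waiting.
    with meta.cond:
        while meta.local_block_ids is None:
            meta.cond.wait()
        block_ids = meta.local_block_ids
    # ... proceed with the transfer using block_ids ...

def start_load_kv(meta: SendReqMeta, block_ids):
    # Publish the ids, then wake any threads blocked in send_kv_to_decode().
    with meta.cond:
        meta.local_block_ids = block_ids
        meta.cond.notify_all()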

Comment on lines +614 to +643


Severity: high

The zmq_ctx context manager creates a new zmq.Context on every call. This is inefficient, as ZMQ contexts are designed to be created once per process and shared among all sockets in that process. In receive_kv, a new thread is spawned for each request, and each thread calls zmq_ctx, leading to context creation and destruction for every KV transfer. This can be a significant performance bottleneck.

A better approach (sketched after the list below) is to:

  1. Create a single zmq.Context in MooncakeConnectorWorker.__init__.
  2. Use this shared context to create sockets directly in _mooncake_handshake_listener and receive_kv (e.g., with self.zmq_context.socket(zmq.REQ) as sock:).
  3. Terminate the context in a shutdown method for the worker.
  4. The zmq_ctx function can then be removed.
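
A minimal sketch of that structure, assuming the method names from this comment (receive_kv, plus a shutdown hook) and pyzmq's standard API; the request payload is hypothetical:

import zmq

class MooncakeConnectorWorker:
    def __init__(self):
        # One ZMQ context per process, shared by all sockets.
        self.zmq_context = zmq.Context()

    def receive_kv(self, remote_addr: str) -> bytes:
        # Per-request threads create sockets from the shared context;
        # sockets are cheap to create, contexts are not.
        with self.zmq_context.socket(zmq.REQ) as sock:
            sock.connect(remote_addr)
            sock.send(b"kv_request")  # hypothetical handshake payload
            return sock.recv()

    def shutdown(self):
        # Terminate the shared context exactly once, on worker teardown.
        self.zmq_context.term()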

@dtcccc force-pushed the dtcccc/mooncake_connector branch from 72775a2 to 8af784a, September 12, 2025 06:37
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
@dtcccc force-pushed the dtcccc/mooncake_connector branch from 8af784a to 9dcc73d, September 12, 2025 07:00

@NickLucche left a comment


Hey @dtcccc , thanks a lot for contributing to vLLM!
Mooncake engine was actually something I already had high on the todo list, happy to see it here.
My plan, though, was to use it as a Nixl backend plugin (https://github.com/ai-dynamo/nixl/blob/main/src/plugins/mooncake/README.md) so that we wouldn't have to re-implement separate connector logic for it.

Have you already considered this option?


dtcccc commented Sep 12, 2025


Thanks for your reply @NickLucche

I am a beginner and not very familiar with nixl, and I found it quite hard to use. I prefer simple deployment, so I install via pip. The new version (>0.3.0) gives me a UCX ERROR at startup, while the old version throws a RuntimeError: BAD_STATUS on the decode node during execution. I am actively looking for a solution, but that issue is unrelated to this PR.

On the other hand, I want to try using the Mooncake Transfer Engine directly to improve performance. I have studied the NixlConnector code closely and used it as a reference, and I'm exploring whether removing one layer of encapsulation and eliminating one handshake (the prefill node sends KV caches directly to the decode node) yields better performance.

I agree that this PR duplicates some existing code. From a maintainability perspective, would it be possible to abstract the common parts at the vLLM level and merge them, while still retaining the ability to integrate with Mooncake directly?


xiaguan commented Sep 15, 2025

Hi, would you mind submitting this PR to the Mooncake repo first? We could test it via an external connector for now.


dtcccc commented Sep 16, 2025


Sure. Which branch and path should I submit to? Could you provide a guideline?


xiaguan commented Sep 16, 2025

Copy the Python file to https://github.com/kvcache-ai/Mooncake/tree/main/mooncake-wheel/mooncake, submit it to main, and add a basic README showing how to use it.


mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @dtcccc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
