[PG] Implement efficient P2P proxy for low-latency send/recv communication #1533

Open

yuechen-sys wants to merge 6 commits into kvcache-ai:main from yuechen-sys:pr/p2p-improve

Conversation

Collaborator

@yuechen-sys yuechen-sys commented Feb 11, 2026

Description

This PR introduces a series of improvements to send/recv communication in the Mooncake backend. These changes address the performance issues reported in Issue #1421.

This PR also resolves some remaining issues from previous work and achieves better performance.

Changes

P2P Proxy

In most xCCL libraries (e.g., NCCL, UCCL, RCCL), send/recv serve as the fundamental building blocks for higher-level collective operations such as all_reduce, all_gather, etc. For example, the ring-based all_reduce algorithm is composed of multiple stages of send/recv. Therefore, P2P communication should be designed as an independent module that other collective primitives can reliably build upon.

To support concurrent bidirectional traffic efficiently, we split the proxy into two dedicated components, one for sending and one for receiving:

P2P Proxy Architecture

We introduce new abstractions for the proxy:

  • OpContext, TransferTask and PeerLane

    • SendOpContext represents a single logical send operation.
    • Each SendOpContext is decomposed into multiple per-chunk SendTransferTasks.
    • The receive side follows the same abstraction:
      • RecvOpContext
      • RecvTransferTask
    • SendPeerLane and RecvPeerLane are responsible for managing resources in a single peer direction.

We further decouple the control plane and data plane:

  • CPU (Control Plane)

    • Prepares and consumes ring-buffer slots
    • Submits data copy tasks
    • Issues transport operations (e.g., RDMA, NVLink)
  • GPU SM (Data Plane)

    • Executes the actual data movement

By separating control signaling from data movement, this design minimizes synchronization overhead and allows CPU-side scheduling to overlap with GPU-side execution, improving overall throughput and concurrency.

Ring Buffer

Note that data consistency for send/recv should be managed at the user level rather than enforced by the backend implementation. The backend guarantees strict ordering of operations but assumes that the peer issues matching calls with the same message size.

To improve large tensor transfers, we introduce a ring-buffer–based P2P design. This enables efficient chunked writes while eliminating explicit TCP-based ack synchronization. Instead of relying on an explicit acknowledgment flag, we use implicit head/tail coordination between the sender and receiver.

A ring buffer is well-suited for our scenario, as it naturally supports pipelined chunked transfers with minimal synchronization overhead. The ring buffer operates under the following conditions:

  • The buffer is empty when head == tail (the receiver waits for incoming data)
  • The buffer is full when (head + 1) % capacity == tail (the sender waits for available slots)
  • When head != tail, the receiver consumes available data

Future Plans

TENT

Current intra-node P2P performance is bounded by the RDMA-only Mooncake TE and its static topology. Switching to Mooncake TENT is expected to close the performance gap between Mooncake PG and NCCL.

GPU SM channels

GPU SMs have a significant advantage on the data path compared with a CPU-based proxy.

  • Multi-channel SM: multi-threaded GPU kernels handle send ops concurrently.
  • Low-latency synchronization: the GPU directly updates the remote head/tail to avoid CPU–GPU round-trip latency.

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Send/receive performance is measured using a thorough in-house benchmark similar to nccl-tests. Small-message results are imprecise because the Mooncake backend lacks a barrier.

The performance comparison differs slightly from the previous one, as the new design reduces the latency of small messages.

Note: performance is measured under the same buffer size (16MB per-peer) as NCCL

[Charts: bandwidth and latency comparison]

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

  - Add `P2PProxy` to own P2P workers, queues, and P2P resource lifecycle
  - Move backend P2P send/recv processing into the proxy and delegate from `MooncakeBackend`
  - Introduce per-peer `PeerLane` to handle inter-peer communication
  - Implement Lamport's circular buffer algorithm for RDMA
  - Support CUDA async copy
  - Batch head/tail updates to avoid frequent update overhead
  - Remove seq check
  - Code format
  - Add `_mm_pause()`
  - Implement CUDA event pool for faster data copy submission
  - Set per-proxy CUDA device index
  - Improve code readability for send/recv worker
  - Remove condition variable wait
  - Add documentation for control slots
@gemini-code-assist
Contributor

Summary of Changes

Hello @yuechen-sys, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request overhauls the point-to-point communication mechanism within the Mooncake backend. By introducing a sophisticated P2P proxy, it fundamentally redesigns how data is sent and received between ranks. The changes focus on enhancing efficiency, concurrency, and scalability through a clear separation of concerns and optimized data structures, leading to a more performant and architecturally sound communication layer.

Highlights

  • New P2P Proxy Architecture: Introduced a robust P2P proxy with dedicated send and receive components to efficiently handle concurrent bidirectional traffic, addressing performance and architectural issues.
  • Decoupled Control and Data Planes: Separated CPU-based control signaling (preparing slots, submitting tasks) from GPU SM-based data movement, minimizing synchronization overhead and improving throughput.
  • Ring-Buffer Based Communication: Implemented a ring-buffer design for P2P transfers, enabling efficient chunked data movement and implicit head/tail coordination, eliminating the need for explicit TCP-based acknowledgments.
  • New Abstractions for P2P Operations: Defined OpContext, TransferTask, and PeerLane to manage P2P operations, breaking down logical send/recv into granular, chunk-based tasks.
  • Performance Improvements: Achieved significant performance gains in send/recv communication, particularly for small message sizes, as demonstrated by updated benchmarks.


Changelog
  • mooncake-pg/include/mooncake_backend.h
    • Removed legacy P2P operation structures and worker thread declarations.
    • Included the new p2p_proxy.hh header.
    • Replaced the old P2P queue and thread infrastructure with a std::unique_ptr<P2PProxy>.
  • mooncake-pg/include/mooncake_worker.cuh
    • Removed obsolete P2P slot constants.
    • Added new P2P buffer and control region addresses to SegmentInfo.
    • Removed old P2P sequence tracking variables from TransferGroupMeta.
  • mooncake-pg/include/p2p_proxy.hh
    • Added a new header file defining the P2PProxy class and its supporting data structures.
    • Introduced constants for P2P buffer sizes and slot counts.
    • Defined AtomicHeadTail and P2PControlSlot for ring buffer management.
    • Declared core P2P operation contexts, transfer tasks, and peer lane management structures.
  • mooncake-pg/setup.py
    • Added src/p2p_proxy.cc to the list of compiled source files.
    • Adjusted compilation arguments for improved readability.
  • mooncake-pg/src/mooncake_backend.cpp
    • Included the new p2p_proxy.hh header.
    • Simplified MooncakeP2PWork by removing error message handling.
    • Initialized and managed the P2PProxy instance within the MooncakeBackend constructor and destructor.
    • Updated SegmentInfo with P2P proxy buffer and control region pointers.
    • Refactored send and recv methods to utilize the new P2PProxy for enqueueing operations.
    • Modified shutdown() to properly stop and release P2PProxy resources.
    • Updated connectionPoller to free batch IDs.
    • Delegated startP2PWorker and stopP2PWorker calls to the P2PProxy instance.
    • Removed all deprecated P2P worker thread functions and related processing logic.
  • mooncake-pg/src/p2p_proxy.cc
    • Added a new source file implementing the P2PProxy class functionality.
    • Implemented resource allocation and deallocation for P2P buffers and control regions.
    • Provided methods to start and stop dedicated send and receive worker threads.
    • Implemented logic for enqueuing send and receive operations.
    • Defined helper functions for calculating local and remote P2P slot addresses.
    • Implemented the state machine for processing send and receive transfer tasks, including data copy, transfer, and head/tail commit steps.
    • Implemented the SendWorkerThread and RecvWorkerThread to manage the lifecycle of P2P operations.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the P2P communication logic by encapsulating it within a new P2PProxy class. This greatly improves the code structure, separating concerns and making the MooncakeBackend implementation much cleaner and easier to follow. The new design, based on a ring buffer with implicit head/tail coordination, is a solid approach for achieving low-latency communication. I've also noticed and appreciate the fix for a potential resource leak in connectionPoller.

I have a few suggestions to further improve performance and maintainability, mainly concerning a performance-critical synchronization point in the receive path and some code clarity improvements. Overall, this is an excellent contribution.

Comment on lines +858 to +865
op_ctx.original_tensor_.copy_(op_ctx.tensor_);
if (!is_cpu_) {
    const cudaError_t sync_error = cudaDeviceSynchronize();
    TORCH_CHECK(sync_error == cudaSuccess,
                "P2P recv final copy cudaDeviceSynchronize "
                "failed: ",
                cudaGetErrorString(sync_error));
}

high

Using cudaDeviceSynchronize() is a major performance concern as it stalls the entire GPU, affecting all running computations. For synchronizing the final copy to a non-contiguous tensor, it's much more efficient to use cudaStreamSynchronize() on the specific stream of the operation.

Additionally, at::Tensor::copy_ is enqueued on the current CUDA stream. To ensure it uses the correct stream associated with the operation (op_ctx.cuda_stream_), you should use an at::cuda::CUDAStreamGuard.

                    if (is_cpu_) {
                        op_ctx.original_tensor_.copy_(op_ctx.tensor_);
                    } else {
                        at::cuda::CUDAStreamGuard guard(op_ctx.cuda_stream_);
                        op_ctx.original_tensor_.copy_(op_ctx.tensor_);
                        const cudaError_t sync_error = cudaStreamSynchronize(op_ctx.cuda_stream_);
                        TORCH_CHECK(sync_error == cudaSuccess,
                                    "P2P recv final copy cudaStreamSynchronize failed: ",
                                    cudaGetErrorString(sync_error));
                    }

#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <thread>

medium

The PAUSE() macro is used in the worker threads for busy-waiting, which is good for low-latency spin loops. However, its definition is not present in this file, which can lead to compilation issues. It's good practice to include the necessary header (e.g., <immintrin.h> for _mm_pause() on x86) and provide a definition.

#include <thread>
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h> // For _mm_pause()
#define PAUSE() _mm_pause()
#else
#define PAUSE() do {} while (false)
#endif

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@codecov-commenter
Copy link

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

