[PG] Implement efficient P2P proxy for low-latency send/recv communication (#1533)
yuechen-sys wants to merge 6 commits into kvcache-ai:main
Conversation
- Add `P2PProxy` to own P2P workers, queues, and P2P resource lifecycle - Move backend P2P send/recv processing into proxy and delegate from `MooncakeBackend` - Introduce per-peer `PeerLane` to handle inter-peer communication - Implement Lamport's circular buffer algorithm for RDMA - Support CUDA async copy
- Batching head/tail updates to avoid frequent update overhead - Remove seq check - Code format
- Add _mm_pause() - Implement cuda event pool for faster data copy submit - Set per-proxy cuda device index - Improve code readability for send/recv worker
- Remove condition variable wait - Add documentation for control slots
Summary of Changes (Gemini Code Assist): This pull request overhauls the point-to-point communication mechanism within the Mooncake backend. By introducing a P2P proxy, it redesigns how data is sent and received between ranks. The changes focus on enhancing efficiency, concurrency, and scalability through a clear separation of concerns and optimized data structures, leading to a more performant and architecturally sound communication layer.
Code Review
This pull request introduces a significant and well-designed refactoring of the P2P communication logic by encapsulating it within a new P2PProxy class. This greatly improves the code structure, separating concerns and making the MooncakeBackend implementation much cleaner and easier to follow. The new design, based on a ring buffer with implicit head/tail coordination, is a solid approach for achieving low-latency communication. I've also noticed and appreciate the fix for a potential resource leak in connectionPoller.
I have a few suggestions to further improve performance and maintainability, mainly concerning a performance-critical synchronization point in the receive path and some code clarity improvements. Overall, this is an excellent contribution.
```cpp
op_ctx.original_tensor_.copy_(op_ctx.tensor_);
if (!is_cpu_) {
  const cudaError_t sync_error = cudaDeviceSynchronize();
  TORCH_CHECK(sync_error == cudaSuccess,
              "P2P recv final copy cudaDeviceSynchronize "
              "failed: ",
              cudaGetErrorString(sync_error));
}
```
Using cudaDeviceSynchronize() is a major performance concern as it stalls the entire GPU, affecting all running computations. For synchronizing the final copy to a non-contiguous tensor, it's much more efficient to use cudaStreamSynchronize() on the specific stream of the operation.
Additionally, at::Tensor::copy_ is enqueued on the current CUDA stream. To ensure it uses the correct stream associated with the operation (op_ctx.cuda_stream_), you should use an at::cuda::CUDAStreamGuard.
```cpp
if (is_cpu_) {
  op_ctx.original_tensor_.copy_(op_ctx.tensor_);
} else {
  at::cuda::CUDAStreamGuard guard(op_ctx.cuda_stream_);
  op_ctx.original_tensor_.copy_(op_ctx.tensor_);
  const cudaError_t sync_error = cudaStreamSynchronize(op_ctx.cuda_stream_);
  TORCH_CHECK(sync_error == cudaSuccess,
              "P2P recv final copy cudaStreamSynchronize failed: ",
              cudaGetErrorString(sync_error));
}
```

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <thread>
```
The PAUSE() macro is used in the worker threads for busy-waiting, which is good for low-latency spin loops. However, its definition is not present in this file, which can lead to compilation issues. It's good practice to include the necessary header (e.g., <immintrin.h> for _mm_pause() on x86) and provide a definition.
```cpp
#include <thread>

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>  // For _mm_pause()
#define PAUSE() _mm_pause()
#else
#define PAUSE() do {} while (false)
#endif
```

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Description
This PR introduces a series of improvements to send/recv communication in the Mooncake backend. These changes address the performance issues reported in Issue #1421.

This PR also resolves some remaining issues from previous work and achieves better performance.
Changes
P2P Proxy
In most xCCL libraries (e.g., NCCL, UCCL, RCCL), `send`/`recv` serve as the fundamental building blocks for higher-level collective operations such as `all_reduce`, `all_gather`, etc. For example, the ring-based `all_reduce` algorithm is composed of multiple stages of `send`/`recv`. Therefore, P2P communication should be designed as an independent module that other collective primitives can reliably build upon.

To support concurrent bidirectional traffic efficiently, we split the proxy into two dedicated components, one per direction (the send and recv workers).
We introduce new abstractions for the proxy: `OpContext`, `TransferTask`, and `PeerLane`.

- `SendOpContext` represents a single logical `send` operation and is decomposed into multiple per-chunk `SendTransferTask`s.
- `RecvOpContext` and `RecvTransferTask` play the corresponding roles on the receive side.
- `SendPeerLane` and `RecvPeerLane` are responsible for managing resources in a single peer direction.

We further decouple the control plane and data plane:
- CPU (Control Plane)
- GPU SM (Data Plane)
By separating control signaling from data movement, this design minimizes synchronization overhead and allows CPU-side scheduling to overlap with GPU-side execution, improving overall throughput and concurrency.
Ring Buffer
Note that data consistency for `send`/`recv` should be managed at the user level rather than enforced by the backend implementation. The backend guarantees strict ordering of operations but assumes that the peer issues matching calls with the same message size.

To improve large tensor transfers, we introduce a ring-buffer-based P2P design. This enables efficient chunked writes while eliminating explicit TCP-based `ack` synchronization. Instead of relying on an explicit acknowledgment flag, we use implicit `head`/`tail` coordination between the sender and receiver.
acksynchronization. Instead of relying on an explicit acknowledgment flag, we use implicithead/tailcoordination between the sender and receiver.A ring buffer is well-suited for our scenario, as it naturally supports pipelined chunked transfers with minimal synchronization overhead. The ring buffer operates under the following conditions:
- `head == tail`: the buffer is empty (the receiver waits for incoming data)
- `(head + 1) % capacity == tail`: the buffer is full (the sender waits for available slots)
- `head != tail`: the receiver consumes available data

Future Plans
TENT
Current intra-node P2P performance is bounded by the RDMA-only Mooncake TE and its static topology. Switching to Mooncake TENT is expected to close the performance gap between the Mooncake PG and NCCL.
GPU SM channels
GPU SM channels have a huge advantage in the data path compared with a CPU-based proxy.
Type of Change
How Has This Been Tested?
Send/receive performance is measured using a thorough in-house benchmark similar to nccl-tests. Measurements for small messages are imprecise because the Mooncake backend lacks a barrier.

The performance comparison differs slightly from the previous one, as the new design reduces latency for small message sizes.

Note: performance is measured with the same buffer size (16 MB per peer) as NCCL.
Checklist
Ran `./scripts/code_format.sh` before submitting.