Description
Motivation.
This RFC tracks both DP (data-parallel attention) and EP (expert parallelism) together, since there is a strong intersection between these features.
A key motivation for DP attention is a reduced memory footprint in MLA models such as DeepSeek V2, V3, and R1. Because MLA's compressed KV cache cannot be sharded across attention heads, running an MLA model with Tensor Parallelism duplicates the KV cache on every GPU, wasting memory. This limits the achievable batch size, which hurts throughput and high-QPS serving performance. To address this, attention needs to be parallelized across GPUs differently; in this RFC we use request-level parallelism, where each GPU serves its own subset of requests and holds only the KV cache for those requests.
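To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The layer count, KV width, token count, and dtype below are illustrative assumptions (loosely based on DeepSeek-V3's MLA configuration), not measured figures:

```python
# Illustrative KV-cache math for an MLA model under TP vs. DP attention.
layers = 61
kv_dim = 576            # compressed latent (512) + rope dims (64) -- assumed
bytes_per_elem = 2      # bf16 cache -- assumed
tokens = 1_000_000      # total tokens resident in the KV cache
gpus = 8

kv_bytes_per_token = layers * kv_dim * bytes_per_elem

# TP-8: the MLA latent has no per-head dimension to shard, so every GPU
# keeps a full copy of the cache.
tp_total = gpus * tokens * kv_bytes_per_token

# DP-8 attention: requests (and their KV) are partitioned across GPUs, so
# the cache is stored exactly once in aggregate.
dp_total = tokens * kv_bytes_per_token

print(f"per-token KV: {kv_bytes_per_token / 1024:.1f} KiB")
print(f"TP-8 aggregate KV memory: {tp_total / 2**30:.1f} GiB")
print(f"DP-8 aggregate KV memory: {dp_total / 2**30:.1f} GiB")
```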
EP is Expert Parallelism for MoE models: experts are distributed across EP ranks, and each token is dispatched to the GPUs holding the experts it is routed to. This is in contrast to Tensor Parallelism, which shards every expert across all GPUs. The main advantage is lower data movement, especially at large GPU counts, since each token only needs to be sent to the subset of EP ranks that hold its experts. It also helps the underlying grouped GEMM kernels operate on tensors with a better aspect ratio.
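As a toy illustration of the dispatch/combine idea: the following is a single-process simulation with made-up sizes and top-1 routing, not vLLM's actual sparse all-to-all kernels.

```python
import torch

# Toy EP setup: 8 experts spread over 4 "ranks", 2 experts per rank.
num_experts, ep_size, hidden = 8, 4, 16
experts_per_rank = num_experts // ep_size
expert_weights = [torch.randn(hidden, hidden) for _ in range(num_experts)]

tokens = torch.randn(10, hidden)
top1_expert = torch.randint(num_experts, (tokens.size(0),))   # toy top-1 routing

# Dispatch: group tokens by the EP rank that owns their routed expert.
dest_rank = top1_expert // experts_per_rank
per_rank_inputs = {r: (dest_rank == r).nonzero(as_tuple=True)[0]
                   for r in range(ep_size)}

# Each rank runs only its local experts on the tokens it received.
out = torch.empty_like(tokens)
for rank, idx in per_rank_inputs.items():
    for e in range(rank * experts_per_rank, (rank + 1) * experts_per_rank):
        sel = idx[top1_expert[idx] == e]
        if sel.numel():
            out[sel] = tokens[sel] @ expert_weights[e]   # local expert GEMM

# Combine: 'out' holds every token's result back in its original order.
print(out.shape)
```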
Proposed Change.
Roadmap.
- Initial support for DP+EP
  - Single-node offline DP ([core] set up data parallel communication #13591)
  - Initial (slow) DP Attention + EP MoE support ([V1] EP/TP MoE + DP Attention #13931)
  - Multi-node offline DP (multi-node offline DP+EP example #15484)
- DP AsyncLLM Server
  - Single-node server support for DP ([V1] AsyncLLM data parallel #13923)
  - Multi-node server ([V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue #15906, [V1] DP scale-out (2/N): Decouple engine process management and comms #15977)
  - External load balancing for integration with external routers ([DP] Support external DP Load Balancer mode #19790)
- Sparse All-to-All ops
  - Support CUDA graphs in MoE layers for DP+EP ([Core] Enable CUDA graphs for DP + All2All kernels #18724; see the lockstep sketch after this list)
  - Integrate PPLX kernels (Integrate PPLX-kernels #16039)
  - Integrate DeepEP+DeepGEMM kernels ([Kernel] DeepEP dispatch-combine kernel integration #18434)
  - Resolve open issues for the DeepGEMM+DeepEP prefill case
  - Further optimize the DeepEP+DeepGEMM low-latency (decode) case (partially done in [EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case #19885)
- Further optimizations
  - Dual Batch Overlap
  - EPLB ([Feature] Expert Parallelism Load Balancer (EPLB) #18343)
  - Add DeepEP+PPLX+NIXL dependencies to the Docker image ([Feature]: Add EP/DP/PD deps in docker image #19653)
- Blocker: Prefill/Decode disaggregation support
  - CPU connector ([WIP] [Core][P/D] CPU connector for PD disagg #18332)
  - Resolve open issues with the GPU NIXL connector and fragmented KV caches
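One coordination detail behind several of the items above (the DP server and the CUDA-graph/All2All work in particular): when EP is enabled, the MoE dispatch/combine is a collective operation, so conceptually all DP ranks have to execute forward passes in lockstep, with idle ranks running dummy batches. The following is a minimal self-contained toy of that scheduling rule, not vLLM's actual engine code; the helper name and string outputs are made up for illustration.

```python
# Toy simulation of 4 DP ranks stepping in lockstep (no real collectives).
DP_SIZE = 4

def step_all_ranks(per_rank_requests):
    """One engine step across all DP ranks."""
    sizes = [len(reqs) for reqs in per_rank_requests]
    if sum(sizes) == 0:
        # Every rank is idle: no forward pass is needed at all.
        return ["idle"] * DP_SIZE
    actions = []
    for rank, n in enumerate(sizes):
        if n > 0:
            actions.append(f"rank{rank}: forward({n} real requests)")
        else:
            # An idle rank still runs a dummy batch so the MoE all-to-all
            # collective inside the forward pass doesn't hang the others.
            actions.append(f"rank{rank}: forward(dummy batch)")
    return actions

print(step_all_ranks([["r0"], [], ["r1", "r2"], []]))
print(step_all_ranks([[], [], [], []]))
```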
CC List.
Any Other Things.
No response