Integrate deep-ep nccl backend by irexyc · Pull Request #4477 · InternLM/lmdeploy

irexyc · 2026-03-27T12:51:15Z

Pull request overview

This PR integrates DeepEP-based Expert Parallelism (EP) over the NCCL backend into TurboMind, wiring EP initialization into runtime context creation and extending LLaMA MoE execution to support EP token routing/dispatch/combine.

Changes:

Add DeepEP/NCCL EP backend (NcclCommImpl::InitializeEp/Dispatch/Combine) and build it as a new deepep static library.
Extend TurboMind engine/model parameters for EP (ep_size, ep_rank, ll_max_tokens_per_rank) and initialize EP in TurboMind::Impl::CreateContext.
Update LLaMA unified decoder + MoE FFN to support EP routing and add a fused RMSNorm path that supports EP token partitioning (ReduceScatterV/AllGatherV).
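
For readers unfamiliar with the variable-count collectives named above, the following is a minimal single-process sketch of what a ReduceScatterV / AllGatherV pair conventionally does (the same semantics as MPI's reduce-scatter and allgatherv with per-rank counts). It is not the TurboMind implementation: the function names `reduce_scatter_v` and `all_gather_v`, the list-based "ranks", and the `counts` parameter are all illustrative assumptions.

```python
from typing import List


def reduce_scatter_v(buffers: List[List[float]], counts: List[int]) -> List[List[float]]:
    """Simulate a variable-count reduce-scatter (hypothetical helper).

    buffers[r] is the full-length input held by rank r; counts[r] is the number
    of elements rank r keeps after the element-wise sum across ranks.
    """
    total = sum(counts)
    if any(len(b) != total for b in buffers):
        raise ValueError("every rank must contribute a buffer of sum(counts) elements")
    # Element-wise sum over all ranks.
    summed = [sum(b[i] for b in buffers) for i in range(total)]
    # Split the summed buffer into per-rank segments of unequal length.
    out, start = [], 0
    for c in counts:
        out.append(summed[start:start + c])
        start += c
    return out


def all_gather_v(segments: List[List[float]]) -> List[List[float]]:
    """Simulate a variable-count all-gather: every rank receives the concatenation."""
    gathered = [x for seg in segments for x in seg]
    return [list(gathered) for _ in segments]


# Two ranks, three tokens; rank 0 owns 1 token and rank 1 owns 2.
parts = reduce_scatter_v([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]], counts=[1, 2])
full = all_gather_v(parts)
```

With unequal `counts`, each rank normalizes only its own slice of tokens between the two calls, which is presumably what "EP token partitioning" refers to here.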

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 7 comments.


File	Description
src/turbomind/turbomind.cc	Parse EP/LL params and initialize EP in device communicator during context setup
src/turbomind/models/llama/unified_decoder.{h,cc}	Add EP-aware hidden-state layout + fused RMSNorm integration + partial-token FFN execution
src/turbomind/models/llama/moe_ffn_layer.{h,cc}	Add EP routing/dispatch/combine implementation and EP-mode state
src/turbomind/models/llama/llama_params.h	Add EP + LL threshold parameters to engine/moe config
src/turbomind/models/llama/LlamaDenseWeight.{h,cc}	Shard MoE expert weights by `ep_size/ep_rank`
src/turbomind/models/llama/LlamaDecoderLayerWeight.{h,cc}	Thread EP params into MoE weight construction; adjust MLP TP handling for EP
src/turbomind/models/llama/FusedRMSNormLayer.h	New TP/EP fused RMSNorm abstraction with EP ReduceScatterV/AllGatherV
src/turbomind/kernels/gemm/moe_ep_utils.{h,cu}	New kernels/utilities for EP gating and (LL/HT) combine helpers
src/turbomind/comm/device_comm.h	Extend device-comm interface with ReduceScatterV/AllGatherV and EP APIs
src/turbomind/comm/nccl/{nccl_comm.h,nccl.cu,nccl_ep.cu}	Refactor NCCL comm impl into header + add DeepEP EP ops
src/turbomind/comm/nccl/deep_ep/*	Vendored DeepEP implementation and kernels
src/turbomind/comm/nccl/CMakeLists.txt	Build/link `deepep` and include EP source in `nccl_comm`
lmdeploy/turbomind/turbomind.py	Add EP parallel-config derivation in Python front-end
lmdeploy/turbomind/deploy/{config.py,converter.py,module.py}	Plumb `ep_size` into deploy config and TP sizing for EP
lmdeploy/messages.py	Add `ep` to `TurbomindEngineConfig`
lmdeploy/cli/serve.py	Add CLI wiring to pass `--ep` into engine config
src/turbomind/models/llama/llama_utils.cu	Add `Compare<int64_t>` instantiation


Copilot · 2026-03-27T12:58:50Z

+    if not complete_parallel_config(cfg) and cfg.ep > 1:
+        if cfg.communicator in ['cuda-ipc', 'native']:
+            assert cfg.nnodes == 1, 'TurboMind does not support multi-node with ep > 1'
+        total = cfg.dp * cfg.ep
+        if not cfg.device_num:
+            count = torch.cuda.device_count() * cfg.nnodes
+            if total < count:
+                count = total
+            cfg.device_num = count
+        assert total % cfg.device_num == 0
+        overlap = total // cfg.device_num
+        attn_dp_size = overlap
+        inner_tp_size = cfg.ep // overlap
+        cfg.outer_dp_size = cfg.dp // overlap
+        cfg.attn_dp_size = overlap // cfg.nnodes
+        cfg.attn_tp_size = inner_tp_size // cfg.cp
+        cfg.attn_cp_size = cfg.cp
+        cfg.mlp_dp_size = 1
+        cfg.mlp_tp_size = cfg.attn_dp_size * cfg.attn_tp_size * cfg.attn_cp_size


EP path can compute attn_dp_size = overlap // cfg.nnodes, which becomes 0 for common multi-node cases (e.g., overlap==1 and nnodes>1), violating later invariants and producing invalid parallel config. Since device_num already accounts for nnodes, avoid dividing overlap by nnodes here (or otherwise ensure attn_dp_size>=1 with a correct derivation).
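
The arithmetic behind this comment can be reproduced in isolation. The sketch below copies only the integer math from the hunk above; the helper names (`overlap_for`, `attn_dp_size_as_written`, `attn_dp_size_checked`) are hypothetical, and the "checked" variant is just one possible reading of the suggestion, not the fix adopted in the PR.

```python
def overlap_for(dp: int, ep: int, device_num: int) -> int:
    """Mirror `total = cfg.dp * cfg.ep; overlap = total // cfg.device_num`."""
    total = dp * ep
    if total % device_num != 0:
        raise ValueError("dp * ep must be a multiple of device_num")
    return total // device_num


def attn_dp_size_as_written(overlap: int, nnodes: int) -> int:
    """Mirror `cfg.attn_dp_size = overlap // cfg.nnodes` from the hunk."""
    return overlap // nnodes


def attn_dp_size_checked(overlap: int) -> int:
    """Assumed alternative: do not divide by nnodes, and enforce >= 1."""
    if overlap < 1:
        raise ValueError("attn_dp_size must be at least 1")
    return overlap


# Two nodes with 8 GPUs each: dp=1, ep=16, device_num=16.
overlap = overlap_for(dp=1, ep=16, device_num=16)
broken = attn_dp_size_as_written(overlap, nnodes=2)
fixed = attn_dp_size_checked(overlap)
```

Here `overlap` is 1, so the division by `nnodes` floors to 0, and `mlp_tp_size`, computed as a product that includes `attn_dp_size`, would also collapse to 0.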

Copilot · 2026-03-27T12:58:50Z


+    void SetWarpup(ForwardParam& p);
+
+    void ForwardNative(ForwardParam& p);
+
+    void ForwardFused(ForwardParam& p);
+
+    void RouteTP(ForwardParam& p, Tensor_<float>& logits);
+
+    void RouteEP(ForwardParam& p, Tensor_<float>& logits);
+
+    void CombineTP(ForwardParam& p);
+
+    void CombineEP(ForwardParam& p);


Typo in new private helper name SetWarpup (should be SetWarmup). Keeping the misspelling makes call sites harder to discover/search and looks inconsistent with the existing is_warm_up_ naming.

Copilot · 2026-03-27T12:58:50Z

+struct EpCombineInput {
+    EpMode&                     mode;
+    core::Tensor&               x;
+    std::vector<core::Tensor>&  handle;
+    std::optional<core::Tensor> topk_weights;
+    std::optional<core::Tensor> topk_idx;
+};


EpCombineInput uses std::optional, but this header doesn't include <optional>, which will cause compilation errors depending on include order. Add #include <optional> (and keep headers self-contained).

Copilot · 2026-03-27T12:58:51Z

+    int  comm_nranks_  = -1;     // Number of ranks in NCCL communicator
+
+    ncclComm_t nccl_comm_;
+
+    ncclDevComm_t dev_ht_comm_{};
+    ncclDevComm_t dev_ll_comm_{};
+
+    std::unordered_map<void*, ncclWindow_t> wins_;
+    std::unordered_map<void*, size_t>       buffers_;
+
+    // GIN signal management


This header declares std::unordered_map members (wins_, buffers_) but doesn't include <unordered_map>, which will fail to compile in translation units that include this header first. Add the missing include (and any other required STL headers) to keep the header self-contained.

Copilot · 2026-03-27T12:58:51Z

@@ -596,7 +600,7 @@ MoeFfnWeight::MoeFfnWeight(int             layer_id,
                                                group_size,
                                                act_type,
                                                fuse_silu_act});
-        register_module("experts", *experts.back(), i);
+        register_module("experts", *experts.back(), i + expert_offset);
    }


local_expert_num is computed via integer division (expert_num / ep_size) without validating divisibility. If expert_num isn't a multiple of ep_size, this will silently drop experts and mis-register / mis-load weights. Add a TM_CHECK_EQ(expert_num % ep_size, 0) (and ideally validate ep_rank < ep_size) before computing local_expert_num/expert_offset.
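
A small sketch of the validation being asked for, written in Python for brevity rather than as the C++ that would go into `MoeFfnWeight`. The function name `local_expert_range` is hypothetical, and the formula `expert_offset = ep_rank * local_expert_num` is an assumption about how the offset is derived; the hunk above only shows that an `expert_offset` is added to the module index.

```python
from typing import Tuple


def local_expert_range(expert_num: int, ep_size: int, ep_rank: int) -> Tuple[int, int]:
    """Return (local_expert_num, expert_offset) for one EP rank, with validation."""
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must satisfy 0 <= ep_rank < ep_size")
    if expert_num % ep_size != 0:
        # Without this check, floor division silently drops the remainder experts.
        raise ValueError("expert_num must be a multiple of ep_size")
    local_expert_num = expert_num // ep_size
    expert_offset = ep_rank * local_expert_num  # assumed contiguous sharding
    return local_expert_num, expert_offset


# 64 experts over 8 ranks: rank 3 owns experts 24..31.
num, offset = local_expert_range(expert_num=64, ep_size=8, ep_rank=3)

# What the unchecked division would do with 10 experts on 4 ranks.
covered = (10 // 4) * 4
```

In the second case only 8 of the 10 experts are covered by any rank, which is the silent drop the comment warns about.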

- Resolve conflicts in nccl.cu / turbomind.cc / LlamaDecoderLayerWeight.cc / LlamaLinear.cu / CMakeLists.txt, adopting main's modern fmt-style logger while preserving moe-2's EP / DeepEP / ContextGuard additions.
- Migrate remaining printf-style TM_LOG_* calls in moe-2 added files (deep_ep.cpp, gin_backend.cu, nccl_ep.cu) to fmt-style ({} placeholders), rename TM_LOG_WARNING to TM_LOG_WARN, and switch utils/logger.h includes to core/logger.h.

Made-with: Cursor

Merge of origin/main (aed026f) into moe-2 (aa20784); merge-base e38927c.

Two upstream refactors required re-applying moe-2's DeepEP / expert-parallel work onto a changed foundation rather than a textual merge:

- 01ddf16 (turbomind modeling infra): deleted LlamaDenseWeight / LlamaDecoderLayerWeight / LlamaWeight and the Python turbomind/deploy/* conversion stack; replaced by model_weight / decoder_layer_weight / moe_weight / ffn_weight / linear_weight / ... and a new Python loader. MoeParam/ModelParam removed; geometry now flows via core::*Config X-macros.
- a4025b9 (CUDA error handling): removed check_cuda_error / sync_check_cuda_error / FT_CHECK / CUDRVCHECK in favour of TM_CUDA_CHECK / TM_CUDRV_CHECK / TM_CHECK + manual scope tracing.

Port summary:

- EngineConfig + core::MoeConfig X-macros gain ep_size / ep_rank / ll_max_tokens_per_rank (auto-bound to Python). MoeWeight carries them and prepare() links only the local expert window (local_num_experts/local_expert_offset), exemplar = first local expert.
- moe_ffn_layer / unified_decoder reconciled onto origin ctors + MoeWeight / DecoderLayerWeight accessors; RouteEP/CombineEP/SetWarmup and FusedRMSNormLayer/HiddenStateLayout layered on. Fused-only path (origin has no MoeParam::kNaive equivalent).
- LlamaLinear: origin LinearWeight/out-param API + EP fp8-scales overload; dispatch driven by total mapping size to match merged moe_utils_v2 (num_expert_tokens) semantics.
- nccl.cu: kept moe-2's out-of-line structure (needed by nccl_ep/nccl_comm.h), re-applied the a4025b9 macro/scope conversion.
- New moe-2 files (moe_ep_utils, nccl_ep, FusedRMSNormLayer.h) converted to the new error macros.
- turbomind.cc: dropped YAML parsing (EngineConfig-driven); EP InitializeEp relocated to ProcessWeights where ModelWeight geometry is known.
- Python: ec.ep_size / ec.ll_max_tokens_per_rank plumbed; make_moe_config gained EP fields (defaults keep the TP path identical).

Build verified: `ninja _turbomind` links the full extension cleanly.

Follow-up (runtime EP enablement, owner-validated): per-GPU ep_rank ParallelGroup through the new model loader/builders and local-expert-range expert construction across the rewritten model specs. Defaults make ep_size→1 (safe TP fallback); non-EP paths are unaffected.

Backup: branch backup/moe-2-premerge, tag premerge-moe-2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lvhan028 · 2026-05-19T03:10:37Z

+
+  target_link_libraries(nccl_comm PRIVATE deepep)
+else()
+  message(STATUS "Skip deepep build because NCCL ${NCCL_VERSION_STRING} < 2.29.7")


Should we raise a FATAL error message here instead?

lvhan028 · 2026-05-19T03:17:02Z

+    int num_nodes;
+    int num_experts;
+    int experts_per_token;
+    int hidden;


The conventional "hidden_size" or "hidden_dim" would be preferable.

irexyc added 2 commits March 27, 2026 11:48

integrate deep-ep nccl backend (intranode + low_latency kernels

d9817ba

internode normal kernels

2769bd0

Copilot AI review requested due to automatic review settings March 27, 2026 12:51

Copilot started reviewing on behalf of irexyc March 27, 2026 12:51

Copilot AI reviewed Mar 27, 2026


irexyc added 5 commits March 30, 2026 12:45

fix internode

64acac6

update build

c3bb4f3

update build

e83ab90

fix windows build

03f6f09

fix windows build

70fe0e0

lvhan028 added the enhancement New feature or request label Apr 2, 2026

irexyc and others added 14 commits April 7, 2026 14:08

move deepep to 3rdparty

acc13a9

fix fp8 model with bf16 dispatch

012cb0c

use fp8 dispatch for ht kernels

73ba5b8

update ll-combine-kernel to use dense input x

17dc755

remove the layout transformation in ll-dispatch-postprocess

50e46ed

support fp8-model-fp8-dispath for ll kernel

c9f4a1e

support bf16-model-fp8-dispath for ll kernel

4ce0941

fix lint

57a698a

fix NcclCommImpl::Broadcast

2ea24c0

zero-copy for ll kernel combine

4b33329

remove busy-wait for ll

db371dd

remove busy-wait for ht

38efa01

fix ht combine after removing busy-wait

fb0abad

lvhan028 mentioned this pull request Apr 29, 2026

[Feature] TurboMind backend support for vision models #4562

Open

allocate buffer in advance

aa20784

lvhan028 mentioned this pull request May 8, 2026

feat: Turbomind linear gdn prefix caching #4465

Open

lvhan028 requested review from lvhan028 and lzhangzz May 12, 2026 09:35

irexyc and others added 4 commits May 18, 2026 04:55

fix lint

ab7b13e

fix convert

e4e3374

update

257d934

lvhan028 reviewed May 19, 2026

irexyc wants to merge 26 commits into InternLM:main from irexyc:moe-2


3 participants
