[Public release 26/04] Introducing EPv2: faster EP, and Engram/PP/CP supports #605
Build failed on CUDA 12.8. dependency:

cuda/barrier contains device-only CUDA intrinsics (such as __cvta_generic_to_shared), which cause compilation errors in host code. The previous issue was caused by the following include chain: python_api.cpp (host code) -> elastic/buffer.hpp -> layout.cuh -> ptx.cuh -> cuda/barrier.
This appears to be caused by some issues with CUDA 12.8. Thank you for pointing it out; a workaround has been applied.

Hi, dear DeepEP developers, I'm interested in replacing NVSHMEM with NCCL GIN. What would be the main benefits of making that switch?
Thanks very much. We can build and run on CUDA 12.8 using the branch: https://github.com/deepseek-ai/DeepEP/tree/try-fix-cu128

It seems that we don't use the companion QPs at all, which wastes QPs. Is there any way to avoid creating companion QPs?

* enhance: add env:EP_NIC_NAME to config nic name

---------

Co-authored-by: fujianhao.fjh <fujianhao.fjh@alipay.com>
The first will be optimized in NCCL to avoid creating companion QPs. Tracked in NVIDIA/nccl#2134.

We found the

This looks like an issue with NCCL dev_comm. Could you help take a look? Thank you very much!
Please ignore this issue. It occurs because my two nodes are not in the same NVLink domain.
Bumps the deep_ep git pin in pyproject.toml from bfded348 (2025-10-29, pre-V2) to b306af0 (2026-04-29), which is the merge commit of deepseek-ai/DeepEP#605 "Introducing EPv2".

Why
---
The current pin predates the DeepEP V2 API (ElasticBuffer, PP/CP/Engram support). Consumers of NeMo-RL's Megatron backend that follow NVIDIA/Megatron-LM#4632 ("Shape Y" Megatron V2 adoption) cannot resolve deep_ep.ElasticBuffer with the current pin; the virtualenv still installs the pre-V2 tree.

This change bumps only the pin. It does not by itself change any NeMo-RL code path. Paired with Megatron-LM#4632, it enables the end-to-end V2 path that is already running on AWS p5en.48xlarge 2x H200 in the reproduction repo below.

Upstream references
-------------------
* deepseek-ai/DeepEP#605 (V2 merge 2026-04-29)
* NVIDIA/Megatron-LM#4632 (Megatron-side V2 adoption)

Reproduction
------------
End-to-end reproduction (Dockerfile + K8s manifests + smoke bench) is public at: https://github.com/antonai-work/nemo-rl-deepep-v2-efa

Related NeMo-RL PR (separate concern, same fleet): NVIDIA-NeMo#2410 (Dockerfile LD_LIBRARY_PATH for EFA OFI discovery)

Signed-off-by: Anton Alexander <antonai@users.noreply.github.com>
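The commit message above says that a pre-V2 install cannot resolve deep_ep.ElasticBuffer. A minimal sketch of how a consumer might probe for that attribute before choosing a code path follows. The helper name `module_exposes` is hypothetical and not part of DeepEP or NeMo-RL; only the `deep_ep` module name and the `ElasticBuffer` attribute come from the text.

```python
import importlib


def module_exposes(module_name: str, attr: str = "ElasticBuffer") -> bool:
    """Return True if `module_name` imports cleanly and has attribute `attr`.

    Hypothetical helper for illustration only. It catches ImportError alone,
    so other import-time failures (for example a missing shared library
    raising OSError) would still propagate to the caller.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed (or one of its own imports is missing).
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # With a pre-V2 pin this is expected to report False, per the text above.
    print("deep_ep exposes ElasticBuffer:", module_exposes("deep_ep"))
```

Such a probe only reports whether the symbol is present; it says nothing about whether the installed build matches the pinned commit.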
…itch
Add `RTP_LLM_DEEPEP_BACKEND={legacy,elastic}` runtime switch so rtp-llm can
keep the v1-compatible `deep_ep::legacy::Buffer` path (default) and
opt into the v2 unified `ElasticBuffer` (PR deepseek-ai/DeepEP#605) without
forking the engine. Unknown values silently fall back to legacy so a typo
never flips a production deploy to v2.
- deepep_wrapper.py: add DeepEPBackend / deepep_backend() / is_elastic_buffer
helpers, branch _init_deepep_buffer at the entry, and add a new
_init_elastic_buffer that uses ElasticBuffer.get_buffer_size_hint to size
the unified buffer. The three legacy init paths
(_init_normal_buffer / _init_low_latency_buffer / _init_low_latency_m2n_buffer)
are kept verbatim so v1 behavior is preserved.
- deepep_normal_router.py: branch prepare()/finalize() at the entry on
is_elastic_buffer(); the elastic path skips get_dispatch_layout (unified
API computes layout internally), uses the v2 dispatch 5-tuple and reads
num_recv_tokens_per_expert_list from EPHandle. The legacy 6-tuple call
site is untouched.
- deepep_low_latency_router.py: collapse the v1 two-API setup
(low_latency_dispatch / low_latency_combine) onto ElasticBuffer's unified
dispatch / combine when the elastic backend is active. _is_elastic is
cached at construction time so prepare()/finalize() only branch once.
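The switch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual rtp-llm code: the names `DeepEPBackend`, `deepep_backend`, `is_elastic_buffer`, and `RTP_LLM_DEEPEP_BACKEND` come from the commit message, while the enum values, the exact-match rule, and the `environ` parameter (added here so the function can be exercised without touching the process environment) are assumptions.

```python
import os
from enum import Enum
from typing import Mapping, Optional

ENV_VAR = "RTP_LLM_DEEPEP_BACKEND"


class DeepEPBackend(Enum):
    LEGACY = "legacy"    # v1-compatible path (default)
    ELASTIC = "elastic"  # v2 unified ElasticBuffer path


def deepep_backend(environ: Optional[Mapping[str, str]] = None) -> DeepEPBackend:
    """Resolve the backend from the environment.

    Unset or unknown values fall back to LEGACY without raising, so a typo
    cannot select the v2 path. Matching is exact (an assumption of this
    sketch): "Elastic" or "elastc" both resolve to LEGACY.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR, DeepEPBackend.LEGACY.value)
    try:
        return DeepEPBackend(raw)
    except ValueError:
        return DeepEPBackend.LEGACY


def is_elastic_buffer(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True only when the elastic backend was explicitly requested."""
    return deepep_backend(environ) is DeepEPBackend.ELASTIC
```

A router that wants to branch only once, as the low-latency router does, could store the result of `is_elastic_buffer()` in an attribute at construction time and test that attribute in `prepare()` and `finalize()`.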


With the evolution of hardware, networking, and model architectures, the previous DeepEP V1 had accumulated too much legacy baggage and performance issues. Today, we are excited to introduce DeepEP V2, which includes a complete refactoring of Expert Parallelism — achieving extreme performance with several times fewer SM resources compared to V1, while supporting significantly larger scale-up and scale-out domains — as well as experimental 0 SM Engram, 0 SM Pipeline Parallelism, and 0 SM Context Parallelism all-gather.
We are also happy to announce that we have switched from the NVSHMEM backend to the more lightweight NCCL GIN backend.
New Features
Notes
Still On-going Features
Performance
Following V3's configuration, we tested with 8K tokens per batch, 7168 hidden dimensions, top 8 experts, FP8 dispatching, and BF16 combining, and obtained the following results:
Note that the results are logical bandwidth. For example, in the `EP 8 x 2` case, the 90 GB/s figure includes local rank traffic.
Compared with V1, V2 achieves up to 1.3x peak performance while using up to 4x fewer SMs.
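To give a sense of the data volumes behind the benchmark configuration, here is a rough arithmetic sketch. It assumes "8K tokens" means 8192, counts only the hidden-state payload (FP8 scaling factors and routing metadata are ignored), and treats top-8 as an upper bound on how many copies of a token are sent. It does not reproduce the authors' definition of logical bandwidth or any measured result.

```python
def payload_bytes(num_tokens: int, hidden: int, bytes_per_element: int) -> int:
    """Bytes needed to hold one copy of every token's hidden state."""
    return num_tokens * hidden * bytes_per_element


NUM_TOKENS = 8 * 1024  # assumption: "8K tokens per batch" means 8192
HIDDEN = 7168          # hidden dimension, from the text
TOP_K = 8              # experts per token, from the text

# FP8 dispatch uses 1 byte per element; BF16 combine uses 2 bytes per element.
dispatch_bytes = payload_bytes(NUM_TOKENS, HIDDEN, 1)
combine_bytes = payload_bytes(NUM_TOKENS, HIDDEN, 2)

# Upper bound: every token is sent to TOP_K distinct destinations.
max_dispatch_bytes = dispatch_bytes * TOP_K

MIB = 1024 * 1024
if __name__ == "__main__":
    print(f"dispatch, one copy per token: {dispatch_bytes / MIB:.0f} MiB")
    print(f"combine, one copy per token:  {combine_bytes / MIB:.0f} MiB")
    print(f"dispatch, top-{TOP_K} upper bound: {max_dispatch_bytes / MIB:.0f} MiB")
```

Under these assumptions, one copy of the batch is 56 MiB in FP8 and 112 MiB in BF16. Because logical bandwidth includes local rank traffic, part of such a payload never crosses a link.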
We omit results for larger EP configurations for the time being, but encourage interested users to benchmark them directly. Based on our internal experience, we expect the kernel to continue saturating hardware bandwidth at scale.
Contributors