Fix ep deployment issues #4084

CUHKSZzxy · 2025-10-30T03:10:08Z

Modifications

Expose deepep env var

The default DeepEP buffer num_sms value raises the following error on multi-node H200 deployments, so this PR exposes it as an environment variable that users can configure. A value that works on H200 is DEEPEP_BUFFER_NUM_SMS=16.

csrc/kernels/internode.cu:386, condition: ibgda_get_state()->num_rc_per_pe == num_channels or ibgda_get_state()->num_rc_per_pe >= num_sms

This is a known issue in DeepEP:

By default, in megatron, condition fails on ibgda_get_state()->num_rc_per_pe == num_channels || ibgda_get_state()->num_rc_per_pe >= num_sms deepseek-ai/DeepEP#226
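As a minimal sketch of how such an override could be read, the helper below parses DEEPEP_BUFFER_NUM_SMS from the environment and returns None when it is unset, so the caller can fall back to the library default. The function name read_deepep_num_sms and the validation rule are assumptions made for illustration; this is not the code added by this PR.

```python
import os
from typing import Mapping, Optional

# Environment variable name taken from the PR description.
ENV_NAME = "DEEPEP_BUFFER_NUM_SMS"


def read_deepep_num_sms(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the user-configured SM count, or None if the variable is unset.

    Hypothetical helper: the name and the positivity check are illustrative
    assumptions, not part of lmdeploy or DeepEP.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        # Unset or empty: let the caller keep the library default.
        return None
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{ENV_NAME} must be a positive integer, got {value}")
    return value


if __name__ == "__main__":
    # Example: DEEPEP_BUFFER_NUM_SMS=16 python this_file.py
    print(read_deepep_num_sms())
```

Returning None rather than a hard-coded fallback keeps the default in one place (the library), so the override only takes effect when the user sets it explicitly.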

Fix DeepEP mode in CUDA graph

Flip the DeepEP mode between prefill and decode, and clear the buffer (done on the DLBlas side when switching to low latency). Otherwise, a CUDA illegal memory access is triggered in DeepEP or in the DeepGEMM kernel that follows, as reported in:

Manually flip deepep_mode for cuda_graph sgl-project/sglang#11666
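The flipping logic described above can be sketched roughly as follows. All names here (DeepEPMode, ModeSwitcher, on_enter_low_latency) are hypothetical and exist only for illustration. The sketch assumes decode uses the low-latency mode and prefill uses the normal mode, and it models the buffer cleanup as a callback, since the real cleanup happens inside DLBlas.

```python
from enum import Enum
from typing import Callable


class DeepEPMode(Enum):
    # Hypothetical mode names, used only for this illustration.
    NORMAL = "normal"            # assumed to be used for prefill
    LOW_LATENCY = "low_latency"  # assumed to be used for decode


class ModeSwitcher:
    """Tracks the current mode and runs a cleanup hook when entering low latency."""

    def __init__(self, on_enter_low_latency: Callable[[], None]) -> None:
        self.mode = DeepEPMode.NORMAL
        self._on_enter_low_latency = on_enter_low_latency

    def set_for_step(self, is_decoding: bool) -> DeepEPMode:
        target = DeepEPMode.LOW_LATENCY if is_decoding else DeepEPMode.NORMAL
        if target is not self.mode:
            if target is DeepEPMode.LOW_LATENCY:
                # Stand-in for the buffer cleanup done on the DLBlas side.
                self._on_enter_low_latency()
            self.mode = target
        return self.mode


if __name__ == "__main__":
    switcher = ModeSwitcher(lambda: print("clean low-latency buffer"))
    for is_decoding in (False, True, True, False, True):
        print(switcher.set_for_step(is_decoding).value)
```

The point of the sketch is only that the mode must be set explicitly before every prefill or decode step, including steps replayed from a CUDA graph, rather than being left in whatever state the previous step used.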

Upgrade DeepEP / DeepGEMM / DLBlas / FlashMLA

DeepEP -> v1.2.1
DeepGEMM -> v2.1.1.post3
DLBlas -> v0.0.6
FlashMLA -> commit 1408756 (no official release)

Other modifications

Add some deep_gemm cuda dependencies
Pin the torch version to avoid a build / runtime version mismatch (which leads to an undefined symbol error in deep_gemm)
Add vim
Add some comments

CUHKSZzxy added 2 commits October 30, 2025 10:44

fix for multi-node ep

cbe8ddb

add deep_gemm jit dependencies

1de8e67

CUHKSZzxy changed the title ~~Fix ep~~ Fix ep deployment issues Oct 30, 2025

CUHKSZzxy marked this pull request as draft October 30, 2025 03:12

windreamer approved these changes Oct 30, 2025

windreamer self-requested a review October 30, 2025 03:47

windreamer approved these changes Nov 15, 2025

CUHKSZzxy added 5 commits November 24, 2025 16:36

update docker

c17a0e4

merge main

43eb460

set deepep mode for cuda graph

f01cca2

fix

b81e13b

bring back gdrcopy

e95aeb9

2 participants
