feat: integrate deepgemm into EPMoE #5805

Open · wants to merge 10 commits into main

Conversation


@TianQiLin666666 commented Apr 28, 2025

Motivation

For the normal EPMoE path (without DeepEP), integrate DeepGEMM as an optional backend.

Modifications

  1. Add forward_deepgemm to EPMoE, enabled via the environment variable EPMOE_USE_DEEPGEMM.
  2. Add Triton kernels for the pre-reorder and post-reorder steps of forward_deepgemm (hedged sketches of both pieces follow this list).
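
A minimal sketch of how the env gate and forward dispatch might fit together. get_bool_env_var and the flag name appear in the diff below; the helper body, class skeleton, and method stubs here are illustrative assumptions, not the PR's actual code:

import os
import torch

def get_bool_env_var(name: str, default: str = "false") -> bool:
    # mirrors the sglang helper seen in the diff below (exact semantics assumed)
    return os.getenv(name, default).lower() in ("1", "true")

epmoe_use_deepgemm = get_bool_env_var("EPMOE_USE_DEEPGEMM")

class EPMoE(torch.nn.Module):
    # placeholders standing in for the PR's real methods
    def forward_normal(self, hidden_states, router_logits):
        return hidden_states

    def forward_deepgemm(self, hidden_states, router_logits):
        return hidden_states

    def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
        # Hypothetical dispatch: use the DeepGEMM path only when the env flag
        # is set; otherwise keep the existing Triton grouped-GEMM path.
        if epmoe_use_deepgemm:
            return self.forward_deepgemm(hidden_states, router_logits)
        return self.forward_normal(hidden_states, router_logits)

And a hypothetical sketch of the kind of pre-reorder Triton kernel item 2 refers to: gathering each token's hidden state into expert-sorted rows before the grouped GEMM. The names (src2dst, topk) and layout are assumptions for illustration, not the PR's kernels:

import triton
import triton.language as tl

@triton.jit
def pre_reorder_kernel(
    x_ptr,        # [num_tokens, hidden] input hidden states
    out_ptr,      # [num_tokens * topk, hidden] expert-sorted copies
    src2dst_ptr,  # [num_tokens, topk] destination row per (token, expert) pair
    hidden: tl.constexpr,
    topk: tl.constexpr,
    BLOCK: tl.constexpr,
):
    token = tl.program_id(0)  # one program per source token
    for k in range(topk):
        dst = tl.load(src2dst_ptr + token * topk + k)
        for off in range(0, hidden, BLOCK):
            idx = off + tl.arange(0, BLOCK)
            mask = idx < hidden
            vals = tl.load(x_ptr + token * hidden + idx, mask=mask)
            tl.store(out_ptr + dst * hidden + idx, vals, mask=mask)

# launched with one program per token, e.g.:
# pre_reorder_kernel[(num_tokens,)](x, out, src2dst, hidden=h, topk=k, BLOCK=256)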

Evaluation

Speed

On 2 nodes with 8×H20-96G GPUs each (EP16), enabling EPMOE_USE_DEEPGEMM yields a 14% throughput gain.

  • EPMOE_USE_DEEPGEMM disabled
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     192
Benchmark duration (s):                  280.53
Total input tokens:                      672000
Total generated tokens:                  288000
Total generated tokens (retokenized):    285278
Request throughput (req/s):              0.68
Input token throughput (tok/s):          2395.46
Output token throughput (tok/s):         1026.63
Total token throughput (tok/s):          3422.09
Concurrency:                             53.17
Accept length:                           3.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   77682.09
Median E2E Latency (ms):                 79075.32
---------------Time to First Token----------------
Mean TTFT (ms):                          4389.81
Median TTFT (ms):                        758.88
P99 TTFT (ms):                           21329.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.90
Median ITL (ms):                         34.40
P95 ITL (ms):                            121.80
P99 ITL (ms):                            264.88
Max ITL (ms):                            21210.85
==================================================
  • EPMOE_USE_DEEPGEMM enabled
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     192
Benchmark duration (s):                  246.03
Total input tokens:                      672000
Total generated tokens:                  288000
Total generated tokens (retokenized):    285757
Request throughput (req/s):              0.78
Input token throughput (tok/s):          2731.37
Output token throughput (tok/s):         1170.59
Total token throughput (tok/s):          3901.96
Concurrency:                             58.55
Accept length:                           3.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75031.17
Median E2E Latency (ms):                 75272.64
---------------Time to First Token----------------
Mean TTFT (ms):                          4070.91
Median TTFT (ms):                        597.56
P99 TTFT (ms):                           19711.67
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.35
Median ITL (ms):                         34.07
P95 ITL (ms):                            118.96
P99 ITL (ms):                            244.79
Max ITL (ms):                            19567.36
==================================================
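
As a sanity check, the 14% figure follows from total token throughput: 3901.96 / 3422.09 ≈ 1.14. The token totals are also internally consistent with the workload: 192 requests × 3500 input tokens = 672,000 and 192 requests × 1500 output tokens = 288,000, matching both runs.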
  • Launch server commands
# node0
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 44 48 52 56 64 66 68 70 72 74 76 --cuda-graph-max-bs 78 \
--attention-backend fa3 \
--speculative-algo NEXTN --speculative-draft /data/models/DeepSeek-R1-NextN --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 6 \
--model-path /data/models/DeepSeek-R1/ \
--tp 16 \
--dist-init-addr 192.168.0.7:10240 \
--nnodes 2 --node-rank 0 --trust-remote-code \
--host 0.0.0.0 --port 8000 --enable-ep-moe --max-running-requests 78 --mem-fraction-static 0.75 --disable-chunked-prefix-cache

# node1
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 44 48 52 56 64 66 68 70 72 74 76 --cuda-graph-max-bs 78 \
--attention-backend fa3 \
--speculative-algo NEXTN --speculative-draft /data/models/DeepSeek-R1-NextN --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 6 \
--model-path /data/models/DeepSeek-R1/ \
--tp 16 \
--dist-init-addr 192.168.0.7:10240 \
--nnodes 2 --node-rank 1 --trust-remote-code \
--host 0.0.0.0 --port 8000 --enable-ep-moe --max-running-requests 78 --mem-fraction-static 0.75 --disable-chunked-prefix-cache

# client
python3 -m sglang.bench_serving --backend sglang \
            --dataset-name random \
            --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
            --random-input-len 3500 \
            --random-output-len 1500 \
            --random-range-ratio 1 \
            --request-rate 74 \
            --max-concurrency 74 \
            --num-prompts 296 \
            --host 0.0.0.0 --port 8000

Accuracy

MMLU test with mmlu/bench_sglang.py

100%|██████████████████████████████████████████████████████████████████| 1369/1369 [00:47<00:00, 28.74it/s]
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.852
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.925
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.890
Total latency: 47.643
Average accuracy: 0.865

Checklist

@@ -47,6 +55,8 @@

logger = logging.getLogger(__name__)

epmoe_use_deepgemm = get_bool_env_var("EPMOE_USE_DEEPGEMM")
Collaborator: We might import it directly.

Author: So, do you mean we should just replace EPMOE_USE_DEEPGEMM with _ENABLE_JIT_DEEPGEMM?

Collaborator: Yes, enabling _ENABLE_JIT_DEEPGEMM will make DeepGEMM the default configuration for EPMoE.

def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
    if use_deep_gemm and epmoe_use_deepgemm:
Collaborator: Why is the EPMoE DeepGEMM path disabled when use_deep_gemm is enabled?

Author: Otherwise forward_deepgemm would be called whenever use_deep_gemm is enabled, even without EPMOE_USE_DEEPGEMM set.

Collaborator: Are there any cases where Triton GEMM in forward_normal outperforms DeepGEMM?

Author: So far, I haven't found any case where the Triton GEMM in forward_normal outperforms DeepGEMM, but DeepGEMM may occupy more GPU memory.

Collaborator: We could remove epmoe_use_deepgemm and the corresponding environment variable EPMOE_USE_DEEPGEMM for the sake of clarity.

Author: OK, done.
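
With the extra flag removed as agreed above, the dispatch presumably reduces to the single JIT flag. A hedged sketch of the resulting condition (the merged code may differ):

# _ENABLE_JIT_DEEPGEMM is sglang's JIT-DeepGEMM flag (import path assumed)
def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
    # the JIT flag alone now selects the DeepGEMM path
    if _ENABLE_JIT_DEEPGEMM:
        return self.forward_deepgemm(hidden_states, router_logits)
    return self.forward_normal(hidden_states, router_logits)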

@TianQiLin666666 (Author): @xutizhou Could you please help me merge this?
