feat: integrate deepgemm into EPMoE #5805

Open · wants to merge 10 commits into main

Conversation


@TianQiLin666666 commented Apr 28, 2025

Motivation

For the normal EPMoE path (without DeepEP), integrate DeepGEMM as an optional backend.

Modifications

  1. Add forward_deepgemm to EPMoE, enabled via the environment variable EPMOE_USE_DEEPGEMM.
  2. Add Triton kernels for the pre-reorder and post-reorder steps of forward_deepgemm (hedged sketches of both pieces follow this list).
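
A minimal sketch of how the env gate and forward dispatch might fit together. get_bool_env_var and the flag name appear in the diff below; the helper body, class skeleton, and method stubs here are illustrative assumptions, not the PR's actual code:

import os
import torch

def get_bool_env_var(name: str, default: str = "false") -> bool:
    # mirrors the sglang helper seen in the diff below (exact semantics assumed)
    return os.getenv(name, default).lower() in ("1", "true")

epmoe_use_deepgemm = get_bool_env_var("EPMOE_USE_DEEPGEMM")

class EPMoE(torch.nn.Module):
    # placeholders standing in for the PR's real methods
    def forward_normal(self, hidden_states, router_logits):
        return hidden_states

    def forward_deepgemm(self, hidden_states, router_logits):
        return hidden_states

    def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
        # Hypothetical dispatch: use the DeepGEMM path only when the env flag
        # is set; otherwise keep the existing Triton grouped-GEMM path.
        if epmoe_use_deepgemm:
            return self.forward_deepgemm(hidden_states, router_logits)
        return self.forward_normal(hidden_states, router_logits)

And a hypothetical sketch of the kind of pre-reorder Triton kernel item 2 refers to: gathering each token's hidden state into expert-sorted rows before the grouped GEMM. The names (src2dst, topk) and layout are assumptions for illustration, not the PR's kernels:

import triton
import triton.language as tl

@triton.jit
def pre_reorder_kernel(
    x_ptr,        # [num_tokens, hidden] input hidden states
    out_ptr,      # [num_tokens * topk, hidden] expert-sorted copies
    src2dst_ptr,  # [num_tokens, topk] destination row per (token, expert) pair
    hidden: tl.constexpr,
    topk: tl.constexpr,
    BLOCK: tl.constexpr,
):
    token = tl.program_id(0)  # one program per source token
    for k in range(topk):
        dst = tl.load(src2dst_ptr + token * topk + k)
        for off in range(0, hidden, BLOCK):
            idx = off + tl.arange(0, BLOCK)
            mask = idx < hidden
            vals = tl.load(x_ptr + token * hidden + idx, mask=mask)
            tl.store(out_ptr + dst * hidden + idx, vals, mask=mask)

# launched with one program per token, e.g.:
# pre_reorder_kernel[(num_tokens,)](x, out, src2dst, hidden=h, topk=k, BLOCK=256)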

Evaluation

Speed

On 2 nodes with 8×H20-96G GPUs each (EP16), enabling EPMOE_USE_DEEPGEMM yields a 14% throughput gain.

  • EPMOE_USE_DEEPGEMM disabled
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     192
Benchmark duration (s):                  280.53
Total input tokens:                      672000
Total generated tokens:                  288000
Total generated tokens (retokenized):    285278
Request throughput (req/s):              0.68
Input token throughput (tok/s):          2395.46
Output token throughput (tok/s):         1026.63
Total token throughput (tok/s):          3422.09
Concurrency:                             53.17
Accept length:                           3.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   77682.09
Median E2E Latency (ms):                 79075.32
---------------Time to First Token----------------
Mean TTFT (ms):                          4389.81
Median TTFT (ms):                        758.88
P99 TTFT (ms):                           21329.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.90
Median ITL (ms):                         34.40
P95 ITL (ms):                            121.80
P99 ITL (ms):                            264.88
Max ITL (ms):                            21210.85
==================================================
  • EPMOE_USE_DEEPGEMM enabled
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     192
Benchmark duration (s):                  246.03
Total input tokens:                      672000
Total generated tokens:                  288000
Total generated tokens (retokenized):    285757
Request throughput (req/s):              0.78
Input token throughput (tok/s):          2731.37
Output token throughput (tok/s):         1170.59
Total token throughput (tok/s):          3901.96
Concurrency:                             58.55
Accept length:                           3.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75031.17
Median E2E Latency (ms):                 75272.64
---------------Time to First Token----------------
Mean TTFT (ms):                          4070.91
Median TTFT (ms):                        597.56
P99 TTFT (ms):                           19711.67
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.35
Median ITL (ms):                         34.07
P95 ITL (ms):                            118.96
P99 ITL (ms):                            244.79
Max ITL (ms):                            19567.36
==================================================
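
As a sanity check, the 14% figure follows from total token throughput: 3901.96 / 3422.09 ≈ 1.14. The token totals are also internally consistent with the workload: 192 requests × 3500 input tokens = 672,000 and 192 requests × 1500 output tokens = 288,000, matching both runs.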
  • Launch server commands
# node0
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 44 48 52 56 64 66 68 70 72 74 76 --cuda-graph-max-bs 78 \
--attention-backend fa3 \
--speculative-algo NEXTN --speculative-draft /data/models/DeepSeek-R1-NextN --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 6 \
--model-path /data/models/DeepSeek-R1/ \
--tp 16 \
--dist-init-addr 192.168.0.7:10240 \
--nnodes 2 --node-rank 0 --trust-remote-code \
--host 0.0.0.0 --port 8000 --enable-ep-moe --max-running-requests 78 --mem-fraction-static 0.75 --disable-chunked-prefix-cache

# node1
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 44 48 52 56 64 66 68 70 72 74 76 --cuda-graph-max-bs 78 \
--attention-backend fa3 \
--speculative-algo NEXTN --speculative-draft /data/models/DeepSeek-R1-NextN --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 6 \
--model-path /data/models/DeepSeek-R1/ \
--tp 16 \
--dist-init-addr 192.168.0.7:10240 \
--nnodes 2 --node-rank 1 --trust-remote-code \
--host 0.0.0.0 --port 8000 --enable-ep-moe --max-running-requests 78 --mem-fraction-static 0.75 --disable-chunked-prefix-cache

# client
python3 -m sglang.bench_serving --backend sglang \
            --dataset-name random \
            --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
            --random-input-len 3500 \
            --random-output-len 1500 \
            --random-range-ratio 1 \
            --request-rate 74 \
            --max-concurrency 74 \
            --num-prompts 296 \
            --host 0.0.0.0 --port 8000

Accuracy

MMLU test with mmlu/bench_sglang.py

100%|██████████████████████████████████████████████████████████████████| 1369/1369 [00:47<00:00, 28.74it/s]
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.852
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.925
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.890
Total latency: 47.643
Average accuracy: 0.865

Checklist

@@ -47,6 +55,8 @@

logger = logging.getLogger(__name__)

epmoe_use_deepgemm = get_bool_env_var("EPMOE_USE_DEEPGEMM")
Collaborator: We might import it directly.

Author: So, do you mean we should just replace EPMOE_USE_DEEPGEMM with _ENABLE_JIT_DEEPGEMM?

Collaborator: Yes, enabling _ENABLE_JIT_DEEPGEMM will make DeepGEMM the default configuration for EPMoE.

def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
    if use_deep_gemm and epmoe_use_deepgemm:
Collaborator: Why is the EPMoE DeepGEMM path disabled when use_deep_gemm is enabled?

Author: Otherwise forward_deepgemm would be called whenever use_deep_gemm is enabled, even without EPMOE_USE_DEEPGEMM set.

Collaborator: Are there any cases where Triton GEMM in forward_normal outperforms DeepGEMM?

Author: So far, I haven't found any case where the Triton GEMM in forward_normal outperforms DeepGEMM, but DeepGEMM may occupy more GPU memory.

Collaborator: We could remove epmoe_use_deepgemm and the corresponding environment variable EPMOE_USE_DEEPGEMM for the sake of clarity.

Author: OK, done.
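
With the extra flag removed as agreed above, the dispatch presumably reduces to the single JIT flag. A hedged sketch of the resulting condition (the merged code may differ):

# _ENABLE_JIT_DEEPGEMM is sglang's JIT-DeepGEMM flag (import path assumed)
def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
    # the JIT flag alone now selects the DeepGEMM path
    if _ENABLE_JIT_DEEPGEMM:
        return self.forward_deepgemm(hidden_states, router_logits)
    return self.forward_normal(hidden_states, router_logits)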

@TianQiLin666666 (Author): @xutizhou Could you please help me merge this?
