
Conversation

@MaoZiming (Member) commented Sep 27, 2025

Description

Please include a summary of the changes and the related issue.

This PR makes normal mode work on EFA. To test (4 nodes x 8 GPUs, one command per node):

export OMP_NUM_THREADS=4
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility
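
For reference, a minimal sketch (not the actual bench/test_internode.py) of how a torchrun-launched script typically picks up the rank layout these commands create; everything beyond the standard torchrun environment variables is illustrative:

import os
import torch
import torch.distributed as dist

# torchrun sets these for every process it spawns:
# RANK is global (0..31 for 4 nodes x 8 GPUs), LOCAL_RANK is per-node (0..7).
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Bind each process to its own GPU before creating the process group.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# With --nnodes=4 --nproc_per_node=8 this forms a single 32-rank (EP32) job;
# MASTER_ADDR/MASTER_PORT come from --master_addr/--master_port above.
if rank == 0:
    print(f"initialized {world_size} ranks")

dist.destroy_process_group()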

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@YangZhou1997 (Member) commented:

This is great!!!

@MaoZiming (Member, Author) commented:

@YangZhou1997 Hey, I got normal mode to work, and test_internode.py can run. I will clean it up tomorrow and make sure existing tests work.
To try:

export OMP_NUM_THREADS=4

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --master_addr=10.1.239.25 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=1 --num-experts=32 --test-ll-compatibility > node0.log 2>&1


torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
  --master_addr=10.1.239.25 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=1 --num-experts=32 --test-ll-compatibility > node1.log 2>&1
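
A small, hypothetical helper (not part of the repo) for pulling the tuner's summary lines back out of node0.log/node1.log after a run; the "[tuning] Best ..." format is what the tuner prints, as quoted in the next comment:

import re
import sys

# Scan the redirected logs for the tuner's summary lines, e.g.
# "[tuning] Best dispatch (BF16): ..." and "[tuning] Best combine: ...".
pattern = re.compile(r"\[tuning\] Best (dispatch|combine)")

for path in sys.argv[1:] or ["node0.log", "node1.log"]:
    with open(path) as f:
        for line in f:
            if pattern.search(line):
                print(f"{path}: {line.strip()}")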

@MaoZiming (Member, Author) commented:

EP32, with --num-tokens=128 --hidden=7168 --num-topk=8 --num-experts=288:

[tuning] Best dispatch (BF16): SMs 24, NVL chunk 44, RDMA chunk 28, transmit: 399.73 us, notify: 112.81 us, BW: 16.21 GB/s (RDMA), 34.11 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 2, RDMA chunk 12, transmit: 761.81 us, notify: 305.31 us, BW: 8.51 GB/s (RDMA), 17.90 GB/s (NVL)
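
Back-of-envelope check on these numbers, assuming the reported bandwidth is simply bytes moved over RDMA divided by the transmit time (GB = 1e9 bytes):

# Implied per-rank RDMA payload = reported BW * transmit time.
dispatch_bytes = 16.21e9 * 399.73e-6   # ~6.5e6 bytes (~6.5 MB)
combine_bytes = 8.51e9 * 761.81e-6     # ~6.5e6 bytes (~6.5 MB)

# Dispatch and combine imply roughly the same RDMA payload; combine just
# takes ~2x longer to move it, which matches its ~half bandwidth.
print(f"dispatch ~{dispatch_bytes / 1e6:.1f} MB, combine ~{combine_bytes / 1e6:.1f} MB")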

@YangZhou1997 (Member) commented:

Congrats! This is a very promising result! I notice the result was obtained with --num-tokens=128; with --num-tokens=4096 (the pretraining setting, where DeepEP got their 58 GB/s for EP32), we should be able to use larger batches and thus see even higher bandwidth.
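
A toy latency/bandwidth model of that amortization argument; all constants here are hypothetical and only meant to illustrate why a 32x larger batch should push effective bandwidth up:

def effective_bw(payload_bytes, peak_bw=50e9, fixed_overhead_s=300e-6):
    # Toy model: transfer time = payload / peak_bw plus a fixed per-round
    # overhead (notify/handshake). Both constants are made up for illustration.
    return payload_bytes / (payload_bytes / peak_bw + fixed_overhead_s)

small = 6.5e6        # roughly the implied RDMA payload at --num-tokens=128 (see above)
large = small * 32   # --num-tokens=4096; payload scales roughly linearly with tokens

print(f"128 tokens:  ~{effective_bw(small) / 1e9:.1f} GB/s")   # ~15 GB/s
print(f"4096 tokens: ~{effective_bw(large) / 1e9:.1f} GB/s")   # ~47 GB/s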

@YangZhou1997 (Member) commented Oct 5, 2025

Update on EP16 test:

--num-tokens=128 --test-ll-compatibility works:

Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 12, transmit: 236.71 us, notify: 168.37 us, BW: 15.44 GB/s (RDMA), 49.84 GB/s (NVL)
Best combine: SMs 24, NVL chunk 5, RDMA chunk 12, transmit: 906.94 us, notify: 297.65 us, BW: 4.03 GB/s (RDMA), 13.01 GB/s (NVL)

--num-tokens=256 --test-ll-compatibility works:

Best dispatch (BF16): SMs 24, NVL chunk 24, RDMA chunk 12, transmit: 322.00 us, notify: 119.52 us, BW: 22.66 GB/s (RDMA), 72.48 GB/s (NVL)
Best combine: SMs 24, NVL chunk 7, RDMA chunk 12, transmit: 2335.00 us, notify: 220.97 us, BW: 3.13 GB/s (RDMA), 10.00 GB/s (NVL)

--num-tokens=512 --test-ll-compatibility works for many tuning runs, but fails at the end; 1024 tokens behaves similarly (see the sketch at the end of this comment):

Too many atomic operations: 1054 > 1024

--num-tokens=4096 --test-ll-compatibility works for a few tuning runs, but fails in a couple of runs and shows low bandwidth:

  • Node 0: [screenshot of node 0 output]
  • Node 1: [screenshot of node 1 output]

Removing --test-ll-compatibility leads to timeout errors:
[screenshot of timeout errors]
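
On the "Too many atomic operations: 1054 > 1024" failure: the message suggests a fixed per-run budget of RDMA atomics that the requested configuration exceeds once the token count grows. A purely hypothetical sketch of that class of guard (names and values are illustrative, not DeepEP's actual code):

# Hypothetical: if each RDMA chunk/signal needs one atomic, the required
# count grows with tokens while the queue budget stays fixed.
MAX_RDMA_ATOMICS = 1024  # fixed budget baked into the buffer layout

def check_atomic_budget(required_atomics: int) -> None:
    if required_atomics > MAX_RDMA_ATOMICS:
        raise RuntimeError(
            f"Too many atomic operations: {required_atomics} > {MAX_RDMA_ATOMICS}")

# 512-token runs apparently land just over the budget (1054 > 1024), so either
# the budget or the per-token atomic count would have to change.
try:
    check_atomic_budget(1054)
except RuntimeError as e:
    print(e)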
