
Conversation

@MaoZiming (Member) commented Sep 27, 2025

Description

Please include a summary of the changes and the related issue.

This PR makes normal mode work on EFA. To test (4 nodes x 8 GPUs, one command per node):

export OMP_NUM_THREADS=4
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 \
  --master_addr=10.1.59.30 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility
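
For reference, a minimal sketch (not the actual bench/test_internode.py) of how a torchrun-launched script typically picks up the rank layout these commands create; everything beyond the standard torchrun environment variables is illustrative:

import os
import torch
import torch.distributed as dist

# torchrun sets these for every process it spawns:
# RANK is global (0..31 for 4 nodes x 8 GPUs), LOCAL_RANK is per-node (0..7).
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Bind each process to its own GPU before creating the process group.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# With --nnodes=4 --nproc_per_node=8 this forms a single 32-rank (EP32) job;
# MASTER_ADDR/MASTER_PORT come from --master_addr/--master_port above.
if rank == 0:
    print(f"initialized {world_size} ranks")

dist.destroy_process_group()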

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@YangZhou1997 (Member) commented:

This is great!!!

@MaoZiming (Member, Author) commented:

@YangZhou1997 Hey, I got normal mode to work, and test_internode.py can run. I will clean it up tomorrow and make sure existing tests work.
To try:

export OMP_NUM_THREADS=4

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --master_addr=10.1.239.25 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=1 --num-experts=32 --test-ll-compatibility > node0.log 2>&1


torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
  --master_addr=10.1.239.25 --master_port=12355 \
  bench/test_internode.py --num-tokens=128 \
  --hidden=7168 --num-topk=1 --num-experts=32 --test-ll-compatibility > node1.log 2>&1
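
A small, hypothetical helper (not part of the repo) for pulling the tuner's summary lines back out of node0.log/node1.log after a run; the "[tuning] Best ..." format is what the tuner prints, as quoted in the next comment:

import re
import sys

# Scan the redirected logs for the tuner's summary lines, e.g.
# "[tuning] Best dispatch (BF16): ..." and "[tuning] Best combine: ...".
pattern = re.compile(r"\[tuning\] Best (dispatch|combine)")

for path in sys.argv[1:] or ["node0.log", "node1.log"]:
    with open(path) as f:
        for line in f:
            if pattern.search(line):
                print(f"{path}: {line.strip()}")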

@MaoZiming (Member, Author) commented:

EP32, with --num-tokens=128 --hidden=7168 --num-topk=8 --num-experts=288:

[tuning] Best dispatch (BF16): SMs 24, NVL chunk 44, RDMA chunk 28, transmit: 399.73 us, notify: 112.81 us, BW: 16.21 GB/s (RDMA), 34.11 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 2, RDMA chunk 12, transmit: 761.81 us, notify: 305.31 us, BW: 8.51 GB/s (RDMA), 17.90 GB/s (NVL)
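
Back-of-envelope check on these numbers, assuming the reported bandwidth is simply bytes moved over RDMA divided by the transmit time (GB = 1e9 bytes):

# Implied per-rank RDMA payload = reported BW * transmit time.
dispatch_bytes = 16.21e9 * 399.73e-6   # ~6.5e6 bytes (~6.5 MB)
combine_bytes = 8.51e9 * 761.81e-6     # ~6.5e6 bytes (~6.5 MB)

# Dispatch and combine imply roughly the same RDMA payload; combine just
# takes ~2x longer to move it, which matches its ~half bandwidth.
print(f"dispatch ~{dispatch_bytes / 1e6:.1f} MB, combine ~{combine_bytes / 1e6:.1f} MB")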

@YangZhou1997 (Member) commented:

Congrats! This is a very promising result! I notice the result was obtained with --num-tokens=128; with --num-tokens=4096 (the pretraining setting, where DeepEP got their 58 GB/s for EP32), we should be able to use larger batches and thus see even higher bandwidth.
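
A toy latency/bandwidth model of that amortization argument; all constants here are hypothetical and only meant to illustrate why a 32x larger batch should push effective bandwidth up:

def effective_bw(payload_bytes, peak_bw=50e9, fixed_overhead_s=300e-6):
    # Toy model: transfer time = payload / peak_bw plus a fixed per-round
    # overhead (notify/handshake). Both constants are made up for illustration.
    return payload_bytes / (payload_bytes / peak_bw + fixed_overhead_s)

small = 6.5e6        # roughly the implied RDMA payload at --num-tokens=128 (see above)
large = small * 32   # --num-tokens=4096; payload scales roughly linearly with tokens

print(f"128 tokens:  ~{effective_bw(small) / 1e9:.1f} GB/s")   # ~15 GB/s
print(f"4096 tokens: ~{effective_bw(large) / 1e9:.1f} GB/s")   # ~47 GB/s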

@YangZhou1997 (Member) commented Oct 5, 2025

Update on EP16 test:

--num-tokens=128 --test-ll-compatibility works:

Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 12, transmit: 236.71 us, notify: 168.37 us, BW: 15.44 GB/s (RDMA), 49.84 GB/s (NVL)
Best combine: SMs 24, NVL chunk 5, RDMA chunk 12, transmit: 906.94 us, notify: 297.65 us, BW: 4.03 GB/s (RDMA), 13.01 GB/s (NVL)

--num-tokens=256 --test-ll-compatibility works:

Best dispatch (BF16): SMs 24, NVL chunk 24, RDMA chunk 12, transmit: 322.00 us, notify: 119.52 us, BW: 22.66 GB/s (RDMA), 72.48 GB/s (NVL)
Best combine: SMs 24, NVL chunk 7, RDMA chunk 12, transmit: 2335.00 us, notify: 220.97 us, BW: 3.13 GB/s (RDMA), 10.00 GB/s (NVL)

--num-tokens=512 --test-ll-compatibility works for many tuning runs, but fails at the end; 1024 tokens behaves similarly (see the sketch at the end of this comment):

Too many atomic operations: 1054 > 1024

--num-tokens=4096 --test-ll-compatibility works for a few tuning runs, but fails in a couple of runs and shows low bandwidth:

  • Node 0: [screenshot of node 0 output]
  • Node 1: [screenshot of node 1 output]

Removing --test-ll-compatibility leads to timeout errors:
[screenshot of timeout errors]
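
On the "Too many atomic operations: 1054 > 1024" failure: the message suggests a fixed per-run budget of RDMA atomics that the requested configuration exceeds once the token count grows. A purely hypothetical sketch of that class of guard (names and values are illustrative, not DeepEP's actual code):

# Hypothetical: if each RDMA chunk/signal needs one atomic, the required
# count grows with tokens while the queue budget stays fixed.
MAX_RDMA_ATOMICS = 1024  # fixed budget baked into the buffer layout

def check_atomic_budget(required_atomics: int) -> None:
    if required_atomics > MAX_RDMA_ATOMICS:
        raise RuntimeError(
            f"Too many atomic operations: {required_atomics} > {MAX_RDMA_ATOMICS}")

# 512-token runs apparently land just over the budget (1054 > 1024), so either
# the budget or the per-token atomic count would have to change.
try:
    check_atomic_budget(1054)
except RuntimeError as e:
    print(e)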
