Description
I am seeing bad performance for single-node TL/CUDA allgather on GPUs connected through NVLink:
- TL/CUDA is far slower than TL/NCCL: at the largest size measured below (4 MB per rank), ucc_perftest reports ~3.4 GB/s with TL/CUDA vs ~232 GB/s with TL/NCCL.
- Performance degrades when moving from V100 to H100: small-message osu_allgather latency grows from ~1.3 ms to ~2.2 ms.
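For a clean A/B between the two paths, the TL that actually executes the collective can be logged and pinned explicitly. A minimal sketch, assuming UCC's usual TL names for UCC_CL_BASIC_TLS (both variables also appear in the reproducers below; the values "cuda,ucp" and "nccl,ucp" are assumptions, with ucp kept as a fallback):
# Log which TL handles the allgather, then pin TL/CUDA vs TL/NCCL explicitly.
# TL-list values are assumed, not taken from the runs below.
mpirun -np 8 -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=cuda,ucp \
    /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
mpirun -np 8 -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=nccl,ucp \
    /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576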
On H100
Setup: DGX 8x H100, one node
osu benchmark: osu_iallgather
# OSU MPI-CUDA Non-blocking Allgather Latency Test v5.3
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Overlap(%)
1 4624.56 2375.31 1.06 0.00 2248.07 2240.93 0.00
2 4593.15 2359.63 1.03 0.00 2232.37 2225.42 0.00
4 4591.75 2360.01 1.01 0.00 2230.61 2225.75 0.00
8 4579.67 2351.04 1.00 0.00 2227.52 2216.69 0.00
16 4583.98 2361.42 1.02 0.00 2221.42 2227.04 0.20
32 4589.30 2359.94 1.02 0.00 2228.22 2224.22 0.00
64 4576.54 2351.81 1.06 0.00 2223.55 2218.14 0.00
128 4571.41 2350.44 1.09 0.00 2219.76 2216.57 0.00
256 4575.16 2351.01 1.05 0.00 2222.99 2217.75 0.00
512 4566.17 2348.72 1.04 0.00 2216.29 2213.87 0.00
1024 4575.73 2352.01 1.04 0.00 2222.56 2218.21 0.00
2048 4565.34 2344.31 1.05 0.00 2219.87 2211.15 0.00
4096 4591.18 2354.36 1.04 0.00 2235.66 2220.32 0.00
8192 4570.86 2348.94 1.05 0.00 2220.76 2214.09 0.00
16384 4583.58 2359.64 1.03 0.00 2222.79 2225.24 0.06
32768 4583.11 2361.80 1.09 0.00 2220.10 2227.92 0.30
65536 4628.34 2387.93 1.09 0.00 2239.20 2252.50 0.54
131072 4653.04 2392.25 1.03 0.00 2259.65 2257.66 0.00
262144 7049.76 3620.83 1.05 0.00 3427.77 3417.02 0.00
524288 11883.34 6121.60 1.05 0.00 5760.58 5777.57 0.27
1048576 21578.54 11117.81 1.13 0.00 10459.48 10492.01 0.30
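For reference, the Overlap(%) column is derived from the other columns as

Overlap(%) = 100 * max(0, 1 - (Overall - Compute) / Pure Comm.)

which matches the rows above (e.g. at 16 B: 1 - (4583.98 - 2361.42) / 2227.04 ≈ 0.2%); the ~0% values mean the non-blocking allgather achieves essentially no communication/compute overlap here.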
osu benchmark: osu_allgather
# OSU MPI-CUDA Allgather Latency Test v5.3
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 2262.29 2261.51 2263.22 1000
2 2257.14 2256.17 2257.86 1000
4 2240.42 2239.81 2241.13 1000
8 2236.42 2235.77 2237.15 1000
16 2244.96 2244.25 2245.61 1000
32 2244.62 2244.11 2245.31 1000
64 2244.98 2244.25 2245.60 1000
128 2237.84 2237.12 2238.45 1000
256 2241.49 2240.86 2242.20 1000
512 2242.34 2241.76 2243.01 1000
1024 2238.28 2237.66 2239.01 1000
2048 2236.59 2235.95 2237.33 1000
4096 2235.79 2234.98 2236.38 1000
8192 2239.08 2238.47 2239.81 1000
16384 2240.43 2239.80 2241.15 100
32768 2230.02 2229.37 2230.73 100
65536 2261.99 2261.32 2262.67 100
131072 2274.25 2273.67 2274.94 100
262144 3441.64 3440.99 3442.40 100
524288 5782.01 5781.34 5782.67 100
1048576 10538.80 10538.15 10539.57 100
nccl-tests
> mpirun -np 8 nccl-tests/build/all_gather_perf -b 1 -e 1048576 -f 2
# nThread 1 nGpus 1 minBytes 1 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 16092 on hgx-isr1-026 device 0 [0x19] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 16093 on hgx-isr1-026 device 1 [0x3b] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 16094 on hgx-isr1-026 device 2 [0x4c] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 16095 on hgx-isr1-026 device 3 [0x5d] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 16096 on hgx-isr1-026 device 4 [0x9b] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 16097 on hgx-isr1-026 device 5 [0xbb] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 16098 on hgx-isr1-026 device 6 [0xcb] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 16099 on hgx-isr1-026 device 7 [0xdb] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 0.21 0.00 0.00 0 0.15 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
128 4 float none -1 16.58 0.01 0.01 0 10.66 0.01 0.01 0
256 8 float none -1 10.65 0.02 0.02 0 10.74 0.02 0.02 0
512 16 float none -1 10.77 0.05 0.04 0 10.75 0.05 0.04 0
1024 32 float none -1 10.93 0.09 0.08 0 10.87 0.09 0.08 0
2048 64 float none -1 10.85 0.19 0.17 0 10.93 0.19 0.16 0
4096 128 float none -1 11.09 0.37 0.32 0 11.12 0.37 0.32 0
8192 256 float none -1 11.18 0.73 0.64 0 11.13 0.74 0.64 0
16384 512 float none -1 11.76 1.39 1.22 0 11.59 1.41 1.24 0
32768 1024 float none -1 13.73 2.39 2.09 0 13.35 2.45 2.15 0
65536 2048 float none -1 13.98 4.69 4.10 0 13.62 4.81 4.21 0
131072 4096 float none -1 14.10 9.30 8.14 0 13.77 9.52 8.33 0
262144 8192 float none -1 14.47 18.12 15.85 0 14.13 18.56 16.24 0
524288 16384 float none -1 14.46 36.25 31.71 0 14.19 36.96 32.34 0
1048576 32768 float none -1 18.03 58.14 50.88 0 17.64 59.44 52.01 0
ucc_perftest
- with TL/CUDA
# mpirun -np 8 /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
Collective: Allgather
Memory type: cuda
Datatype: float32
Reduction: N/A
Inplace: 0
Warmup:
small 100
large 20
Iterations:
small 1000
large 200
Count Size Time, us Bandwidth, GB/s
avg min max avg max min
1 4 265.24 264.53 266.10 0.00 0.00 0.00
2 8 261.62 260.94 262.49 0.00 0.00 0.00
4 16 262.19 261.49 263.08 0.00 0.00 0.00
8 32 262.30 261.59 263.21 0.00 0.00 0.00
16 64 264.73 264.03 265.57 0.00 0.00 0.00
32 128 261.92 260.97 263.03 0.00 0.00 0.00
64 256 264.03 263.32 264.88 0.01 0.01 0.01
128 512 262.26 261.52 263.07 0.01 0.01 0.01
256 1024 266.98 266.28 267.83 0.03 0.03 0.03
512 2048 262.37 261.72 263.23 0.05 0.05 0.05
1024 4096 260.95 260.20 261.72 0.11 0.11 0.11
2048 8192 263.06 262.34 263.88 0.22 0.22 0.22
4096 16384 264.29 263.58 265.14 0.43 0.44 0.43
8192 32768 270.65 269.98 271.51 0.85 0.85 0.84
16384 65536 279.28 278.55 280.13 1.64 1.65 1.64
32768 131072 303.39 302.69 304.15 3.02 3.03 3.02
65536 262144 567.98 567.13 568.86 3.23 3.24 3.23
131072 524288 1116.57 1115.74 1117.38 3.29 3.29 3.28
262144 1048576 2178.69 2177.89 2179.49 3.37 3.37 3.37
524288 2097152 4320.53 4319.85 4321.51 3.40 3.40 3.40
1048576 4194304 8602.19 8601.19 8602.90 3.41 3.41 3.41
- with TL/NCCL
# mpirun -np 8 -x UCC_TL_NCCL_TUNE=inf /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
Collective: Allgather
Memory type: cuda
Datatype: float32
Reduction: N/A
Inplace: 0
Warmup:
small 100
large 20
Iterations:
small 1000
large 200
Count Size Time, us Bandwidth, GB/s
avg min max avg max min
1 4 18.35 17.36 18.77 0.00 0.00 0.00
2 8 18.58 17.54 19.00 0.00 0.00 0.00
4 16 18.26 17.27 18.65 0.01 0.01 0.01
8 32 18.19 17.25 18.59 0.01 0.01 0.01
16 64 18.36 18.11 18.63 0.02 0.02 0.02
32 128 18.25 17.31 18.68 0.05 0.05 0.05
64 256 18.32 17.35 18.75 0.10 0.10 0.10
128 512 18.40 17.44 18.84 0.19 0.21 0.19
256 1024 18.64 17.70 19.02 0.38 0.41 0.38
512 2048 19.22 18.24 19.67 0.75 0.79 0.73
1024 4096 21.01 20.28 21.60 1.36 1.41 1.33
2048 8192 21.72 20.99 22.37 2.64 2.73 2.56
4096 16384 21.61 20.90 22.20 5.31 5.49 5.17
8192 32768 22.06 21.36 22.72 10.40 10.74 10.10
16384 65536 22.36 21.62 22.98 20.52 21.21 19.96
32768 131072 25.83 25.11 26.53 35.52 36.54 34.58
65536 262144 31.39 30.81 31.82 58.46 59.56 57.66
131072 524288 41.95 41.52 42.49 87.49 88.40 86.37
262144 1048576 52.05 51.45 52.52 141.03 142.66 139.76
524288 2097152 75.07 74.11 75.80 195.55 198.08 193.67
1048576 4194304 126.32 125.00 132.75 232.43 234.87 221.16
On V100
- Setup: DGX 8x V100, one node
- osu benchmark: osu_iallgather
# OSU MPI-CUDA Non-blocking Allgather Latency Test v5.3
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Overlap(%)
1 2640.04 1364.66 1.88 0.00 1273.29 1317.53 3.20
2 2617.64 1349.89 1.91 0.00 1265.63 1303.53 2.74
4 2608.17 1343.79 1.91 0.00 1262.27 1297.65 2.56
8 2612.70 1347.53 2.00 0.00 1262.97 1301.21 2.77
16 2629.03 1351.41 1.89 0.00 1275.52 1304.94 2.09
32 2623.70 1353.98 2.05 0.00 1267.47 1307.64 2.90
64 2610.40 1348.38 2.10 0.00 1259.70 1301.98 3.07
128 2604.79 1346.67 2.10 0.00 1255.81 1307.71 3.79
256 2602.58 1344.22 2.10 0.00 1256.06 1298.11 3.06
512 2613.51 1351.70 2.08 0.00 1259.52 1305.29 3.33
1024 2608.89 1346.59 2.03 0.00 1260.06 1300.46 2.93
2048 2617.24 1347.61 1.93 0.00 1267.49 1301.33 2.44
4096 2612.47 1345.20 1.91 0.00 1265.15 1299.14 2.45
8192 2614.04 1349.67 2.06 0.00 1262.11 1303.37 2.99
16384 2641.96 1352.58 1.82 0.00 1287.36 1305.98 1.27
32768 2640.10 1363.25 2.03 0.00 1274.62 1316.47 3.01
65536 2662.18 1366.77 2.03 0.00 1293.17 1319.48 1.82
131072 2726.72 1412.95 2.07 0.00 1311.51 1364.44 3.71
262144 4164.28 2133.59 2.09 0.00 2028.40 2060.84 1.46
524288 7086.97 3622.83 2.34 0.00 3461.57 3499.48 1.01
1048576 12917.70 6579.70 2.31 0.00 6335.49 6355.80 0.28
- osu benchmark: osu_allgather
# OSU MPI-CUDA Allgather Latency Test v5.3
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1337.31 1336.51 1338.02 1000
2 1319.25 1318.69 1320.30 1000
4 1319.52 1319.04 1320.45 1000
8 1318.86 1318.44 1319.47 1000
16 1317.45 1317.05 1318.04 1000
32 1318.12 1317.70 1318.70 1000
64 1316.61 1316.13 1317.17 1000
128 1320.49 1320.06 1321.07 1000
256 1321.56 1321.12 1322.18 1000
512 1317.89 1317.47 1318.51 1000
1024 1321.17 1320.71 1321.81 1000
2048 1318.51 1318.06 1319.11 1000
4096 1319.92 1319.48 1320.53 1000
8192 1326.04 1325.58 1326.65 1000
16384 1334.55 1334.13 1335.19 100
32768 1341.45 1340.93 1342.17 100
65536 1355.90 1355.49 1356.54 100
131072 1369.65 1369.19 1370.40 100
262144 2080.14 2079.76 2080.74 100
524288 3534.73 3534.31 3535.26 100
1048576 6413.02 6412.56 6413.62 100
- nccl-tests
# nThread 1 nGpus 1 minBytes 1 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 10094 on dgx1v-loki-23 device 0 [0x06] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 10095 on dgx1v-loki-23 device 1 [0x07] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 10096 on dgx1v-loki-23 device 2 [0x0a] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 10097 on dgx1v-loki-23 device 3 [0x0b] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 10098 on dgx1v-loki-23 device 4 [0x85] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 10099 on dgx1v-loki-23 device 5 [0x86] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 10100 on dgx1v-loki-23 device 6 [0x89] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 10105 on dgx1v-loki-23 device 7 [0x8a] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 0.22 0.00 0.00 0 0.19 0.00 0.00 0
0 0 float none -1 0.20 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.19 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
128 4 float none -1 11.69 0.01 0.01 0 11.77 0.01 0.01 0
256 8 float none -1 11.79 0.02 0.02 0 11.84 0.02 0.02 0
512 16 float none -1 12.02 0.04 0.04 0 12.09 0.04 0.04 0
1024 32 float none -1 12.32 0.08 0.07 0 12.05 0.08 0.07 0
2048 64 float none -1 12.76 0.16 0.14 0 12.40 0.17 0.14 0
4096 128 float none -1 13.35 0.31 0.27 0 12.23 0.33 0.29 0
8192 256 float none -1 13.30 0.62 0.54 0 12.96 0.63 0.55 0
16384 512 float none -1 16.86 0.97 0.85 0 15.62 1.05 0.92 0
32768 1024 float none -1 22.43 1.46 1.28 0 20.76 1.58 1.38 0
65536 2048 float none -1 22.90 2.86 2.50 0 21.84 3.00 2.63 0
131072 4096 float none -1 23.91 5.48 4.80 0 22.11 5.93 5.19 0
262144 8192 float none -1 23.87 10.98 9.61 0 22.28 11.76 10.29 0
524288 16384 float none -1 29.28 17.91 15.67 0 28.46 18.42 16.12 0
1048576 32768 float none -1 43.12 24.32 21.28 0 40.64 25.80 22.57 0
Reproducer
osu-micro-benchmarks, nccl-tests, and ucc_perftest
docker run \
--rm --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--cap-add=SYS_ADMIN --cap-add=SYS_PTRACE \
--privileged \
--device=/dev/infiniband \
--gpus all \
gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
/bin/bash -c '
apt-get update
apt-get install -y automake autoconf
# install osu-micro-benchmarks
git clone https://github.com/forresti/osu-micro-benchmarks.git
cd osu-micro-benchmarks
autoreconf -f -i
./configure --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64
make -j
make -j install
cd ..
# install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/lib/x86_64-linux-gnu MPI=1 MPI_HOME=/usr/local/mpi
cd ..
# run osu-benchmark test
mpirun \
-np 8 \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
--mca coll_ucc_priority 100 \
osu-micro-benchmarks/mpi/collective/osu_iallgather \
-d cuda -f
mpirun \
-np 8 \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
--mca coll_ucc_priority 100 \
osu-micro-benchmarks/mpi/collective/osu_allgather \
-d cuda -f
# run nccl-tests
mpirun \
-np 8 \
nccl-tests/build/all_gather_perf \
-b 1 -e 1048576 -f 2
# run ucc perftest with TL/CUDA
mpirun \
-np 8 \
/opt/hpcx/ucc/bin/ucc_perftest \
-c allgather -m cuda -T -F -b 1 -e 1048576
# run ucc perftest with TL/NCCL
mpirun \
-np 8 \
-x UCC_TL_NCCL_TUNE=inf \
/opt/hpcx/ucc/bin/ucc_perftest \
-c allgather -m cuda -T -F -b 1 -e 1048576
'
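To double-check that the osu runs above actually go through UCC (and its TL/CUDA path) rather than silently falling back, the same command can be rerun with UCC's collective trace enabled; a sketch reusing the MCA settings from the script:
# Hypothetical verification run: UCC_COLL_TRACE=info logs the TL chosen per collective.
mpirun \
    -np 8 \
    -x UCC_COLL_TRACE=info \
    --mca coll_ucc_enable 1 \
    --mca coll_hcoll_enable 0 \
    --mca coll_ucc_priority 100 \
    osu-micro-benchmarks/mpi/collective/osu_allgather \
    -d cuda -f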
nvFuser overlap benchmark
nvFuser comm/compute overlap experiment, compared against NCCL. In this experiment we post a single allgather followed by a single matmul op. After warmup and averaging across multiple iterations, NCCL's latency is far better than UCC's:
- NCCL latency: 4.86517 ms
- UCC latency: 263.535 ms
Reproducer:
docker run \
--rm --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--cap-add=SYS_ADMIN --cap-add=SYS_PTRACE \
--privileged \
--device=/dev/infiniband \
--gpus all \
gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
/bin/bash -c '
git clone https://github.com/samnordmann/Fuser.git
cd Fuser
git checkout origin/overlap_bench/first_experiments
git submodule sync --recursive
git submodule update --init --recursive
export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"
python setup.py --build-with-ucc --no-benchmark --no-python develop
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=^mlx5 \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8
'
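The run above only excludes the mlx5 TL, so UCC still chooses among the remaining TLs. To attribute the latency gap to a specific TL, the same benchmark can be rerun with the TL list pinned; a sketch, with the TL-list values ("cuda,ucp" / "nccl,ucp") being assumptions:
# Hypothetical A/B of the same benchmark with the TL pinned (TL names assumed).
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=cuda,ucp \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=nccl,ucp \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8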