Description
I am seeing bad performance for single-node TL/CUDA allgather on GPUs connected through NVLink:
- TL/CUDA is far slower than TL/NCCL: at the largest size measured below (4 MB per rank), ucc_perftest reports ~3.4 GB/s with TL/CUDA vs ~232 GB/s with TL/NCCL.
- Performance degrades when moving from V100 to H100: small-message osu_allgather latency grows from ~1.3 ms to ~2.2 ms.
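For a clean A/B between the two paths, the TL that actually executes the collective can be logged and pinned explicitly. A minimal sketch, assuming UCC's usual TL names for UCC_CL_BASIC_TLS (both variables also appear in the reproducers below; the values "cuda,ucp" and "nccl,ucp" are assumptions, with ucp kept as a fallback):
# Log which TL handles the allgather, then pin TL/CUDA vs TL/NCCL explicitly.
# TL-list values are assumed, not taken from the runs below.
mpirun -np 8 -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=cuda,ucp \
    /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
mpirun -np 8 -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=nccl,ucp \
    /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576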
On H100
Setup: DGX 8x H100, one node
osu benchmark: osu_iallgather
# OSU MPI-CUDA Non-blocking Allgather Latency Test v5.3
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Overlap(%)
1 4624.56 2375.31 1.06 0.00 2248.07 2240.93 0.00
2 4593.15 2359.63 1.03 0.00 2232.37 2225.42 0.00
4 4591.75 2360.01 1.01 0.00 2230.61 2225.75 0.00
8 4579.67 2351.04 1.00 0.00 2227.52 2216.69 0.00
16 4583.98 2361.42 1.02 0.00 2221.42 2227.04 0.20
32 4589.30 2359.94 1.02 0.00 2228.22 2224.22 0.00
64 4576.54 2351.81 1.06 0.00 2223.55 2218.14 0.00
128 4571.41 2350.44 1.09 0.00 2219.76 2216.57 0.00
256 4575.16 2351.01 1.05 0.00 2222.99 2217.75 0.00
512 4566.17 2348.72 1.04 0.00 2216.29 2213.87 0.00
1024 4575.73 2352.01 1.04 0.00 2222.56 2218.21 0.00
2048 4565.34 2344.31 1.05 0.00 2219.87 2211.15 0.00
4096 4591.18 2354.36 1.04 0.00 2235.66 2220.32 0.00
8192 4570.86 2348.94 1.05 0.00 2220.76 2214.09 0.00
16384 4583.58 2359.64 1.03 0.00 2222.79 2225.24 0.06
32768 4583.11 2361.80 1.09 0.00 2220.10 2227.92 0.30
65536 4628.34 2387.93 1.09 0.00 2239.20 2252.50 0.54
131072 4653.04 2392.25 1.03 0.00 2259.65 2257.66 0.00
262144 7049.76 3620.83 1.05 0.00 3427.77 3417.02 0.00
524288 11883.34 6121.60 1.05 0.00 5760.58 5777.57 0.27
1048576 21578.54 11117.81 1.13 0.00 10459.48 10492.01 0.30
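For reference, the Overlap(%) column is derived from the other columns as

Overlap(%) = 100 * max(0, 1 - (Overall - Compute) / Pure Comm.)

which matches the rows above (e.g. at 16 B: 1 - (4583.98 - 2361.42) / 2227.04 ≈ 0.2%); the ~0% values mean the non-blocking allgather achieves essentially no communication/compute overlap here.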
osu benchmark: osu_allgather
# OSU MPI-CUDA Allgather Latency Test v5.3
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 2262.29 2261.51 2263.22 1000
2 2257.14 2256.17 2257.86 1000
4 2240.42 2239.81 2241.13 1000
8 2236.42 2235.77 2237.15 1000
16 2244.96 2244.25 2245.61 1000
32 2244.62 2244.11 2245.31 1000
64 2244.98 2244.25 2245.60 1000
128 2237.84 2237.12 2238.45 1000
256 2241.49 2240.86 2242.20 1000
512 2242.34 2241.76 2243.01 1000
1024 2238.28 2237.66 2239.01 1000
2048 2236.59 2235.95 2237.33 1000
4096 2235.79 2234.98 2236.38 1000
8192 2239.08 2238.47 2239.81 1000
16384 2240.43 2239.80 2241.15 100
32768 2230.02 2229.37 2230.73 100
65536 2261.99 2261.32 2262.67 100
131072 2274.25 2273.67 2274.94 100
262144 3441.64 3440.99 3442.40 100
524288 5782.01 5781.34 5782.67 100
1048576 10538.80 10538.15 10539.57 100
nccl-tests
> mpirun -np 8 nccl-tests/build/all_gather_perf -b 1 -e 1048576 -f 2
# nThread 1 nGpus 1 minBytes 1 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 16092 on hgx-isr1-026 device 0 [0x19] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 16093 on hgx-isr1-026 device 1 [0x3b] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 16094 on hgx-isr1-026 device 2 [0x4c] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 16095 on hgx-isr1-026 device 3 [0x5d] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 16096 on hgx-isr1-026 device 4 [0x9b] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 16097 on hgx-isr1-026 device 5 [0xbb] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 16098 on hgx-isr1-026 device 6 [0xcb] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 16099 on hgx-isr1-026 device 7 [0xdb] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 0.21 0.00 0.00 0 0.15 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
0 0 float none -1 0.14 0.00 0.00 0 0.14 0.00 0.00 0
128 4 float none -1 16.58 0.01 0.01 0 10.66 0.01 0.01 0
256 8 float none -1 10.65 0.02 0.02 0 10.74 0.02 0.02 0
512 16 float none -1 10.77 0.05 0.04 0 10.75 0.05 0.04 0
1024 32 float none -1 10.93 0.09 0.08 0 10.87 0.09 0.08 0
2048 64 float none -1 10.85 0.19 0.17 0 10.93 0.19 0.16 0
4096 128 float none -1 11.09 0.37 0.32 0 11.12 0.37 0.32 0
8192 256 float none -1 11.18 0.73 0.64 0 11.13 0.74 0.64 0
16384 512 float none -1 11.76 1.39 1.22 0 11.59 1.41 1.24 0
32768 1024 float none -1 13.73 2.39 2.09 0 13.35 2.45 2.15 0
65536 2048 float none -1 13.98 4.69 4.10 0 13.62 4.81 4.21 0
131072 4096 float none -1 14.10 9.30 8.14 0 13.77 9.52 8.33 0
262144 8192 float none -1 14.47 18.12 15.85 0 14.13 18.56 16.24 0
524288 16384 float none -1 14.46 36.25 31.71 0 14.19 36.96 32.34 0
1048576 32768 float none -1 18.03 58.14 50.88 0 17.64 59.44 52.01 0
ucc_perftest
- with TL/CUDA
# mpirun -np 8 /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
Collective: Allgather
Memory type: cuda
Datatype: float32
Reduction: N/A
Inplace: 0
Warmup:
small 100
large 20
Iterations:
small 1000
large 200
Count Size Time, us Bandwidth, GB/s
avg min max avg max min
1 4 265.24 264.53 266.10 0.00 0.00 0.00
2 8 261.62 260.94 262.49 0.00 0.00 0.00
4 16 262.19 261.49 263.08 0.00 0.00 0.00
8 32 262.30 261.59 263.21 0.00 0.00 0.00
16 64 264.73 264.03 265.57 0.00 0.00 0.00
32 128 261.92 260.97 263.03 0.00 0.00 0.00
64 256 264.03 263.32 264.88 0.01 0.01 0.01
128 512 262.26 261.52 263.07 0.01 0.01 0.01
256 1024 266.98 266.28 267.83 0.03 0.03 0.03
512 2048 262.37 261.72 263.23 0.05 0.05 0.05
1024 4096 260.95 260.20 261.72 0.11 0.11 0.11
2048 8192 263.06 262.34 263.88 0.22 0.22 0.22
4096 16384 264.29 263.58 265.14 0.43 0.44 0.43
8192 32768 270.65 269.98 271.51 0.85 0.85 0.84
16384 65536 279.28 278.55 280.13 1.64 1.65 1.64
32768 131072 303.39 302.69 304.15 3.02 3.03 3.02
65536 262144 567.98 567.13 568.86 3.23 3.24 3.23
131072 524288 1116.57 1115.74 1117.38 3.29 3.29 3.28
262144 1048576 2178.69 2177.89 2179.49 3.37 3.37 3.37
524288 2097152 4320.53 4319.85 4321.51 3.40 3.40 3.40
1048576 4194304 8602.19 8601.19 8602.90 3.41 3.41 3.41
- with TL/NCCL
# mpirun -np 8 -x UCC_TL_NCCL_TUNE=inf /opt/hpcx/ucc/bin/ucc_perftest -c allgather -m cuda -T -F -b 1 -e 1048576
Collective: Allgather
Memory type: cuda
Datatype: float32
Reduction: N/A
Inplace: 0
Warmup:
small 100
large 20
Iterations:
small 1000
large 200
Count Size Time, us Bandwidth, GB/s
avg min max avg max min
1 4 18.35 17.36 18.77 0.00 0.00 0.00
2 8 18.58 17.54 19.00 0.00 0.00 0.00
4 16 18.26 17.27 18.65 0.01 0.01 0.01
8 32 18.19 17.25 18.59 0.01 0.01 0.01
16 64 18.36 18.11 18.63 0.02 0.02 0.02
32 128 18.25 17.31 18.68 0.05 0.05 0.05
64 256 18.32 17.35 18.75 0.10 0.10 0.10
128 512 18.40 17.44 18.84 0.19 0.21 0.19
256 1024 18.64 17.70 19.02 0.38 0.41 0.38
512 2048 19.22 18.24 19.67 0.75 0.79 0.73
1024 4096 21.01 20.28 21.60 1.36 1.41 1.33
2048 8192 21.72 20.99 22.37 2.64 2.73 2.56
4096 16384 21.61 20.90 22.20 5.31 5.49 5.17
8192 32768 22.06 21.36 22.72 10.40 10.74 10.10
16384 65536 22.36 21.62 22.98 20.52 21.21 19.96
32768 131072 25.83 25.11 26.53 35.52 36.54 34.58
65536 262144 31.39 30.81 31.82 58.46 59.56 57.66
131072 524288 41.95 41.52 42.49 87.49 88.40 86.37
262144 1048576 52.05 51.45 52.52 141.03 142.66 139.76
524288 2097152 75.07 74.11 75.80 195.55 198.08 193.67
1048576 4194304 126.32 125.00 132.75 232.43 234.87 221.16
On V100
- Setup: DGX 8x V100, one node
- osu benchmark: osu_iallgather
# OSU MPI-CUDA Non-blocking Allgather Latency Test v5.3
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Overlap(%)
1 2640.04 1364.66 1.88 0.00 1273.29 1317.53 3.20
2 2617.64 1349.89 1.91 0.00 1265.63 1303.53 2.74
4 2608.17 1343.79 1.91 0.00 1262.27 1297.65 2.56
8 2612.70 1347.53 2.00 0.00 1262.97 1301.21 2.77
16 2629.03 1351.41 1.89 0.00 1275.52 1304.94 2.09
32 2623.70 1353.98 2.05 0.00 1267.47 1307.64 2.90
64 2610.40 1348.38 2.10 0.00 1259.70 1301.98 3.07
128 2604.79 1346.67 2.10 0.00 1255.81 1307.71 3.79
256 2602.58 1344.22 2.10 0.00 1256.06 1298.11 3.06
512 2613.51 1351.70 2.08 0.00 1259.52 1305.29 3.33
1024 2608.89 1346.59 2.03 0.00 1260.06 1300.46 2.93
2048 2617.24 1347.61 1.93 0.00 1267.49 1301.33 2.44
4096 2612.47 1345.20 1.91 0.00 1265.15 1299.14 2.45
8192 2614.04 1349.67 2.06 0.00 1262.11 1303.37 2.99
16384 2641.96 1352.58 1.82 0.00 1287.36 1305.98 1.27
32768 2640.10 1363.25 2.03 0.00 1274.62 1316.47 3.01
65536 2662.18 1366.77 2.03 0.00 1293.17 1319.48 1.82
131072 2726.72 1412.95 2.07 0.00 1311.51 1364.44 3.71
262144 4164.28 2133.59 2.09 0.00 2028.40 2060.84 1.46
524288 7086.97 3622.83 2.34 0.00 3461.57 3499.48 1.01
1048576 12917.70 6579.70 2.31 0.00 6335.49 6355.80 0.28
- osu benchmark: osu_allgather
# OSU MPI-CUDA Allgather Latency Test v5.3
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1337.31 1336.51 1338.02 1000
2 1319.25 1318.69 1320.30 1000
4 1319.52 1319.04 1320.45 1000
8 1318.86 1318.44 1319.47 1000
16 1317.45 1317.05 1318.04 1000
32 1318.12 1317.70 1318.70 1000
64 1316.61 1316.13 1317.17 1000
128 1320.49 1320.06 1321.07 1000
256 1321.56 1321.12 1322.18 1000
512 1317.89 1317.47 1318.51 1000
1024 1321.17 1320.71 1321.81 1000
2048 1318.51 1318.06 1319.11 1000
4096 1319.92 1319.48 1320.53 1000
8192 1326.04 1325.58 1326.65 1000
16384 1334.55 1334.13 1335.19 100
32768 1341.45 1340.93 1342.17 100
65536 1355.90 1355.49 1356.54 100
131072 1369.65 1369.19 1370.40 100
262144 2080.14 2079.76 2080.74 100
524288 3534.73 3534.31 3535.26 100
1048576 6413.02 6412.56 6413.62 100
- nccl-tests
# nThread 1 nGpus 1 minBytes 1 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 10094 on dgx1v-loki-23 device 0 [0x06] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 10095 on dgx1v-loki-23 device 1 [0x07] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 10096 on dgx1v-loki-23 device 2 [0x0a] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 10097 on dgx1v-loki-23 device 3 [0x0b] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 10098 on dgx1v-loki-23 device 4 [0x85] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 10099 on dgx1v-loki-23 device 5 [0x86] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 10100 on dgx1v-loki-23 device 6 [0x89] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 10105 on dgx1v-loki-23 device 7 [0x8a] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 0.22 0.00 0.00 0 0.19 0.00 0.00 0
0 0 float none -1 0.20 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.19 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
0 0 float none -1 0.18 0.00 0.00 0 0.18 0.00 0.00 0
128 4 float none -1 11.69 0.01 0.01 0 11.77 0.01 0.01 0
256 8 float none -1 11.79 0.02 0.02 0 11.84 0.02 0.02 0
512 16 float none -1 12.02 0.04 0.04 0 12.09 0.04 0.04 0
1024 32 float none -1 12.32 0.08 0.07 0 12.05 0.08 0.07 0
2048 64 float none -1 12.76 0.16 0.14 0 12.40 0.17 0.14 0
4096 128 float none -1 13.35 0.31 0.27 0 12.23 0.33 0.29 0
8192 256 float none -1 13.30 0.62 0.54 0 12.96 0.63 0.55 0
16384 512 float none -1 16.86 0.97 0.85 0 15.62 1.05 0.92 0
32768 1024 float none -1 22.43 1.46 1.28 0 20.76 1.58 1.38 0
65536 2048 float none -1 22.90 2.86 2.50 0 21.84 3.00 2.63 0
131072 4096 float none -1 23.91 5.48 4.80 0 22.11 5.93 5.19 0
262144 8192 float none -1 23.87 10.98 9.61 0 22.28 11.76 10.29 0
524288 16384 float none -1 29.28 17.91 15.67 0 28.46 18.42 16.12 0
1048576 32768 float none -1 43.12 24.32 21.28 0 40.64 25.80 22.57 0
Reproducer
osu-micro-benchmarks, nccl-tests, and ucc_perftest
docker run \
--rm --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--cap-add=SYS_ADMIN --cap-add=SYS_PTRACE \
--privileged \
--device=/dev/infiniband \
--gpus all \
gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
/bin/bash -c '
apt-get update
apt-get install -y automake autoconf
# install osu-micro-benchmarks
git clone https://github.com/forresti/osu-micro-benchmarks.git
cd osu-micro-benchmarks
autoreconf -f -i
./configure --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64
make -j
make -j install
cd ..
# install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/lib/x86_64-linux-gnu MPI=1 MPI_HOME=/usr/local/mpi
cd ..
# run osu-benchmark test
mpirun \
-np 8 \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
--mca coll_ucc_priority 100 \
osu-micro-benchmarks/mpi/collective/osu_iallgather \
-d cuda -f
mpirun \
-np 8 \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
--mca coll_ucc_priority 100 \
osu-micro-benchmarks/mpi/collective/osu_allgather \
-d cuda -f
# run nccl-tests
mpirun \
-np 8 \
nccl-tests/build/all_gather_perf \
-b 1 -e 1048576 -f 2
# run ucc perftest with TL/CUDA
mpirun \
-np 8 \
/opt/hpcx/ucc/bin/ucc_perftest \
-c allgather -m cuda -T -F -b 1 -e 1048576
# run ucc perftest with TL/NCCL
mpirun \
-np 8 \
-x UCC_TL_NCCL_TUNE=inf \
/opt/hpcx/ucc/bin/ucc_perftest \
-c allgather -m cuda -T -F -b 1 -e 1048576
'
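To double-check that the osu runs above actually go through UCC (and its TL/CUDA path) rather than silently falling back, the same command can be rerun with UCC's collective trace enabled; a sketch reusing the MCA settings from the script:
# Hypothetical verification run: UCC_COLL_TRACE=info logs the TL chosen per collective.
mpirun \
    -np 8 \
    -x UCC_COLL_TRACE=info \
    --mca coll_ucc_enable 1 \
    --mca coll_hcoll_enable 0 \
    --mca coll_ucc_priority 100 \
    osu-micro-benchmarks/mpi/collective/osu_allgather \
    -d cuda -f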
nvFuser overlap benchmark
nvFuser comm/compute overlap experiment, compared against NCCL. In this experiment we post a single allgather followed by a single matmul op. After warmup and averaging across multiple iterations, NCCL's latency is far better than UCC's:
- NCCL latency: 4.86517 ms
- UCC latency: 263.535 ms
Reproducer:
docker run \
--rm --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--cap-add=SYS_ADMIN --cap-add=SYS_PTRACE \
--privileged \
--device=/dev/infiniband \
--gpus all \
gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
/bin/bash -c '
git clone https://github.com/samnordmann/Fuser.git
cd Fuser
git checkout origin/overlap_bench/first_experiments
git submodule sync --recursive
git submodule update --init --recursive
export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"
python setup.py --build-with-ucc --no-benchmark --no-python develop
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=^mlx5 \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8
'
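The run above only excludes the mlx5 TL, so UCC still chooses among the remaining TLs. To attribute the latency gap to a specific TL, the same benchmark can be rerun with the TL list pinned; a sketch, with the TL-list values ("cuda,ucp" / "nccl,ucp") being assumptions:
# Hypothetical A/B of the same benchmark with the TL pinned (TL names assumed).
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=cuda,ucp \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8
mpirun --allow-run-as-root -np 8 \
    -x UCC_COLL_TRACE=info -x UCC_CL_BASIC_TLS=nccl,ucp \
    $BUILD_DIRECTORY/test_multidevice \
    --gtest_filter=OverlapBenchmark.DummyBenchmark/*_S1_M32768_K1024_N1024_Streams8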