
TL/UCP: Update topo aware ring algorithm #1288

Open
Juee14Desai wants to merge 3 commits into openucx:master from Juee14Desai:ucc-ring-algo

Conversation

@Juee14Desai
Collaborator

What

Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.

Why?

The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi-GPU systems with NVLink, this results in suboptimal data routing: transfers may traverse slower paths instead of direct NVLink links.

How?

Allgather:

  • Rewritten to derive ring rank, peer, and block indices from the cuda_ring topology pattern instead of flat team rank.
  • Each of the up to 8 rings transfers its own sub block slice concurrently.
  • Algorithm auto selected for CUDA memory >4KB when cuda_ring is available via dynamic score string in allgather.c.
  • Service allgather decoupled into dedicated service_allgather_ring_start/progress functions in tl_ucp_service_coll.c so internal collectives continue using the flat rank ring.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 1048576 | 4194304 | 1318.36 | 47.72 | 1107.44 | 56.81 |
| 2097152 | 8388608 | 2527.03 | 49.79 | 1355.50 | 92.83 |
| 4194304 | 16777216 | 4946.76 | 50.87 | 1738.13 | 144.79 |
| 8388608 | 33554432 | 9760.98 | 51.56 | 2051.20 | 245.38 |
| 16777216 | 67108864 | 19394.07 | 51.90 | 3189.02 | 315.66 |
| 33554432 | 134217728 | 38630.25 | 52.12 | 5770.31 | 348.90 |
| 67108864 | 268435456 | 77089.72 | 52.23 | 10886.80 | 369.85 |

Reduce_scatter:

  • Rewritten to use cuda_ring for multi ring topology aware transfers.
  • Each ring handles its own sub block slice with per ring GPU reductions via the executor before forwarding to the next peer.
  • Scratch buffer management simplified to a single ucc_mc_alloc/free per task lifetime.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 16777216 | 67108864 | 4500.15 | 13.98 | 9472.58 | 6.64 |
| 33554432 | 134217728 | 4724.07 | 26.64 | 6875.58 | 18.30 |
| 67108864 | 268435456 | 6985.83 | 36.02 | 7527.26 | 33.43 |
| 134217728 | 536870912 | 11992.22 | 41.97 | 8311.25 | 60.56 |
| 268435456 | 1073741824 | 22032.55 | 45.69 | 10223.47 | 98.46 |
| 536870912 | 2147483648 | 42097.26 | 47.82 | 14125.28 | 142.53 |
| 1073741824 | 4294967296 | 82387.74 | 48.87 | 23308.91 | 172.75 |

Allreduce:

  • Monolithic implementation that fuses reduce_scatter and allgather into a single task/progress function, avoiding schedule overhead.
  • Phase 0 receives into scratch, reduces with the local dst block via GPU executor, then forwards the accumulated result. Phase 1 performs an in-place ring allgather.
  • Tagged send/recv counters are reset at the phase transition.
  • Auto selected for CUDA memory >4KB via dynamic score string in allreduce.c.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 1048576 | 4194304 | N/A | N/A | 999.99 | 7.86 |
| 2097152 | 8388608 | N/A | N/A | 1095.41 | 14.36 |
| 4194304 | 16777216 | N/A | N/A | 1290.22 | 24.38 |
| 8388608 | 33554432 | N/A | N/A | 4255.08 | 14.79 |
| 16777216 | 67108864 | N/A | N/A | 921.83 | 136.50 |
| 33554432 | 134217728 | N/A | N/A | 1796.86 | 140.05 |
| 67108864 | 268435456 | N/A | N/A | 3513.99 | 143.23 |

@Juee14Desai
Collaborator Author

/build

@wfaderhold21
Collaborator

@Juee14Desai @janjust I assume this will replace PR #1258 ?

@janjust
Collaborator

janjust commented Mar 24, 2026

We talked about this, it shouldn't need to. If they are both good to go then let's have both ring and topo-aware ring and we can phase one out as needed.

@Juee14Desai force-pushed the ucc-ring-algo branch 2 times, most recently from d29d785 to b80e124 on March 31, 2026 05:47
Replace the default ring allgather with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring transfers its own slice of each block, enabling
  concurrent data movement across multiple NVLink paths.
- Algorithm auto selected for CUDA memory >4KB when cuda_ring
  is available; falls back to knomial otherwise.

Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c
and decouples the service allgather from the topo aware ring.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Replace the default ring reduce_scatter with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring handles its own sub block slice, with per ring GPU
  reductions via the executor before forwarding to the next peer.
- Scratch buffer management simplified to a single mc_alloc/free
  per task lifetime (removed fragmentation logic).

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Add a new monolithic allreduce ring that fuses reduce_scatter and
allgather into a single task using team->cuda_ring for topo aware
multi ring transfers (up to 8 parallel rings).

Algorithm changes:
- Each step receives into scratch, reduces with the local dst block
  via GPU executor, then forwards the accumulated result to the
  next ring peer.
- An in-place ring allgather then distributes all fully reduced
  blocks across ranks.
- Both phases run in one progress function, with tagged send/recv
  counters reset at the phase transition.
- Auto selected for CUDA memory >4KB when cuda_ring is available.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
3 participants