
TL/UCP: Update topo aware ring algorithm #1288

Open
Juee14Desai wants to merge 3 commits into openucx:master from Juee14Desai:ucc-ring-algo

Conversation

@Juee14Desai
Collaborator

What

Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.

Why?

The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi-GPU systems with NVLink, this results in suboptimal data routing: transfers may traverse slower paths instead of direct NVLink links.

How?

Allgather:

  • Rewritten to derive ring rank, peer, and block indices from the cuda_ring topology pattern instead of flat team rank.
  • Each of the up to 8 rings transfers its own sub block slice concurrently.
  • Algorithm auto selected for CUDA memory >4KB when cuda_ring is available via dynamic score string in allgather.c.
  • Service allgather decoupled into dedicated service_allgather_ring_start/progress functions in tl_ucp_service_coll.c so internal collectives continue using the flat rank ring.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 1048576 | 4194304 | 1318.36 | 47.72 | 1107.44 | 56.81 |
| 2097152 | 8388608 | 2527.03 | 49.79 | 1355.50 | 92.83 |
| 4194304 | 16777216 | 4946.76 | 50.87 | 1738.13 | 144.79 |
| 8388608 | 33554432 | 9760.98 | 51.56 | 2051.20 | 245.38 |
| 16777216 | 67108864 | 19394.07 | 51.90 | 3189.02 | 315.66 |
| 33554432 | 134217728 | 38630.25 | 52.12 | 5770.31 | 348.90 |
| 67108864 | 268435456 | 77089.72 | 52.23 | 10886.80 | 369.85 |

Reduce_scatter:

  • Rewritten to use cuda_ring for multi ring topology aware transfers.
  • Each ring handles its own sub block slice with per ring GPU reductions via the executor before forwarding to the next peer.
  • Scratch buffer management simplified to a single ucc_mc_alloc/free per task lifetime.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 16777216 | 67108864 | 4500.15 | 13.98 | 9472.58 | 6.64 |
| 33554432 | 134217728 | 4724.07 | 26.64 | 6875.58 | 18.30 |
| 67108864 | 268435456 | 6985.83 | 36.02 | 7527.26 | 33.43 |
| 134217728 | 536870912 | 11992.22 | 41.97 | 8311.25 | 60.56 |
| 268435456 | 1073741824 | 22032.55 | 45.69 | 10223.47 | 98.46 |
| 536870912 | 2147483648 | 42097.26 | 47.82 | 14125.28 | 142.53 |
| 1073741824 | 4294967296 | 82387.74 | 48.87 | 23308.91 | 172.75 |

Allreduce:

  • Monolithic implementation that fuses reduce_scatter and allgather into a single task/progress function, avoiding schedule overhead.
  • Phase 0 receives into scratch, reduces with the local dst block via GPU executor, then forwards the accumulated result. Phase 1 performs an in-place ring allgather.
  • Tagged send/recv counters are reset at the phase transition.
  • Auto selected for CUDA memory >4KB via dynamic score string in allreduce.c.
| Count | Size (bytes) | UCC master time avg (us) | UCC master BW avg (GB/s) | UCC this PR time avg (us) | UCC this PR BW avg (GB/s) |
|---|---|---|---|---|---|
| 1048576 | 4194304 | N/A | N/A | 999.99 | 7.86 |
| 2097152 | 8388608 | N/A | N/A | 1095.41 | 14.36 |
| 4194304 | 16777216 | N/A | N/A | 1290.22 | 24.38 |
| 8388608 | 33554432 | N/A | N/A | 4255.08 | 14.79 |
| 16777216 | 67108864 | N/A | N/A | 921.83 | 136.50 |
| 33554432 | 134217728 | N/A | N/A | 1796.86 | 140.05 |
| 67108864 | 268435456 | N/A | N/A | 3513.99 | 143.23 |

@Juee14Desai
Collaborator Author

/build

@wfaderhold21
Collaborator

@Juee14Desai @janjust I assume this will replace PR #1258 ?

@janjust
Collaborator

janjust commented Mar 24, 2026

We talked about this, it shouldn't need to. If they are both good to go then let's have both ring and topo-aware ring and we can phase one out as needed.

@Juee14Desai force-pushed the ucc-ring-algo branch 2 times, most recently from d29d785 to b80e124 on March 31, 2026 05:47
Replace the default ring allgather with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring transfers its own slice of each block, enabling
  concurrent data movement across multiple NVLink paths.
- Algorithm auto selected for CUDA memory >4KB when cuda_ring
  is available; falls back to knomial otherwise.

Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c
and decouples the service allgather from the topo aware ring.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Replace the default ring reduce_scatter with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring handles its own sub block slice, with per ring GPU
  reductions via the executor before forwarding to the next peer.
- Scratch buffer management simplified to a single mc_alloc/free
  per task lifetime (removed fragmentation logic).

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Add a new monolithic allreduce ring that fuses reduce_scatter and
allgather into a single task using team->cuda_ring for topo aware
multi ring transfers (up to 8 parallel rings).

Algorithm changes:
- Each step receives into scratch, reduces with the local dst block
  via GPU executor, then forwards the accumulated result to the
  next ring peer.
- An in-place ring allgather then distributes all fully reduced
  blocks across ranks.
- Both phases run in one progress function, with tagged send/recv
  counters reset at the phase transition.
- Auto selected for CUDA memory >4KB when cuda_ring is available.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
3 participants