TL/UCP: Update topo aware ring algorithm #1288
Open
Juee14Desai wants to merge 3 commits into openucx:master from
Conversation
force-pushed from 6c2242e to 843043c
Collaborator (Author):
/build

Collaborator:
@Juee14Desai @janjust I assume this will replace PR #1258?

Collaborator:
We talked about this; it shouldn't need to. If they are both good to go, then let's have both ring and topo-aware ring, and we can phase one out as needed.
force-pushed from d29d785 to b80e124
Replace the default ring allgather with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering.
- Each ring transfers its own slice of each block, enabling concurrent data movement across multiple NVLink paths.
- Algorithm auto selected for CUDA memory >4KB when cuda_ring is available; falls back to knomial otherwise.

Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c and decouples the service allgather from the topo aware ring.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Replace the default ring reduce_scatter with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering.
- Each ring handles its own sub block slice, with per ring GPU reductions via the executor before forwarding to the next peer.
- Scratch buffer management simplified to a single mc_alloc/free per task lifetime (removed fragmentation logic).

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Add a new monolithic allreduce ring that fuses reduce_scatter and allgather into a single task using team->cuda_ring for topo aware multi ring transfers (up to 8 parallel rings). Algorithm changes: - each step receives into scratch, reduces with the local dst block via GPU executor, then forwards the accumulated result to the next ring peer. - in-place ring allgather distributes all fully reduced blocks across ranks. - Both process runs in one progress function, with tagged send/recv counters reset at the algo transition. - Auto selected for CUDA memory >4KB when cuda_ring is available. Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
force-pushed from b80e124 to 9472c37
What
Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.
Why?
The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi GPU systems with NVLink, this results in suboptimal data routing: transfers may traverse slower paths instead of direct NVLink links.
How?
Allgather: ring rank, peer, and block indices are derived from the cuda_ring topology pattern instead of flat team rank ordering; each ring transfers its own slice of each block, enabling concurrent data movement across multiple NVLink paths.
Reduce_scatter: each ring handles its own sub block slice, with per ring GPU reductions via the executor before forwarding to the next peer; scratch buffer management is simplified to a single mc_alloc/free per task lifetime.
Allreduce: a monolithic ring that fuses reduce_scatter and allgather into a single task; each step receives into scratch, reduces with the local dst block via the GPU executor, and forwards the accumulated result to the next ring peer, then an in-place ring allgather distributes the fully reduced blocks.