coll: extending circulant graph algorithm by hzhou · Pull Request #7710 · pmodels/mpich

hzhou · 2026-01-29T16:20:02Z

Pull Request Description

Extend the circulant graph algorithm to reduce and allgather.

Reduce is the reverse of bcast.
Allgather is an all-bcast: a bcast from every process as root, all running concurrently.

In this PR:

Refactor the bcast_circ_graph algorithm into 3 pieces:
1. The generation of the circulant graph schedules
2. The queuing and dependency tracking for non-blocking requests from running the schedule
3. The bcast algorithm itself
Piece 2 is the most interesting part of this PR. The goal is to evolve it into a semi-general collective schedule framework that can perform:
1. multi-stage async local staging/packing/unpacking for each send/recv
2. dependency tracking
3. concurrency limit control
4. generalized request abstraction
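The concurrency limit in item 3 can be pictured as a fixed ring of request slots. The following is only a minimal sketch under assumptions: `struct req_ring`, `ring_init`, and `ring_issue` are hypothetical names, not the PR's cga_request_queue interface, and retiring the oldest slot stands in for waiting on a real nonblocking request.

```c
/* Hypothetical ring of at most q_len in-flight requests (illustration only). */
struct req_ring {
    int q_len;      /* concurrency limit */
    int head;       /* slot of the oldest in-flight request */
    int in_flight;  /* number of requests currently in flight */
    int completed;  /* number of requests retired so far */
};

static void ring_init(struct req_ring *r, int q_len)
{
    r->q_len = q_len;
    r->head = 0;
    r->in_flight = 0;
    r->completed = 0;
}

/* Issue one request and return the slot it occupies. When every slot is busy,
 * the oldest request is retired first; a real implementation would wait on it. */
static int ring_issue(struct req_ring *r)
{
    if (r->in_flight == r->q_len) {
        r->head = (r->head + 1) % r->q_len;
        r->in_flight--;
        r->completed++;
    }
    int slot = (r->head + r->in_flight) % r->q_len;
    r->in_flight++;
    return slot;
}
```

With q_len set to 2, five consecutive calls to `ring_issue` reuse slots 0, 1, 0, 1, 0, and never more than two requests are in flight at once.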
Bcast is the simplest case. The recvs have no dependencies. A send may depend on a previous recv of the same block.
Allgather multiplies the number of buffers (blocks) by the number of processes, but is otherwise the same as bcast.
Reduce:
1. The recv has two parts: recv into tmp_buf and reduce into recvbuf. The recv part needs the previous recv to clear, while the reduce part needs the previous sends to clear.
2. The send needs the previous recv to clear, including its reduction.
Reduce_scatter is an "all-" version of Reduce, just as Allgather is an "all-" version of Bcast. However, since we cannot reduce into sendbuf, we need to create a local temporary copy of sendbuf, which makes the algorithm unappealing memory-wise.

Reference:

NOTES

The algorithm performs reductions out of order. This is problematic for floating-point reductions: the results may appear nondeterministic from the user's point of view.
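The order sensitivity comes from floating-point addition not being associative. A minimal sketch, assuming IEEE 754 double arithmetic; `sum_in_order` is a hypothetical helper for illustration and is not part of the PR:

```c
#include <stddef.h>

/* Add vals[order[0]], vals[order[1]], ... strictly left to right. */
static double sum_in_order(const double *vals, const size_t *order, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += vals[order[i]];
    }
    return acc;
}
```

For the values {1e17, 1.0, -1e17}, summing in the order 0, 1, 2 loses the 1.0 (adjacent doubles near 1e17 are 16 apart) and gives 0.0, while the order 0, 2, 1 gives 1.0. A reduction whose combining order depends on message arrival can therefore return different results for the same inputs.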

[skip warnings]

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

The circulant graph algorithm can be extended to reduce, allgather, and allreduce. Refactor so we can share the algorithm code.

Before we extend the circ_graph algorithm to more collectives, e.g. reduce and allgather, refactor to prepare for the new code.

Remove the extra parameters chunk_size and q_len from the bcast circ_graph algorithm. Instead, use the global cvars MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE and MPIR_CVAR_CIRC_GRAPH_Q_LEN to tune all circ_graph algorithms. Both chunk_size and q_len have more to do with the communication latency and bandwidth curve than with specific collective operations. Removing the extra parameters for now simplifies the effort to extend the circ_graph algorithm to more collectives such as reduce and allgather. We can add the parameters back in the future if they prove necessary.
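To make the role of chunk_size concrete, here is a small sketch of splitting a message into fixed-size chunks. It is an illustration under assumptions, not the PR's code: `num_chunks` and `chunk_len` are hypothetical helpers, chunk_size is assumed to be nonzero, and `i` is assumed to be a valid chunk index.

```c
#include <stddef.h>

/* Number of chunks needed to cover nbytes (ceiling division). */
static size_t num_chunks(size_t nbytes, size_t chunk_size)
{
    return (nbytes + chunk_size - 1) / chunk_size;
}

/* Length in bytes of chunk i; only the last chunk may be short. */
static size_t chunk_len(size_t nbytes, size_t chunk_size, size_t i)
{
    size_t start = i * chunk_size;
    size_t rest = nbytes - start;
    return rest < chunk_size ? rest : chunk_size;
}
```

A 1000-byte message with a 256-byte chunk size splits into four chunks, the last one 232 bytes long. Smaller chunks give more pipelining but more per-message latency overhead, which is why the setting tracks the network's characteristics rather than the collective.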

Instead of just a true/false flag, store the actual pending request index in pending_blocks[] (which replaces can_send[]) to avoid a linear search every time a send block is pending.
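The difference between the two lookups can be sketched as follows. Only the array name pending_blocks[] comes from the commit message; `struct slot`, `NO_PENDING`, and both functions are hypothetical and simplified for illustration.

```c
#define NO_PENDING (-1)

/* Hypothetical queue slot: the block its in-flight request belongs to,
 * or NO_PENDING when the slot is idle. */
struct slot {
    int block;
};

/* With only a boolean flag per block, locating the request takes a scan. */
static int find_by_scan(const struct slot *q, int q_len, int block)
{
    for (int i = 0; i < q_len; i++) {
        if (q[i].block == block) {
            return i;
        }
    }
    return NO_PENDING;
}

/* With the request index stored per block, the lookup is a single read. */
static int find_by_index(const int *pending_blocks, int block)
{
    return pending_blocks[block];
}
```

Both functions return the same queue index for a pending block and NO_PENDING otherwise; the second avoids touching the queue at all.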

Handle non-contiguous datatype packing and unpacking in cga_request_queue. This paves the way for later extending cga_request_queue to be nonblocking and able to handle asynchronous GPU packing/unpacking. Also move the q_len and chunk_size handling into cga_request_queue.c.
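As a rough picture of what packing and unpacking mean here, the sketch below stages a strided layout through a contiguous buffer. It is a stand-in under assumptions: a fixed-stride array of int replaces a general MPI datatype, and `pack_strided` and `unpack_strided` are hypothetical names, not functions from the PR.

```c
#include <stddef.h>

/* Gather count elements, stride apart, into a contiguous staging buffer. */
static void pack_strided(const int *src, size_t count, size_t stride, int *packed)
{
    for (size_t i = 0; i < count; i++) {
        packed[i] = src[i * stride];
    }
}

/* Scatter a contiguous staging buffer back into the strided layout. */
static void unpack_strided(const int *packed, size_t count, size_t stride, int *dst)
{
    for (size_t i = 0; i < count; i++) {
        dst[i * stride] = packed[i];
    }
}
```

The sender would pack before issuing a send and the receiver would unpack after a recv completes, so each becomes one more stage in the request's lifetime.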

Bcast of zero-sized messages works with the circ_graph algorithm.

If we reverse the circulant graph bcast schedule, we get the reduce algorithm. We extend the cga_request_queue facility to perform the reduction at the completion of receive requests. Unlike bcast, which receives each block only once, reduce receives the same block from multiple processes (and reduces each copy), so we need to check for pending previous receives before issuing new ones.
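The reversal idea can be shown on a toy schedule. This sketch deliberately uses a small tree over four ranks rather than the circulant graph schedule, and simulates all ranks in one array; `struct step` and the three functions are hypothetical names for illustration.

```c
#include <stddef.h>

/* One point-to-point transfer in a toy schedule. */
struct step {
    int src;
    int dst;
};

/* Reverse the step order and swap the direction of every transfer. */
static void reverse_schedule(const struct step *in, size_t n, struct step *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].src = in[n - 1 - i].dst;
        out[i].dst = in[n - 1 - i].src;
    }
}

/* Bcast: each step copies the sender's value to the receiver. */
static void run_bcast(const struct step *s, size_t n, int *vals)
{
    for (size_t i = 0; i < n; i++) {
        vals[s[i].dst] = vals[s[i].src];
    }
}

/* Reduce (sum): each step folds the sender's value into the receiver. */
static void run_reduce(const struct step *s, size_t n, int *vals)
{
    for (size_t i = 0; i < n; i++) {
        vals[s[i].dst] += vals[s[i].src];
    }
}
```

With the bcast steps 0 to 1, 0 to 2, 1 to 3, the reversed schedule is 3 to 1, 2 to 0, 1 to 0, and running it as a sum over the values 1, 2, 3, 4 leaves 10 at rank 0. Rank 0 receives twice in the reversed schedule, which is the situation that requires checking for a pending previous receive.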

Allgather is the same as an all-bcast with every rank acting as root. Compared to bcast, the buffers are aggregate buffers for comm_size processes.
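One way to picture the aggregate buffers is as a single block space in which every root contributes its own run of chunks. The numbering below is an assumption made for illustration (root-major order, every rank contributing the same number of chunks); the helper names are hypothetical and the PR may index blocks differently.

```c
/* Hypothetical numbering: all chunks of root 0 first, then root 1, and so on. */
static int block_id(int root, int chunk, int chunks_per_rank)
{
    return root * chunks_per_rank + chunk;
}

/* Recover which root a block belongs to. */
static int block_root(int id, int chunks_per_rank)
{
    return id / chunks_per_rank;
}

/* Recover the chunk position within that root's contribution. */
static int block_chunk(int id, int chunks_per_rank)
{
    return id % chunks_per_rank;
}
```

Under this numbering the total block count is comm_size times the per-rank chunk count, and any per-block tracking array simply grows by that factor compared to bcast.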

Different collective types have very different dependency conditions for issuing sends and recvs. Split them into separate functions rather than having a single function with a big switch.

In bcast and allgather, the dependency tracking is simple: a recv has no dependency and a send depends on at most a single recv. For reduction, we may have multiple pending sends and a single pending recv.
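The dependency rules described in this PR can be written as simple predicates over per-block state. This is a sketch under assumptions: `struct block_state` and every function name are hypothetical, and the state is reduced to one pending-recv flag plus a pending-send counter to match the "single pending recv, multiple pending sends" description.

```c
#include <stdbool.h>

/* Hypothetical per-block state (illustration only). */
struct block_state {
    bool recv_pending;  /* a recv (and, for reduce, its reduction) is in progress */
    int sends_pending;  /* number of sends of this block still in flight */
};

/* Bcast and allgather: a recv never waits on anything. */
static bool bcast_can_recv(const struct block_state *b)
{
    (void) b;
    return true;
}

/* Bcast and allgather: a send waits for the recv of the same block, if any. */
static bool bcast_can_send(const struct block_state *b)
{
    return !b->recv_pending;
}

/* Reduce: a new recv waits for the previous recv of the block. */
static bool reduce_can_recv(const struct block_state *b)
{
    return !b->recv_pending;
}

/* Reduce: folding into recvbuf waits for the previous sends of the block. */
static bool reduce_can_fold(const struct block_state *b)
{
    return b->sends_pending == 0;
}

/* Reduce: a send waits for the previous recv, including its reduction. */
static bool reduce_can_send(const struct block_state *b)
{
    return !b->recv_pending;
}
```

Keeping one predicate per collective and operation mirrors the split into separate functions: each rule stays a one-liner instead of a branch inside a shared switch.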

hzhou · 2026-02-06T19:21:43Z

test:mpich/ch4/most
test:mpich/ch3/most

hzhou force-pushed the 2601_coll_circ branch 3 times, most recently from 97aae10 to 30b90f1 on January 29, 2026 21:21

coll: refactor bcast_intra_circ_graph.c

b262b1b

hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from 1f33ae0 to 2d41b24 on January 30, 2026 15:50

hzhou added 3 commits February 2, 2026 10:10

coll: refactor before adding new circ_graph algorithms

a5b7ba0

coll/circ_graph: store pending request to avoid search

b08622e

hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from 02f4bca to 35f6086 on February 2, 2026 23:35

hzhou added 2 commits February 3, 2026 12:24

coll/circ_graph: allow bcast zero-sized messages

ab10aa8

hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from a65f7b7 to 57ced10 on February 5, 2026 20:05

hzhou added 2 commits February 6, 2026 10:57

coll: add intra_circ_graph allgather algorithm

c761af1

hzhou force-pushed the 2601_coll_circ branch from 57ced10 to d9e1b20 on February 6, 2026 16:57

hzhou marked this pull request as ready for review February 6, 2026 16:57

hzhou added 2 commits February 6, 2026 12:53

coll/circ_graph: refactor send/recv interface

12bd463

coll/circ_graph: refactor dependency tracking for reduce

7116eb8

hzhou force-pushed the 2601_coll_circ branch from d9e1b20 to 7116eb8 on February 6, 2026 19:21

hzhou wants to merge 10 commits into pmodels:main from hzhou:2601_coll_circ
