
coll: extending circulant graph algorithm#7710

Open
hzhou wants to merge 10 commits into pmodels:main from hzhou:2601_coll_circ

@hzhou hzhou commented Jan 29, 2026

Pull Request Description

Extend the circulant graph algorithm to reduce and allgather.

  • Reduce is the bcast schedule run in reverse.
  • Allgather is an "all-bcast": concurrent bcasts, one with each process as root.

In this PR:

  • Refactor the bcast_circ_graph algorithm into 3 pieces

    1. The generation of the circulant graph schedules
    2. The queuing and dependency tracking for non-blocking requests from running the schedule
    3. The bcast algorithm itself
  • Piece 2 is the most interesting part of this PR. The goal is to evolve it into a semi-general collective schedule framework that can perform:

    1. multi-stage async local staging/packing/unpacking for each send/recv
    2. dependency tracking
    3. concurrency limit control
    4. generalized request abstraction
  • Bcast is the simplest. The recvs have no dependencies. A send may depend on a previous recv of the same block.

  • Allgather multiplies the number of buffers (blocks) by the number of processes, but is otherwise the same as bcast.

  • Reduce:

    1. The recv has two parts: recv into tmp_buf and reduce into recvbuf. The recv part must wait until the previous recv has cleared, while the reduce part must wait until the previous sends have cleared.
    2. The send must wait until the previous recv, including its reduction, has cleared.
  • Reduce_scatter is an "all-" version of Reduce, just as Allgather is an "all-" version of Bcast. However, since we cannot reduce into sendbuf, we need to create a local temporary copy of sendbuf, which makes the algorithm unappealing memory-wise.

Reference:

NOTES

  • The algorithm performs reductions out of order. This is problematic for floating-point reductions, since the results may appear nondeterministic from the user's point of view.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2601_coll_circ branch 3 times, most recently from 97aae10 to 30b90f1 Compare January 29, 2026 21:21
The circulant graph algorithm can be extended to reduce, allgather, and
allreduce. Refactor so we can share the algorithm code.
@hzhou hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from 1f33ae0 to 2d41b24 Compare January 30, 2026 15:50
hzhou added 3 commits February 2, 2026 10:10
Before we extend the circ_graph algorithm to more collectives, e.g.
reduce and allgather, refactor to prepare for the new code.
Remove the extra parameters chunk_size and q_len for the bcast
circ_graph algorithm. Instead, use the global cvars
MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE and MPIR_CVAR_CIRC_GRAPH_Q_LEN to tune
all circ_graph algorithms. Both chunk_size and q_len have more to do
with the communication latency and bandwidth curve, and less to do with
specific collective operations. Removing the extra parameters for now
simplifies the effort to extend the circ_graph algorithm to more
collectives such as reduce and allgather. We can add the parameters back
in the future if they are shown to be necessary.
Instead of just a true/false flag, store the actual pending request
index in pending_blocks[] (which replaces can_send[]) to avoid a linear
search every time a send block is pending.
@hzhou hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from 02f4bca to 35f6086 Compare February 2, 2026 23:35
hzhou added 2 commits February 3, 2026 12:24
Handle the non-contig datatype packing and unpacking in
cga_request_queue. This paves the way for later extending
cga_request_queue to be nonblocking and able to handle asynchronous GPU
packing/unpacking.

Also move the q_len and chunk_size handling into cga_request_queue.c.
Bcast of zero-sized messages works with the circ_graph algorithm.
@hzhou hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from a65f7b7 to 57ced10 Compare February 5, 2026 20:05
hzhou added 2 commits February 6, 2026 10:57
If we reverse the circulant graph bcast schedule, we get the reduce
algorithm. We extend the cga_request_queue facility to perform reduction
at the completion of receive requests.

Unlike bcast, which receives each block only once, reduce receives the
same block from multiple processes (and performs a reduction), so we
need to check for pending previous receives before issuing new ones.
Allgather is the same as an all-bcast, with every rank acting as root.
Compared to bcast, the buffers are aggregate buffers for comm_size
processes.
@hzhou hzhou marked this pull request as ready for review February 6, 2026 16:57
hzhou added 2 commits February 6, 2026 12:53
Different collective types have very different dependency conditions for
issuing sends and recvs. Split them into separate functions rather than
having one big switch in a single function.
In bcast and allgather the dependency tracking is simple: a recv has no
dependency and a send depends on at most a single recv.
For reduction, we may have multiple pending sends and a single pending
recv.
hzhou commented Feb 6, 2026

test:mpich/ch4/most
test:mpich/ch3/most
