coll: extending circulant graph algorithm #7710
Open
hzhou wants to merge 10 commits into pmodels:main from
The circulant graph algorithm can be extended to reduce, allgather, and allreduce. Refactor so we can share the algorithm code.
Before we extend the circ_graph algorithm to more collectives, e.g. reduce and allgather, refactor to prepare for the new code.
Remove the extra parameters chunk_size and q_len for the bcast circ_graph algorithm. Instead, use the global cvars MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE and MPIR_CVAR_CIRC_GRAPH_Q_LEN to tune all circ_graph algorithms. Both chunk_size and q_len have more to do with the communication latency and bandwidth curve, and less to do with specific collective operations. Removing the extra parameters for now simplifies the effort to extend the circ_graph algorithm to more collectives such as reduce and allgather. We can add the parameters back in the future if they are shown to be necessary.
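To make the role of the chunk size concrete, here is a minimal sketch of how a message could be split into chunks of at most chunk_size bytes. The helper names cga_num_chunks and cga_chunk_len are hypothetical and not taken from this PR; the sketch assumes a contiguous byte buffer and a strictly positive chunk size, and it does not model how MPICH actually reads MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the PR): number of chunks needed to cover
 * total_bytes when each chunk holds at most chunk_size bytes. */
size_t cga_num_chunks(size_t total_bytes, size_t chunk_size)
{
    assert(chunk_size > 0);     /* assumption: the sketch requires a positive chunk size */
    return (total_bytes + chunk_size - 1) / chunk_size;     /* ceiling division */
}

/* Hypothetical helper: length in bytes of chunk i. Only the last chunk may
 * be shorter than chunk_size. */
size_t cga_chunk_len(size_t total_bytes, size_t chunk_size, size_t i)
{
    assert(i < cga_num_chunks(total_bytes, chunk_size));
    size_t remaining = total_bytes - i * chunk_size;
    return remaining < chunk_size ? remaining : chunk_size;
}
```

For example, a 10-byte message with a 4-byte chunk size would be covered by three chunks of 4, 4, and 2 bytes. Because the split depends only on the message size and the chunk size, the same calculation can plausibly serve every circ_graph collective, which is consistent with tuning through one global setting.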
Instead of just a true/false flag, we can store the actual pending request index in pending_blocks[] (which replaces can_send[]) to avoid a linear search every time a send block is pending.
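The idea of storing a request index instead of a boolean can be sketched as follows. Only the array name pending_blocks comes from the commit message; the struct, the function names, the fixed capacity, and the sentinel CGA_NO_PENDING are assumptions made for illustration and may differ from the actual cga_request_queue code.

```c
#include <assert.h>

#define CGA_MAX_BLOCKS 8        /* hypothetical fixed capacity for this sketch */
#define CGA_NO_PENDING (-1)     /* hypothetical sentinel: no request pending */

/* pending_blocks[b] holds the index of the outstanding request for block b,
 * or CGA_NO_PENDING when the block has nothing outstanding. */
struct cga_sketch {
    int pending_blocks[CGA_MAX_BLOCKS];
};

void cga_sketch_init(struct cga_sketch *s)
{
    for (int b = 0; b < CGA_MAX_BLOCKS; b++) {
        s->pending_blocks[b] = CGA_NO_PENDING;
    }
}

/* Record that request req_idx is outstanding for the given block. */
void cga_sketch_mark_pending(struct cga_sketch *s, int block, int req_idx)
{
    assert(block >= 0 && block < CGA_MAX_BLOCKS);
    assert(req_idx >= 0);
    s->pending_blocks[block] = req_idx;
}

/* O(1) lookup: returns the request index to wait on, or CGA_NO_PENDING.
 * With a boolean array, the caller would have to scan the request queue
 * to find which request belongs to the block. */
int cga_sketch_lookup(const struct cga_sketch *s, int block)
{
    assert(block >= 0 && block < CGA_MAX_BLOCKS);
    return s->pending_blocks[block];
}

/* Mark the block as having no outstanding request. */
void cga_sketch_clear(struct cga_sketch *s, int block)
{
    assert(block >= 0 && block < CGA_MAX_BLOCKS);
    s->pending_blocks[block] = CGA_NO_PENDING;
}
```

The sentinel still answers the old true/false question (a block is sendable when the lookup returns CGA_NO_PENDING), so the array carries strictly more information than the flag it replaces.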
Handle non-contiguous datatype packing and unpacking in cga_request_queue. This paves the way for later extending cga_request_queue to be nonblocking and able to handle asynchronous GPU packing/unpacking. Also move the q_len and chunk_size handling into cga_request_queue.c.
Bcast of zero-sized messages works with the circ_graph algorithm.
If we reverse the circulant graph bcast schedule, we get the reduce algorithm. We extend the cga_request_queue facility to perform reduction at the completion of receive requests. Unlike bcast, which receives each block only once, reduce receives the same block from multiple processes (and performs a reduction each time), so we need to check for pending previous receives before issuing new ones.
Allgather is the same as an all-bcast with every rank acting as root. Compared to bcast, the buffers are aggregate buffers covering comm_size processes.
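As a rough illustration of the aggregate buffer, the sketch below locates one block of one root's contribution. It assumes contiguous byte data, equal-sized contributions laid out in rank order (as in a standard allgather receive buffer), and chunking into blocks of chunk_size bytes. The function name cga_allgather_block_offset is hypothetical and not part of this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of block `block` of the message rooted
 * at rank `root` inside an aggregate buffer that holds one msg_bytes-sized
 * contribution per rank, stored in rank order. */
size_t cga_allgather_block_offset(int root, size_t msg_bytes,
                                  size_t chunk_size, size_t block)
{
    assert(root >= 0);
    assert(block * chunk_size <= msg_bytes);    /* block must start inside the message */
    return (size_t) root * msg_bytes + block * chunk_size;
}
```

With 10-byte contributions and a 4-byte chunk size, block 1 of the message rooted at rank 2 would start at byte 2 * 10 + 1 * 4 = 24. For a plain bcast only the block term remains, which is one way to see why allgather can reuse the bcast schedule once the buffers are scaled by comm_size.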
Different collective types have very different dependency conditions for issuing sends and recvs. Split them into separate functions rather than having one big switch within a single function.
In bcast and allgather the dependency tracking is simple, as a recv has no dependency and a send depends on at most a single recv. For reduction, we may have multiple pending sends and a single pending recv.
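The dependency rules described above can be sketched as per-block predicates. This is an interpretation of the commit messages rather than the code in the PR: the struct cga_block_state, its fields, and the function names are all hypothetical, and the rule that a reduction waits for outstanding sends of the same block is an assumption drawn from the PR description.

```c
#include <assert.h>
#include <stdbool.h>

#define CGA_NO_PENDING (-1)     /* hypothetical sentinel: no recv outstanding */

/* Hypothetical per-block state for this sketch. */
struct cga_block_state {
    int pending_recv;           /* request index of the outstanding recv, or CGA_NO_PENDING */
    int num_pending_sends;      /* number of outstanding sends that read this block */
};

/* Bcast/allgather: a send depends on at most one recv of the same block. */
bool bcast_can_send(const struct cga_block_state *b)
{
    return b->pending_recv == CGA_NO_PENDING;
}

/* Reduce: the same block is received several times, so a new recv must wait
 * until the previous recv of that block has completed. */
bool reduce_can_issue_recv(const struct cga_block_state *b)
{
    return b->pending_recv == CGA_NO_PENDING;
}

/* Reduce (assumption): reducing into the block must wait until no send is
 * still reading it. */
bool reduce_can_reduce(const struct cga_block_state *b)
{
    assert(b->num_pending_sends >= 0);
    return b->num_pending_sends == 0;
}
```

Keeping each rule in its own small function mirrors the decision to split the collectives into separate functions: bcast needs a single index per block, while reduce needs both an index and a counter.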
test:mpich/ch4/most
Pull Request Description
Extend the circulant graph algorithm to reduce and allgather.
In this PR -
Refactor the bcast_circ_graph algorithm into 3 pieces
The piece 2 is the most interesting part of this PR. The goal is to evolve it into a semi-general collective schedule framework that can perform
Bcast is the simplest. The recvs have no dependency. The send may depend on a previous recv of the same block.
Allgather extends the number of buffers or blocks by the number of processes, but otherwise it is the same as bcast.
Reduce -
tmp_buf and reduce into recvbuf. The recv part needs to clear the previous recv, but the reduce part needs previous sends
recv including the reduction
Reduce_scatter is an "all-" version of Reduce just as Allgather is an "all-" version of Bcast. However, since we cannot reduce into sendbuf, we need to create a local temp copy of sendbuf. This makes the algorithm unappealing memory-wise.
Reference:
NOTES
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.