coll/acoll: Add support for MPI_Alltoall() #13046
Conversation
Force-pushed from 81c368e to c117925
@edgargabriel @lrbison Please review.
My biggest concern is that I don't think this will be correct for derived datatypes, and I didn't see a check for whether the buffers are on a GPU (maybe that happens at the acoll module level?).
I struggled with these issues myself in my alltoallv implementation, and wrote what I consider a very exhaustive validation tester. It was trivial to extend it from alltoallv to also test alltoall. Would you mind giving it a try? open-mpi/ompi-tests-public#32
Let me also do some testing; in particular, I'm curious what would happen on a Graviton layout, which can have as many as 96 cores at the same L3 distance.
if (comm_size <= 8) {
    if (total_dsize <= 128) {
        (*sync_enable) = true;
    } else {
        (*sync_enable) = false;
    }
    (*split_factor) = 2;
} else if (comm_size <= 16) {
Ripe for a tuning file in the future...
I tried my validator as follows and I see a validation error:
Looks like I can trigger it on specific tests. If you dig into the verbose prints you can see:
The "Crossing 0 x 1" refers to the send type and receive type respectively in this switch statement.
Hi @lrbison, thanks for sharing the tests. We are testing/validating this code using the tests you shared, and will soon push an update fixing the issues and addressing the comments.
    (*sync_enable) = false;
    (*split_factor) = 2;
} else {
    (*sync_enable) = true;
I would have assumed that the ring algorithm would be the best choice for large communicators and most of the data sizes. How did you build this decision function?
On our hardware, we observed the linear variants to perform better at lower node counts. For higher node counts the decision function still needs to be tuned, and it will be updated at that point.
@MithunMohanKadavil Rather than requiring a code update to change your customers' tuning, would you take a look at this PR and see whether it has the features required to support a tuning-file format for acoll? I haven't looked at exactly what tuning tree you use, but if you could sketch out how you would lay out the file structure, maybe we can make it compatible with tuned and move towards a generic tuning-file format, with at least the outer structure shared across multiple collective components.
Hi @lrbison, the framework in that PR does have the features required to support the tuning file. However, we have yet to decide on the file structure, in the sense of which parameters will be tunable and how they can be applied across the algorithms in acoll. Additionally, more tunable parameters may come up for us, so the tuning-file structure still needs to be vetted on our side.
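Purely as an illustration of the kind of rules such a file would have to encode (the struct, field names, and values below are a hypothetical sketch mirroring the hard-coded decision function shown above, not a format either PR defines):

#include <stdbool.h>

/* Hypothetical record a tuning file could map onto; the example rows mirror
 * the comm_size/total_dsize thresholds hard-coded in the decision function. */
typedef struct {
    int max_comm_size;    /* rule applies while comm_size <= this */
    int max_total_dsize;  /* ... and total_dsize <= this (bytes)  */
    bool sync_enable;
    int split_factor;
} acoll_alltoall_rule_t;

static const acoll_alltoall_rule_t acoll_alltoall_rules[] = {
    { 8,      128, true,  2 },  /* comm_size <= 8, total_dsize <= 128 */
    { 8, 1 << 30, false, 2 },   /* comm_size <= 8, larger messages    */
    /* ... further rows for comm_size <= 16, etc. ... */
};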
}

error_handler:
    ;
?
This was done to avoid having multiple return statements in the function. However, since there was no cleanup involved, the label body had to be left empty.
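As a minimal sketch of the single-exit pattern being described (the step helpers are hypothetical stubs, not the PR's code):

#include <mpi.h>

static int do_step_one(void) { return MPI_SUCCESS; }  /* hypothetical stub */
static int do_step_two(void) { return MPI_SUCCESS; }  /* hypothetical stub */

static int example_single_exit(void)
{
    int err;

    err = do_step_one();
    if (MPI_SUCCESS != err) goto error_handler;

    err = do_step_two();
    if (MPI_SUCCESS != err) goto error_handler;

error_handler:
    ;  /* empty statement: no cleanup needed, single return point below */
    return err;
}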
/* Create a new datatype that iterates over the send buffer in strides
 * of ps_grp_size * scount. */
struct ompi_datatype_t *new_ddt;
Creating a datatype for each send is a terrible approach. To overcome the cost of creating and committing the datatype one would need very, very large buffers.
- However, if I understand the code correctly, you are recreating the same datatype but with a different lower bound, so you could take advantage of that: create the indexed type once and then just change its lower bound. That will save you a significant amount of time.
- As ps_grp_num_ranks is constant during this function, how do you deal with the case where the number of processes is not divisible by your factor? Do you handle it?
Hi @bosilca, yes, the current approach I have used is suboptimal. The only change across the types is the displs array, which is based on read_pos. I am not aware of how the displs maps to the lower bound, and it would be very helpful if you could share an example or an API that can be used to manipulate the lower bound of an already created datatype.
As for point 2, if comm_size is a perfect multiple of the split factor, then each rank holds data for split_factor adjacent ranks (including its own) times ps_grp_num_ranks. However, if comm_size is not a multiple of the split factor, the original split factor itself is modified so that comm_size is expressed as (new_comm_size * split_factor) + 1 (for example, comm_size = 97 with split_factor = 2 gives 97 = 48 * 2 + 1), and the data for the final rank is handled in the function mca_coll_acoll_last_rank_scatter_gather.
ompi_datatype_create_resized is what you're looking for.
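A minimal sketch of how that might be applied here, assuming OMPI's internal datatype API; the parameter names follow the discussion and are illustrative, not the PR's actual code:

#include "ompi/datatype/ompi_datatype.h"

/* Build the strided type once and cache it; per step, derive a variant that
 * differs only in its lower bound via ompi_datatype_create_resized(), instead
 * of rebuilding and committing a fresh indexed type for every send. */
static int make_step_type(int nblocks, int scount, int ps_grp_size,
                          struct ompi_datatype_t *sdtype,
                          ptrdiff_t read_pos_bytes,
                          struct ompi_datatype_t **base_ddt, /* in/out cache */
                          struct ompi_datatype_t **step_ddt)
{
    ptrdiff_t lb, extent;

    if (NULL == *base_ddt) {
        /* Expensive construction and commit, done a single time. */
        ompi_datatype_create_vector(nblocks, scount, ps_grp_size * scount,
                                    sdtype, base_ddt);
        ompi_datatype_commit(base_ddt);
    }
    ompi_datatype_get_extent(*base_ddt, &lb, &extent);

    /* Cheap per-step variant: identical layout, shifted lower bound. */
    ompi_datatype_create_resized(*base_ddt, lb + read_pos_bytes, extent,
                                 step_ddt);
    return ompi_datatype_commit(step_ddt);
}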
Thanks, will check this and rework the code.
A new parallel-split algorithm for MPI_Alltoall is introduced as part of the acoll collective component, primarily targeting smaller message sizes (<= 4KB). At a high level, the algorithm operates by dividing the ranks into n groups, performing alltoall (using a base alltoall routine) within the n groups in parallel, after which data is exchanged between groups of n adjacent ranks (starting from rank 0). For example, if n=2, this algorithm splits the ranks into 2 groups, one containing all even-ranked processes and the other containing all odd-ranked processes. Alltoall is performed within these 2 groups in parallel, after which adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the alltoall operation. If n=4 or n=8, alltoall is performed within 4 or 8 groups in parallel. Following this step, groups of 4 or 8 adjacent ranks (starting from 0) exchange data among themselves to complete the alltoall operation.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
Force-pushed from c117925 to d25450f
bot:retest
This PR adds support for MPI_Alltoall() in the acoll collective component.
For messages smaller than a few KB, the algorithm operates by dividing the n ranks (n = comm size) into f groups (based on the value of rank % f), performing alltoall within the f groups in parallel, after which data is exchanged between groups of f adjacent ranks (starting from rank 0). For example, if f=2, this algorithm splits the n ranks into 2 groups, one containing all even-ranked (rank % 2 = 0) processes and the other containing all odd-ranked (rank % 2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel for 4 or 8 groups respectively, followed by data exchange among groups of 4 or 8 adjacent ranks.

The below diagram captures this algorithm for the case where f=2 and n=8:
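To make the two phases concrete, here is a minimal sketch using plain MPI calls and communicator splits. The buffer packing/reordering is elided, and the helper name and one-int-per-destination payload are illustrative assumptions, not the PR's acoll internals:

#include <mpi.h>
#include <stdlib.h>

/* Sketch of the parallel-split scheme: phase 1 runs f smaller alltoalls in
 * parallel (groups keyed by rank % f); phase 2 exchanges data within blocks
 * of f adjacent ranks (keyed by rank / f). Assumes f divides n; the PR
 * handles the remainder rank separately. */
static void parallel_split_alltoall_sketch(MPI_Comm comm, int f)
{
    int rank, n;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &n);
    int grp_size = n / f;

    /* Phase 1: ranks with equal (rank % f) form f groups. Each rank carries
     * f blocks per in-group peer: the peer's own block plus those of the
     * peer's f-1 adjacent partners. */
    MPI_Comm grp_comm;
    MPI_Comm_split(comm, rank % f, rank, &grp_comm);
    int *p1_sbuf = calloc((size_t)n, sizeof(int));
    int *p1_rbuf = calloc((size_t)n, sizeof(int));
    /* ... pack p1_sbuf from the user's send buffer ... */
    MPI_Alltoall(p1_sbuf, f, MPI_INT, p1_rbuf, f, MPI_INT, grp_comm);

    /* Phase 2: adjacent blocks of f ranks ([0..f-1], [f..2f-1], ...) swap
     * the pieces destined for the other groups, completing the alltoall. */
    MPI_Comm adj_comm;
    MPI_Comm_split(comm, rank / f, rank, &adj_comm);
    int *p2_rbuf = calloc((size_t)n, sizeof(int));
    MPI_Alltoall(p1_rbuf, grp_size, MPI_INT, p2_rbuf, grp_size, MPI_INT,
                 adj_comm);
    /* ... reorder p2_rbuf into the user's receive buffer ... */

    free(p2_rbuf); free(p1_rbuf); free(p1_sbuf);
    MPI_Comm_free(&adj_comm);
    MPI_Comm_free(&grp_comm);
}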
For the larger message-size range, a direct XPMEM-based copy is used in a linear fashion across all ranks.
The graphs below show the variation in latencies with osu-micro-benchmarks-7.3 for 96 and 192 ranks, for tuned and acoll:

