
coll/acoll: Add support for MPI_Alltoall() #13046

Open
wants to merge 1 commit into main

Conversation

MithunMohanKadavil
Contributor

This PR adds support for MPI_Alltoall() in acoll collective component.

For messages lower than a few KBs, the algorithm operates by dividing the n (n=comm size) ranks into f groups (based on the value of rank % f), performing alltoall within the f groups in parallel, following which data is exchanged between groups of f adjacent ranks (starting from rank 0). For example, if f=2 , this algorithm splits the n ranks into 2 groups, one containing all even ranked (rank%2 = 0) processes and another containing all odd ranked (rank%2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3]...) exchange data to complete MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel for 4 or 8 groups respectively, followed by data exchange among 4 or 8 adjacent ranks.
The diagram below (image in the original PR) captures this algorithm for the case where f=2 and n=8.
For the larger message-size range, a direct XPMEM-based copy is used in a linear fashion across all ranks.
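To make the two phases concrete, below is a minimal sketch of the parallel-split idea expressed with plain MPI calls and temporary communicators. It assumes comm_size is a multiple of the split factor f and a datatype with a well-behaved extent; all names here are illustrative, and the actual acoll implementation operates over shared memory and XPMEM rather than creating communicators and datatypes like this.

#include <mpi.h>
#include <stdlib.h>

/* Sketch only: two-phase parallel-split alltoall for split factor f,
 * assuming comm_size % f == 0. */
static int parallel_split_alltoall(const void *sbuf, void *rbuf, int count,
                                   MPI_Datatype dtype, int f, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int g = size / f;                        /* ranks per split group */

    MPI_Aint lb, ext;
    MPI_Type_get_extent(dtype, &lb, &ext);

    /* Phase 1: alltoall among ranks with the same (rank % f).  The data each
     * group peer needs is the contiguous run of f blocks destined for that
     * peer's block of f adjacent ranks, so a plain alltoall with f*count
     * elements per peer is enough. */
    MPI_Comm grp_comm;
    MPI_Comm_split(comm, rank % f, rank, &grp_comm);
    char *tmp = malloc((size_t)size * count * ext);
    MPI_Alltoall(sbuf, f * count, dtype, tmp, f * count, dtype, grp_comm);

    /* Phase 2: exchange among the f adjacent ranks [k*f .. k*f+f-1].  Each
     * peer gets g blocks of `count` elements strided by f*count elements; a
     * vector type resized to an extent of count elements lets MPI_Alltoall
     * pick the per-peer starting offset automatically. */
    MPI_Datatype vec, strided;
    MPI_Type_vector(g, count, f * count, dtype, &vec);
    MPI_Type_create_resized(vec, 0, count * ext, &strided);
    MPI_Type_commit(&strided);

    MPI_Comm pair_comm;
    MPI_Comm_split(comm, rank / f, rank, &pair_comm);
    MPI_Alltoall(tmp, 1, strided, rbuf, 1, strided, pair_comm);

    MPI_Type_free(&vec);
    MPI_Type_free(&strided);
    MPI_Comm_free(&grp_comm);
    MPI_Comm_free(&pair_comm);
    free(tmp);
    return MPI_SUCCESS;
}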

The graphs below (images in the original PR) show the variation in latency with osu-micro-benchmarks-7.3 for 96 and 192 ranks, comparing tuned and acoll.

@MithunMohanKadavil
Contributor Author

@edgargabriel @lrbison Please review.

@janjust janjust requested a review from lrbison February 11, 2025 16:11
Contributor

@lrbison lrbison left a comment


My biggest concern is that I don't think this will be correct for derived datatypes, and I didn't see a check for whether the buffers are on a GPU (maybe that's handled at the acoll module level?)

I struggled with these issues myself in my alltoallv implementation, and made what I consider to be a very exhaustive validation tester. It was trivial to extend it from alltoallv to also test alltoall. Would you mind giving it a try? open-mpi/ompi-tests-public#32

Let me also do some testing; in particular, I'm curious what would happen on a Graviton layout, which can have as many as 96 cores at the same L3 distance.

Comment on lines +48 to +55
if (comm_size <= 8) {
    if (total_dsize <= 128) {
        (*sync_enable) = true;
    } else {
        (*sync_enable) = false;
    }
    (*split_factor) = 2;
} else if (comm_size <= 16) {
Contributor


Ripe for a tuning file in the future...

@lrbison
Contributor

lrbison commented Feb 14, 2025

I tried my validator as follows and I see a validation error:

mpirun --mca coll_acoll_priority 100 -n 4 ./src/alltoallv_ddt -A alltoall -v 2

Looks like I can trigger it on specific tests, including --only 2,2, --only 2,3, and --only 2,4, all showing failures.

If you dig into the verbose prints you can see:

--- Starting test 2,2.  Crossing 0 x 1
Created span from 0:4.  Data from 0:4
Created span from 0:48.  Data from 0:48
Datatype      (send,recv) extents (4,48), size (4,48), and lb (0,0)
Datatype TRUE (send,recv) extents (4,48), size (4,48), and lb (0,0)
<nasty assertion fail abort ... >

The "Crossing 0 x 1" refers to send type and receive type respectively in this switch statement
So test 2,2 sends 12 4-byte MPI_INTs, and receives 1 DDT which is a MPI_Type_contiguous vector of 12 integers. The following tests make more and more strange DDTs for checking.
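For context, here is a minimal standalone sketch of the kind of send/receive type crossing that test exercises (illustrative only; the actual tester builds many more DDT variants):

#include <mpi.h>
#include <stdlib.h>

/* Every rank sends 12 MPI_INTs per peer but receives them as one contiguous
 * derived datatype of 12 ints per peer.  Names and sizes are illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(sizeof(int) * 12 * size);
    int *recvbuf = malloc(sizeof(int) * 12 * size);
    for (int i = 0; i < 12 * size; i++) sendbuf[i] = i;

    MPI_Datatype recv_ddt;
    MPI_Type_contiguous(12, MPI_INT, &recv_ddt);
    MPI_Type_commit(&recv_ddt);

    /* Same type signature on both sides (12 ints per peer), different
     * datatypes: a correct alltoall must handle this. */
    MPI_Alltoall(sendbuf, 12, MPI_INT, recvbuf, 1, recv_ddt, MPI_COMM_WORLD);

    MPI_Type_free(&recv_ddt);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}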

@MithunMohanKadavil
Contributor Author

Hi @lrbison, thanks for sharing the tests. We are validating this code using the shared tests and will soon push an update fixing the issues and addressing the comments.

    (*sync_enable) = false;
    (*split_factor) = 2;
} else {
    (*sync_enable) = true;
Member


I would have assumed that the ring algorithm would be the best choice for large communicators and most data sizes. How did you build this decision function?

Contributor Author


On our hardware, we observed the linear variants to perform better at lower node counts. However, for higher node counts it still needs to be tuned, and at that point this decision function will be updated.

Contributor

@lrbison lrbison Feb 27, 2025


@MithunMohanKadavil Rather than requiring a code update to update your customer's tuning, would you take a look at this PR and see if it has the required features to support a tuning file format for acoll? I haven't taken a look at exactly which tuning tree you use, but if you could sketch out how you would lay out the file structure, maybe we can make it compatible with tuned and move towards a generic tuning file format, with at least the outer structure shared across multiple collective components.

Contributor Author


Hi @lrbison, the framework in that PR does have the features required to support a tuning file. However, we have yet to decide on the file structure, i.e., which parameters will be tunable and how they can be applied across the algorithms in acoll. Additionally, more tunable parameters may come up for us, so the tuning file structure still needs to be vetted on our side.

}

error_handler:
;
Member


?

Contributor Author


This was done to avoid having multiple return statements in the function. However, since there was no cleanup involved, the label had to be left empty.
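For illustration, a minimal sketch of the single-exit goto pattern being discussed, with hypothetical names; when a function does allocate resources, the cleanup lives under the label (shown non-empty here, unlike the acoll case):

#include <mpi.h>
#include <stdlib.h>

static int example_fn(void)
{
    int err = MPI_SUCCESS;
    void *buf = malloc(64);
    if (NULL == buf) {
        err = MPI_ERR_NO_MEM;
        goto error_handler;
    }
    /* ... work that may jump to error_handler on failure ... */

error_handler:
    free(buf);   /* cleanup lives here; empty statement if nothing to release */
    return err;
}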


/* Create a new datatype that iterates over the send buffer in strides
* of ps_grp_size * scount. */
struct ompi_datatype_t *new_ddt;
Member


Creating a datatype for each send is a terrible approach. In order to overcome the cost of creating and committing the datatype one would need very, very large buffers.

  1. However, if I understand the code correctly you are recreating the same datatype but with a different lower bound, so you could take advantage of that to create the indexed type once and then just change its lower bound. That will save you a significant amount of time.
  2. As ps_grp_num_ranks is a constant during this function, how do you deal with the case where the number of processes is not divisible by your factor? Do you handle it?

Contributor Author


Hi @bosilca, yes, the current approach I have used is suboptimal. The only change across types is the displs array, based on the read_pos. I am not aware of how the displs maps to the lower bound, and it would be very helpful if you could share an example or API that can be used to manipulate the lower bound of an already created datatype.

As for point 2, if comm_size is a perfect multiple of the split factor, then each rank would contain data for split_factor adjacent ranks (including its own) times ps_grp_num_ranks. However, in case comm_size is not a multiple of the split factor, the original split factor itself would have been modified such that comm_size is expressed as (new_comm_size * split_factor) + 1, and the data for the final rank will be handled in the function mca_coll_acoll_last_rank_scatter_gather.

Member


ompi_datatype_create_resized is what you're looking for.
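For reference, one way to amortize the type-creation cost is sketched below with the public MPI API (MPI_Type_create_resized playing the role of the internal ompi_datatype_create_resized): the strided view is built once, and the changing read position is applied as a buffer offset rather than by rebuilding an indexed type per send. The helper, the read_pos/peer_rank arrays, and the send call are hypothetical, not the acoll code.

#include <mpi.h>

static void send_strided_blocks(const void *sbuf, int scount,
                                MPI_Datatype dtype, int ps_grp_size,
                                int ps_grp_num_ranks, const int *read_pos,
                                const int *peer_rank, int tag, MPI_Comm comm)
{
    MPI_Aint lb, ext;
    MPI_Type_get_extent(dtype, &lb, &ext);

    /* One vector type: ps_grp_num_ranks blocks of scount elements, strided
     * by ps_grp_size * scount elements, then resized so consecutive uses
     * step by scount elements. */
    MPI_Datatype vec, strided;
    MPI_Type_vector(ps_grp_num_ranks, scount, ps_grp_size * scount, dtype, &vec);
    MPI_Type_create_resized(vec, 0, scount * ext, &strided);
    MPI_Type_commit(&strided);

    /* Reuse the committed type for every peer; the per-peer read position
     * becomes a byte offset into the send buffer. */
    for (int p = 0; p < ps_grp_size; p++) {
        const char *src = (const char *)sbuf + (MPI_Aint)read_pos[p] * ext;
        MPI_Send(src, 1, strided, peer_rank[p], tag, comm);
    }

    MPI_Type_free(&vec);
    MPI_Type_free(&strided);
}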

Contributor Author


Thanks, will check this and rework the code.

- A new parallel-split algorithm for MPI_Alltoall is introduced as part
of the acoll collective component, primarily targeting smaller message
sizes (<= 4KB). The algorithm, at a high level, operates by dividing the
ranks into n groups, performing alltoall (using a base alltoall routine)
within the n groups in parallel, following which data is exchanged
between groups of n adjacent ranks (starting from rank 0). For example,
if n=2, this algorithm splits the ranks into 2 groups, one containing
all even-ranked processes and another containing all odd-ranked
processes. Alltoall is performed within these 2 groups in parallel,
after which adjacent even-odd pairs (pairs being [0,1], [2,3], ...)
exchange data to complete the alltoall operation. If n=4 or n=8,
alltoall is performed within 4 or 8 groups in parallel. Following this
step, groups of 4 or 8 adjacent ranks (starting from rank 0) exchange
data among themselves to complete the alltoall operation.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
@MithunMohanKadavil
Contributor Author

bot:retest
