Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locality Aware Broadcast #185

Merged
merged 7 commits into from
Dec 6, 2021
Merged

Locality Aware Broadcast #185

merged 7 commits into from
Dec 6, 2021

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Nov 26, 2021

In this PR I change the broadcast algorithm in MPI to reduce the number of cross-host messages sent. With the new implementation, the number of cross-host messages grows linearly with the number of hosts, and not with the number of ranks.

In the process I also change some minor issues with the broadcast implementation:

  1. The external API was not symmetric to reduce (and to the other collective communication primitves): faabric expected all ranks to call reduce but only the root rank to call broadcast (all other ranks would call recv), see here for the implementation in faasm. Now all ranks call reduce, and the behaviour depends on the caller's rank.
  2. There was no MPIMessage::BROADCAST in the protobuf object, so MPI_Bcast messages were labeled as NORMAL messages.
  3. I change the nomenclature in the method to use rootRank for the rank which originates the broadcast.

Regarding the algorithm itself, I introduce the notion of localMasters in the MPI world. A local master is a selected leader for all MPI ranks living in a particular host. Then:

  • The node performing a broadcast will message (1) all its local ranks, and (2) all the remote masters.
  • A remote master will (1) listen for the message from the root, and (2) message all its local ranks.
  • A non-master rank will listen from a message from either (1) the root or (2) its local master, depending on wether the root and itself are colocated or not.

For the moment local masters are the lowest rank in each host. Even though the load in local masters will now slightly increase, we are drastically reducing the load in rank 0 (which was always elected as the master, and therefore was doing a bunch of cross-host messaging).

To ease with testing I enhance the mocking tools in the MpiWorld.

Also note that this PR is rebased on top of #187 for faster GHA tests, so I'd recommend reviewing #187 first.

@csegarragonz csegarragonz self-assigned this Nov 26, 2021
@csegarragonz csegarragonz marked this pull request as ready for review November 26, 2021 15:23
int MpiWorld::getMasterForHost(const std::string& host)
{
std::vector<int> ranks = getRanksForHost(host);
assert(!ranks.empty());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only call this method internally, with values from the host list that we have been broadcasted, and that necessarily must match to a rank (otherwise it wouldn't have been sent in the first place).
Thus why we assert here insted of throwing an exception.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a useful comment in the code rather than in the PR.

@csegarragonz csegarragonz force-pushed the opt-bcast-reduce branch 2 times, most recently from 62d04b3 to cdd122a Compare December 1, 2021 12:14
.github/workflows/tests.yml Outdated Show resolved Hide resolved
include/faabric/scheduler/MpiWorld.h Outdated Show resolved Hide resolved
src/mpi_native/mpi_native.cpp Outdated Show resolved Hide resolved
include/faabric/scheduler/MpiWorld.h Show resolved Hide resolved
src/mpi_native/mpi_native.cpp Show resolved Hide resolved
src/scheduler/MpiWorld.cpp Outdated Show resolved Hide resolved
src/scheduler/MpiWorld.cpp Outdated Show resolved Hide resolved
tests/test/scheduler/test_remote_mpi_worlds.cpp Outdated Show resolved Hide resolved
tests/test/scheduler/test_remote_mpi_worlds.cpp Outdated Show resolved Hide resolved
@csegarragonz csegarragonz merged commit 4eef572 into master Dec 6, 2021
@csegarragonz csegarragonz deleted the opt-bcast-reduce branch December 6, 2021 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants