Skip to content

Conversation

@ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@kawashima-fj @dholmes-epcc-ed-ac-uk @bosilca
this PR integrates the persistent collective framework of #2758 , and the persistent collectives are implemented by coll/libnbc i minimally extended for that purpose based on ideas from #4515

i still need to finalize a few things

  • MPI_Startall()
  • some optimizations are no more valid with persistent collectives (for example, Allgather on one proc is a simple copy and return ompi_request_empty, this is fine for non blocking, but bogus on persistent)

@kawashima-fj
Copy link
Member

@ggouaillardet @dholmes-epcc-ed-ac-uk @bosilca I've added commits to implement MPI_STARTALL and fix the optimization issue in this PR branch.

The version of COLL interface is not updated yet. I have questions.

  • We need to update all mca_coll_base_module_2_2_0_t to mca_coll_base_module_2_3_0_t since coll_*_init members are added to it. Correct? These fields are NULLed in coll_base_module_construct so that other COLL components work correctly without persistent collective functions.
  • We don't need to update MCA_COLL_BASE_VERSION_2_0_0 to MCA_COLL_BASE_VERSION_2_3_0 (or anything) since it represents a component version, not module version. Correct?

After the version update, there are two remaining works.

  • MPI_ interfaces. Currently only C bindings with MPIX_ prefix are available in this PR. I believe the persistent collective proposal in Adding Persistent Collective Communication mpi-forum/mpi-issues#25 will pass soon. After that, I'll move to MPI_ prefix and add Fortran bindings.
  • Test. I've tested basic functionalities but they are not enough. I'll continue the test.

@dholmes-epcc-ed-ac-uk This PR is based on your work and your idea. Thank you very much. I think this PR is almost ready to be merged into Open MPI. So your #4515 can be replaced with this PR. Do you have any comments?

if (OPAL_UNLIKELY(OMPI_SUCCESS != res)) {
return res;
}
handle->super.req_state = OMPI_REQUEST_ACTIVE;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible inconsistency? The state should be set to ACTIVE before starting the first round, so that no work can be done for this request while it is INACTIVE (or PENDING).

The state transition sequence here should be:
INACTIVE -> ACTIVE -> COMPLETE -> INACTIVE

  1. MPI_Start goes from INACTIVE to ACTIVE then
  2. finishing the last round of the schedule goes from ACTIVE to COMPLETE then
  3. MPI_Wait goes from COMPLETE to INACTIVE

As there are two separate boolean variables we should have:
INACTIVE && COMPLETE -> ACTIVE && !COMPLETE -> ACTIVE && COMPLETE -> INACTIVE && COMPLETE

Incidentally, why are there 3 separate boolean state variables (8 possible states) to represent a total of 3 valid states?

        handle->super.req_complete = { REQUEST_PENDING || REQUEST_COMPLETED };
        handle->super.req_state = { OMPI_REQUEST_ACTIVE || OMPI_REQUEST_INACTIVE };
        handle->nbc_complete = { false || true }; // possible synonym for handle->super.req_complete

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dholmes-epcc-ed-ac-uk Thanks. Regarding ACTIVE, I'll update the PR next week. Regarding nbc_complete, the commit message 4866c82 explains the necessity. I need the status information before req_state or req_complete is updated because NBC_Progress is called recursively. I know it is not a smart solution.

int ompi_coll_libnbc_alltoall_init (const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount,
MPI_Datatype recvtype, struct ompi_communicator_t *comm, MPI_Info info, ompi_request_t ** request,
struct mca_coll_base_module_2_2_0_t *module) {
int res = nbc_ialltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This delegation is, conceptually, the wrong way around. The initialisation functions for persistent operations do less work than the nonblocking functions. Thus, nbc_ialltoall could delegate to alltoall_init (and then immediately call nbc_start) but this way around can/will cause problems.

In this particular example (I have not checked elsewhere), the nbc_ialltoall function actually performs the memcpy or MPI_Pack that moves the user data, i.e. it (correctly for nonblocking) does both initialisation and start.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dholmes-epcc-ed-ac-uk Thanks. I replaced NBC_Copy with NBC_Sched_copy in 373af2f but memcpy and PMPI_Pack were left in nbc_ialltoall.c unintentionally. I'll update the PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the intention for this function is to be the common initialisation routine for persistent and nonblocking all-to-all, then it would be better to name it nbc_alltoall_init (rather than nbc_ialltoall). Future programmers will be less easily confused by that name, and less likely to (re-)introduce data-movement code into the initialisation-only step. Also, the nonblocking code path will then read correctly (initialise via nbc_alltoall_init then start via NBC_Start).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a point. I'll fix it.

@kawashima-fj kawashima-fj force-pushed the topic/pcoll branch 2 times, most recently from 14e3c1f to 5a80703 Compare May 22, 2018 05:49
@kawashima-fj
Copy link
Member

All my tests have completed and bugs are fixed now.

My planned works before OMPI v4.0:

  • Write man pages
    • Put the source files as ompi/mpiext/pcollreq/c/MPIX_*.3in
    • Write that current implementation is based on the draft standard and it may change
  • Update README
    • Write that persistent collective operations are available with MPIX_ prefix but the interface/behavior may be changed in the future

Works after standardization in the MPI Forum:

  • Change the MPIX_ prefix to MPI_ prefix
  • Move codes to ompi/mpi from ompi/mpiext
  • Add Fortran bindings (don't need to wait standardization)

@open-mpi open-mpi deleted a comment from ibm-ompi May 22, 2018
ggouaillardet and others added 6 commits June 11, 2018 09:53
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
…request() - set schedule in NBC_Schedule_request instead of NBC_Start() - update NBC_Start() prototype

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
prepare the upcoming persistent collectives by pre-factoring some code

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

fixup 808c3c62cd9475edd91ecde9d2d53b12e28b2c04
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Now libnbc COLL supports persistent collectives and all `*_init`
functions of the COLL interface are available. So let's enable the
check of availability of those functions on a communicator creation.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
@kawashima-fj
Copy link
Member

I've completed all works. I'll merge this next week. If you are interested in this feature, please review.

Some notes:

  • man pages are available for all MPIX_*_init functions. MPIX_Barrier_init.3 has contents for all functions and all other man pages redirect you to MPI_Barrier_init.3.
  • README contains a short explanation of this MPI extension.
  • Fortran bindings are not available now.
  • The COLL module interface version is updated to 2.3.0.

@kawashima-fj kawashima-fj added this to the v4.0.0 milestone Jun 11, 2018
kawashima-fj and others added 3 commits June 11, 2018 17:22
Until the MPI Forum decides to add the persistent collective
communication request feature to the MPI Standard, these functions
are supported through MPI extensions with the `MPIX_` prefix.

Only C bindings are supported currently.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
and errors otherwise.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Because a persistent reuqest does not free its `schedule` object
when the communication completes, the `NBC_Progress` function cannot
determine the completion using `schedule`.

Without this change, a hang occurs when the `NBC_Progress` function
is called recursively through the `NBC_Start_round` function.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`NBC_Copy` shoud not be called in `MPI_*_INIT`.
`NBC_Sched_copy` should be called instead.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Persistent operation for `NBC_A2A_DISS` is not supported currently.
Though the algorithm is not selected at all currently, I put an
assertion not to select it by mistake.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The `nbc_i*` functions don't start communication, but create a request.
`nbc_*_init` are appropriate names for them.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Members for persistent operations are added to the module structure
in a prior commit.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants