-
Notifications
You must be signed in to change notification settings - Fork 928
persistent collectives #4618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
persistent collectives #4618
Conversation
f793596 to
a125778
Compare
|
@kawashima-fj @dholmes-epcc-ed-ac-uk @bosilca i still need to finalize a few things
|
a125778 to
ce88cad
Compare
|
@ggouaillardet @dholmes-epcc-ed-ac-uk @bosilca I've added commits to implement The version of COLL interface is not updated yet. I have questions.
After the version update, there are two remaining works.
@dholmes-epcc-ed-ac-uk This PR is based on your work and your idea. Thank you very much. I think this PR is almost ready to be merged into Open MPI. So your #4515 can be replaced with this PR. Do you have any comments? |
ompi/mca/coll/libnbc/nbc.c
Outdated
| if (OPAL_UNLIKELY(OMPI_SUCCESS != res)) { | ||
| return res; | ||
| } | ||
| handle->super.req_state = OMPI_REQUEST_ACTIVE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible inconsistency? The state should be set to ACTIVE before starting the first round, so that no work can be done for this request while it is INACTIVE (or PENDING).
The state transition sequence here should be:
INACTIVE -> ACTIVE -> COMPLETE -> INACTIVE
- MPI_Start goes from INACTIVE to ACTIVE then
- finishing the last round of the schedule goes from ACTIVE to COMPLETE then
- MPI_Wait goes from COMPLETE to INACTIVE
As there are two separate boolean variables we should have:
INACTIVE && COMPLETE -> ACTIVE && !COMPLETE -> ACTIVE && COMPLETE -> INACTIVE && COMPLETE
Incidentally, why are there 3 separate boolean state variables (8 possible states) to represent a total of 3 valid states?
handle->super.req_complete = { REQUEST_PENDING || REQUEST_COMPLETED };
handle->super.req_state = { OMPI_REQUEST_ACTIVE || OMPI_REQUEST_INACTIVE };
handle->nbc_complete = { false || true }; // possible synonym for handle->super.req_complete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dholmes-epcc-ed-ac-uk Thanks. Regarding ACTIVE, I'll update the PR next week. Regarding nbc_complete, the commit message 4866c82 explains the necessity. I need the status information before req_state or req_complete is updated because NBC_Progress is called recursively. I know it is not a smart solution.
ompi/mca/coll/libnbc/nbc_ialltoall.c
Outdated
| int ompi_coll_libnbc_alltoall_init (const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, | ||
| MPI_Datatype recvtype, struct ompi_communicator_t *comm, MPI_Info info, ompi_request_t ** request, | ||
| struct mca_coll_base_module_2_2_0_t *module) { | ||
| int res = nbc_ialltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This delegation is, conceptually, the wrong way around. The initialisation functions for persistent operations do less work than the nonblocking functions. Thus, nbc_ialltoall could delegate to alltoall_init (and then immediately call nbc_start) but this way around can/will cause problems.
In this particular example (I have not checked elsewhere), the nbc_ialltoall function actually performs the memcpy or MPI_Pack that moves the user data, i.e. it (correctly for nonblocking) does both initialisation and start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dholmes-epcc-ed-ac-uk Thanks. I replaced NBC_Copy with NBC_Sched_copy in 373af2f but memcpy and PMPI_Pack were left in nbc_ialltoall.c unintentionally. I'll update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the intention for this function is to be the common initialisation routine for persistent and nonblocking all-to-all, then it would be better to name it nbc_alltoall_init (rather than nbc_ialltoall). Future programmers will be less easily confused by that name, and less likely to (re-)introduce data-movement code into the initialisation-only step. Also, the nonblocking code path will then read correctly (initialise via nbc_alltoall_init then start via NBC_Start).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a point. I'll fix it.
14e3c1f to
5a80703
Compare
|
All my tests have completed and bugs are fixed now. My planned works before OMPI v4.0:
Works after standardization in the MPI Forum:
|
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
…request() - set schedule in NBC_Schedule_request instead of NBC_Start() - update NBC_Start() prototype Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
prepare the upcoming persistent collectives by pre-factoring some code Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> fixup 808c3c62cd9475edd91ecde9d2d53b12e28b2c04
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Now libnbc COLL supports persistent collectives and all `*_init` functions of the COLL interface are available. So let's enable the check of availability of those functions on a communicator creation. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
5a80703 to
ec2d96a
Compare
|
I've completed all works. I'll merge this next week. If you are interested in this feature, please review. Some notes:
|
Until the MPI Forum decides to add the persistent collective communication request feature to the MPI Standard, these functions are supported through MPI extensions with the `MPIX_` prefix. Only C bindings are supported currently. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
and errors otherwise. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Because a persistent reuqest does not free its `schedule` object when the communication completes, the `NBC_Progress` function cannot determine the completion using `schedule`. Without this change, a hang occurs when the `NBC_Progress` function is called recursively through the `NBC_Start_round` function. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`NBC_Copy` shoud not be called in `MPI_*_INIT`. `NBC_Sched_copy` should be called instead. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Persistent operation for `NBC_A2A_DISS` is not supported currently. Though the algorithm is not selected at all currently, I put an assertion not to select it by mistake. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The `nbc_i*` functions don't start communication, but create a request. `nbc_*_init` are appropriate names for them. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Members for persistent operations are added to the module structure in a prior commit. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
ec2d96a to
3dce039
Compare
No description provided.