ulfm: implement ulfm that does not depend on PMI #7706
Open
hzhou wants to merge 10 commits into pmodels:main from
We usually don't need to link static dependencies into dynamic libraries. Adding the pkgconfig dependency sometimes causes build issues when the dependency library paths are not in the default system library paths, so removing it causes fewer issues. We may need to add the dependency libraries when building static libraries. TODO: add it back for static builds only.
The process manager, e.g. hydra, may still kill all the processes after receiving the abort message, but applications will actually get a return before being killed. This can be confusing. Let's exit and never return in PMI_Abort. PMI2_Abort already does that.
When the user sets a finite MPIR_CVAR_CH4_OFI_MAX_EAGAIN_RETRY, we should check it so that MPIDI_OFI_CALL_RETRY_AM doesn't end up in an infinite loop. For the tcp provider at least, fi_send to a dead process results in infinite FI_EAGAIN.
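The bounded retry described above can be sketched as follows. This is a minimal illustration, not the MPICH macro: `sketch_call_retry`, the `SKETCH_*` codes, and the two helper operations are hypothetical names, and the convention that a negative limit means "retry forever" is an assumption made for this sketch.

```c
/* Hypothetical stand-ins for illustration; not MPICH or libfabric symbols. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_EAGAIN = -11,
    SKETCH_ERR_RETRY_EXHAUSTED = -1
};

typedef int (*sketch_op_fn)(void *ctx);

/* Retry `op` while it reports EAGAIN.
 * Assumption: max_retry < 0 means retry forever, max_retry >= 0 caps the
 * number of retries so a send to a dead peer cannot spin indefinitely. */
static int sketch_call_retry(sketch_op_fn op, void *ctx, int max_retry)
{
    int retries = 0;
    for (;;) {
        int rc = op(ctx);
        if (rc != SKETCH_EAGAIN)
            return rc;                      /* success or a hard error */
        if (max_retry >= 0 && retries >= max_retry)
            return SKETCH_ERR_RETRY_EXHAUSTED;
        retries++;
        /* a real implementation would drive the progress engine here */
    }
}

/* Models a send to a dead peer: it reports EAGAIN on every attempt. */
static int sketch_dead_peer_send(void *ctx)
{
    int *calls = ctx;
    (*calls)++;
    return SKETCH_EAGAIN;
}

/* Models a busy but live peer: EAGAIN a few times, then success. */
static int sketch_busy_peer_send(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_EAGAIN;
    }
    return SKETCH_SUCCESS;
}
```

With a limit of 3, the dead-peer operation is attempted four times (one initial call plus three retries) before the caller gets an error it can turn into a process-failure report.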
To allow ulfm to work, we need to turn off fatal error codes.
Returning an error from progress will abort whichever call invoked progress. Rather, we should return the error in the request's status whenever we can, so the error can be handled in the proper context. Do this for short am send for now. It is needed in order to use a short active message as a way to probe dead processes. Also add a FIXME that we need to apply this to other requests as well.
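The difference between failing progress and failing the request can be sketched like this. The types and names (`sk_request`, `sk_completion`, `sk_progress`) are hypothetical and only model the idea; `status_error` plays the role of the request's `status.MPI_ERROR`.

```c
#include <stddef.h>

/* Hypothetical error codes for this sketch. */
enum { SK_SUCCESS = 0, SK_ERR_PROC_FAILED = 1 };

typedef struct {
    int complete;       /* nonzero once the operation has finished */
    int status_error;   /* stands in for status.MPI_ERROR on the request */
} sk_request;

typedef struct {
    sk_request *req;    /* request this completion event belongs to */
    int err;            /* outcome reported by the transport */
} sk_completion;

/* Drain completion events. A failed operation is recorded on its own
 * request; progress itself still returns success, so an unrelated caller
 * that happened to invoke progress is not aborted by someone else's error. */
static int sk_progress(sk_completion *events, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        events[i].req->status_error = events[i].err;
        events[i].req->complete = 1;
    }
    return SK_SUCCESS;
}
```

The owner of the failed request then inspects `status_error` after completion and decides how to react, for example by marking the target rank as dead.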
Add an algorithm that does not depend on PMI_Get "dead_processes". This algorithm relies on a try_probe(rank) method that returns MPIX_ERR_PROC_FAILED if rank is dead. The algorithm first performs a reduction and then a broadcast. Both use active messages so that dead processes can be tolerated.
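The reduce-then-broadcast shape can be illustrated with a single-process simulation. This is a toy model under stated assumptions, not the code in this PR: all `sim_*` names are hypothetical, the tree uses children `2r+1` and `2r+2`, the reduction is a bitwise AND of per-rank flags, and a dead interior rank is simply skipped while its children are still visited.

```c
#include <string.h>

enum { SIM_OK = 0, SIM_ERR_PROC_FAILED = 1, SIM_MAX_RANKS = 64 };

typedef struct {
    int n;                        /* communicator size, at most SIM_MAX_RANKS */
    int alive[SIM_MAX_RANKS];     /* input: 1 if the rank is alive */
    int flag[SIM_MAX_RANKS];      /* input: each rank's contribution */
    int failed[SIM_MAX_RANKS];    /* output: failed ranks found by probing */
    int result[SIM_MAX_RANKS];    /* output: value delivered to live ranks */
} sim_comm;

/* Stand-in for try_probe(rank): fails if the rank is dead. */
static int sim_try_probe(const sim_comm *c, int rank)
{
    return c->alive[rank] ? SIM_OK : SIM_ERR_PROC_FAILED;
}

/* Reduction stage: AND the flags of all live ranks in the subtree.
 * A dead rank contributes the identity (~0) and is recorded as failed. */
static int sim_reduce(sim_comm *c, int rank)
{
    if (rank >= c->n)
        return ~0;
    int acc = ~0;
    if (sim_try_probe(c, rank) == SIM_OK)
        acc = c->flag[rank];
    else
        c->failed[rank] = 1;
    acc &= sim_reduce(c, 2 * rank + 1);
    acc &= sim_reduce(c, 2 * rank + 2);
    return acc;
}

/* Broadcast stage: deliver the agreed value to every live rank. */
static void sim_bcast(sim_comm *c, int rank, int value)
{
    if (rank >= c->n)
        return;
    if (sim_try_probe(c, rank) == SIM_OK)
        c->result[rank] = value;
    sim_bcast(c, 2 * rank + 1, value);
    sim_bcast(c, 2 * rank + 2, value);
}

static int sim_agree(sim_comm *c)
{
    memset(c->failed, 0, sizeof c->failed);
    int value = sim_reduce(c, 0);
    sim_bcast(c, 0, value);
    return value;
}
```

In this model every live rank ends up with the same value and the same failed list, including ranks whose tree parent is dead.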
MPID_Comm_agree updates a list of failed processes locally. This provides an alternative implementation of ULFM that does not rely on PMI_dead_processes.
If we assume the user always calls MPIX_Comm_shrink after MPIX_Comm_agree, we can simply reduce MPIX_Comm_shrink to MPI_Comm_create_group. It is always possible that the user calls MPIX_Comm_shrink without having reached a consensus on failed processes, or that new processes fail during the call. However, I argue there is no robust way to handle this situation other than returning an error and letting the user handle it.
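One way to picture this reduction: once every rank holds the same failed list, each survivor can compute the same list of surviving ranks locally. The helper below is a hypothetical sketch (`sk_surviving_ranks` is not an MPICH function); the comment shows how its output might feed the standard MPI group calls.

```c
/* Collect the ranks that are not marked failed, in ascending order.
 * `failed` has n entries; `ranks_out` must have room for n entries.
 * Returns the number of surviving ranks. */
static int sk_surviving_ranks(const int *failed, int n, int *ranks_out)
{
    int count = 0;
    for (int r = 0; r < n; r++) {
        if (!failed[r])
            ranks_out[count++] = r;
    }
    return count;
}

/* Sketch of how a shrink could use the list, assuming all survivors
 * agree on `failed` (error handling omitted):
 *
 *   MPI_Comm_group(comm, &group);
 *   MPI_Group_incl(group, count, ranks_out, &survivors);
 *   MPI_Comm_create_group(comm, survivors, tag, &newcomm);
 */
```

Because MPI_Comm_create_group is collective only over the members of the group, the dead ranks do not need to participate.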
Even when our first probe of a process succeeds, the process may still die before it enters MPIX_Comm_agree and sends us its probe. Retry the probe regularly so we don't get stuck waiting for a probe that never arrives. We may potentially get stuck during the broadcast stage as well, but that only means a process died *during* MPIX_Comm_agree. Hopefully, the chance of that is low. In the previous case, by contrast, the process may be doing an arbitrary amount of work before entering MPIX_Comm_agree and may die before then.
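The periodic re-probe can be sketched with a tick-based simulation. Everything here is hypothetical (`sim_peer`, `sim_wait_for_probe`, the `WAIT_*` codes); time is modeled as loop iterations, and `retry_interval` is assumed to be positive.

```c
enum { WAIT_GOT_PROBE = 0, WAIT_PEER_FAILED = 1, WAIT_TIMEOUT = 2 };

typedef struct {
    int sends_at_tick;  /* tick at which the peer's probe arrives, -1 if never */
    int dies_at_tick;   /* tick at which the peer dies, -1 if it never dies */
} sim_peer;

static int sim_peer_alive(const sim_peer *p, int tick)
{
    return p->dies_at_tick < 0 || tick < p->dies_at_tick;
}

static int sim_probe_arrived(const sim_peer *p, int tick)
{
    if (p->sends_at_tick < 0 || tick < p->sends_at_tick)
        return 0;
    /* the probe was only sent if the peer was still alive at that tick */
    return sim_peer_alive(p, p->sends_at_tick);
}

/* Wait for the peer's probe, re-probing the peer every retry_interval
 * ticks so that a peer that died after our first probe is detected
 * instead of being waited on forever. retry_interval must be > 0. */
static int sim_wait_for_probe(const sim_peer *p, int retry_interval,
                              int max_ticks)
{
    for (int tick = 0; tick < max_ticks; tick++) {
        if (sim_probe_arrived(p, tick))
            return WAIT_GOT_PROBE;
        if (tick % retry_interval == 0 && !sim_peer_alive(p, tick))
            return WAIT_PEER_FAILED;
    }
    return WAIT_TIMEOUT;
}
```

A shorter interval detects a dead peer sooner at the cost of more probe traffic; the sketch leaves that trade-off to the caller.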
A probe to a dead rank may appear to succeed, and later we receive a probe from a substitute rank in the peer group. Since we never sent this substitute rank a probe, it may hang. Always verify the origin rank of received probes and send a make-up probe if it comes from an unexpected rank.
Pull Request Description
The current ULFM implementation is based on PMI/hydra, in particular, the PMI_dead_processes from PMI_Get. This limits the usage on systems that use an unsupported PMI.
In this new implementation, we start with a resilient implementation of MPIX_Comm_agree, which relies on using active messages to probe processes in a binary tree fashion. We rely on the fact that an am_send to a dead process will result in an immediate failure.
This resilient MPIX_Comm_agree can provide a consensus on MPIX_Comm_get_failed and a simple re-implementation of MPIX_Comm_shrink.
[skip warnings]
Test Program
Notes
send_probe function that will fail if the target process is dead. In ch4, this relies on MPIDI_NM_am_isend. However -
sreq->status.MPI_ERROR upon completion
EAGAIN in progress
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.