
ulfm: implement ulfm that does not depend on PMI #7706

Open
hzhou wants to merge 10 commits into pmodels:main from hzhou:2601_ulfm

Conversation

@hzhou
Contributor

@hzhou hzhou commented Jan 20, 2026

Pull Request Description

The current ULFM implementation is based on PMI/hydra; in particular, it relies on PMI_dead_processes obtained from PMI_Get. This prevents its use on systems with an unsupported PMI.

In this new implementation, we start with a resilient implementation of MPIX_Comm_agree, which uses active messages to probe processes in a binary-tree fashion. We rely on the fact that an am_send to a dead process will result in an immediate failure.

This resilient MPIX_Comm_agree can provide a consensus for MPIX_Comm_get_failed and enables a simple re-implementation of MPIX_Comm_shrink.
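As a rough illustration of what "binary-tree fashion" can mean, the sketch below computes the parent and children of a rank in an implicit binary tree rooted at rank 0. The helper names `tree_parent` and `tree_children` are hypothetical, and this layout is only one common choice; the PR's actual tree may differ.

```c
#include <stdbool.h>

/* Hypothetical helper: parent of `rank` in an implicit binary tree
 * rooted at rank 0. Returns -1 for the root. */
static int tree_parent(int rank)
{
    return rank == 0 ? -1 : (rank - 1) / 2;
}

/* Hypothetical helper: stores the children of `rank` that exist in a
 * tree of `size` ranks into children[] and returns how many there are
 * (0, 1, or 2). */
static int tree_children(int rank, int size, int children[2])
{
    int n = 0;
    for (int k = 1; k <= 2; k++) {
        int c = 2 * rank + k;
        if (c < size) {
            children[n++] = c;
        }
    }
    return n;
}
```

With this layout each rank probes at most three peers (one parent, two children), so a round over the whole communicator takes a number of steps logarithmic in its size.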

[skip warnings]

Test Program

if (rank == 1) {
    exit(-1);
}
int flag = 0x1;
MPIX_Comm_agree(comm, &flag);

MPI_Group failed_group; int num_failed;
MPIX_Comm_get_failed(comm, &failed_group);
MPI_Group_size(failed_group, &num_failed);
if (num_failed > 0) {
    MPI_Comm new_comm;
    MPIX_Comm_shrink(comm, &new_comm);
    /* error checking */
    MPI_Comm_free(&comm);
    comm = new_comm;
}

Notes

  • We rely on a send_probe function that will fail if the target process is dead. In ch4, this relies on MPIDI_NM_am_isend. However, the probe can behave in several ways:
    • it may fail right away by returning an error
    • it may succeed then set sreq->status.MPI_ERROR upon completion
    • it may succeed all the way but the target process dies after the probe
    • it may succeed with no feedback -- consider injection send
    • it may infinitely give EAGAIN in progress
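The outcomes listed above can be folded into a three-way verdict. The sketch below is only an illustration: `probe_attempt`, `probe_verdict`, and `classify_probe` are hypothetical names, not MPICH types, and treating an exhausted EAGAIN budget as a failure is an assumption.

```c
#include <stdbool.h>

typedef enum {
    PROBE_ALIVE,    /* no failure observed; the peer may still die later */
    PROBE_FAILED,   /* error at issue, at completion, or retry budget exhausted */
    PROBE_UNKNOWN   /* no feedback yet: request still pending */
} probe_verdict;

/* Hypothetical summary of one probe attempt. */
typedef struct {
    int issue_rc;       /* return code of the send call, 0 on success */
    bool completed;     /* whether the send request has completed */
    int status_error;   /* error recorded in the request status, 0 if none */
    int eagain_count;   /* how often progress reported EAGAIN so far */
    int eagain_limit;   /* give up after this many; <= 0 means unlimited */
} probe_attempt;

static probe_verdict classify_probe(const probe_attempt *p)
{
    if (p->issue_rc != 0) {
        return PROBE_FAILED;            /* failed right away */
    }
    if (p->eagain_limit > 0 && p->eagain_count >= p->eagain_limit) {
        return PROBE_FAILED;            /* endless EAGAIN, cut off by the limit */
    }
    if (!p->completed) {
        return PROBE_UNKNOWN;           /* keep polling */
    }
    if (p->status_error != 0) {
        return PROBE_FAILED;            /* error set upon completion */
    }
    return PROBE_ALIVE;                 /* success, possibly without real feedback */
}
```

Note that `PROBE_ALIVE` is the weakest verdict: it covers both a target that dies after the probe and an injected send that reports nothing, which is why the later commits add retries on the waiting side.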

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 2 commits January 20, 2026 11:12
We usually don't need to link with static dependencies when using
dynamic libraries. Adding the pkgconfig dependency sometimes causes
build issues when the dependency library paths are not in the system
library paths by default. Thus, removing it causes fewer issues.

We may need to add dependency libraries when building static libraries.
So TODO: add it back for static builds only.
The process manager, e.g. hydra, may still kill all the processes after
receiving the abort message, but applications will actually get a
return before being killed. This can be confusing. Let's exit and never
return in PMI_Abort. PMI2_Abort already does that.
@hzhou hzhou force-pushed the 2601_ulfm branch 2 times, most recently from 2e6e40c to 7157400 Compare January 20, 2026 18:39
hzhou added 3 commits January 20, 2026 13:21
When the user sets a finite MPIR_CVAR_CH4_OFI_MAX_EAGAIN_RETRY, we
should check it so that MPIDI_OFI_CALL_RETRY_AM doesn't end up in an
infinite loop.

For the tcp provider at least, fi_send to a dead process results in
infinite FI_EAGAIN.
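A minimal sketch of a retry loop that honors a finite retry limit follows. It does not reproduce MPIDI_OFI_CALL_RETRY_AM; `send_with_bounded_retry`, the `send_fn` callback, and the mock `always_eagain` are hypothetical, errno's `EAGAIN` stands in for FI_EAGAIN, and treating a non-positive limit as "unlimited" is an assumption.

```c
#include <errno.h>

/* Hypothetical send callback: returns 0 on success, -EAGAIN when the
 * operation should be retried, or another negative code on error. */
typedef int (*send_fn)(void *ctx);

/* Retry the send while it reports -EAGAIN, but give up with -ETIMEDOUT
 * after max_retry attempts. max_retry <= 0 means retry forever
 * (an assumption made for this sketch). */
static int send_with_bounded_retry(send_fn try_send, void *ctx, int max_retry)
{
    int attempts = 0;
    for (;;) {
        int rc = try_send(ctx);
        if (rc != -EAGAIN) {
            return rc;              /* success or a hard error */
        }
        attempts++;
        if (max_retry > 0 && attempts >= max_retry) {
            return -ETIMEDOUT;      /* stop instead of spinning forever */
        }
        /* real code would drive network progress here before retrying */
    }
}

/* Mock of a send to a dead peer: counts calls and always asks for a retry. */
static int always_eagain(void *ctx)
{
    (*(int *) ctx)++;
    return -EAGAIN;
}

/* Mock of a healthy send: counts calls and succeeds immediately. */
static int always_ok(void *ctx)
{
    (*(int *) ctx)++;
    return 0;
}
```

With a limit of 5, the dead-peer mock is invoked exactly five times before the caller gets an error it can turn into a process-failure report.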
To allow ulfm to work, we need to turn off fatal error codes.
Returning an error in progress will abort whichever call invoked
progress. Rather, we should return the error in the request's status
whenever we can so the error can be handled in the proper context.

Do this for short am send for now. It is needed in order to use short
active messages as a way to probe dead processes. Also add a FIXME that
we need to apply this to other requests as well.
hzhou added 5 commits January 21, 2026 14:59
Add an algorithm that does not depend on PMI_Get "dead_processes".

This algorithm relies on a try_probe(rank) method that can return
MPIX_ERR_PROC_FAILED if rank is dead.

The algorithm first performs a reduction and then a broadcast. Both
use active messages to tolerate dead processes.
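The reduce-then-broadcast shape can be illustrated with a sequential, single-process simulation. This is not the PR's code: `try_probe` here just consults a `dead[]` array instead of sending an active message, the tree is an assumed implicit binary tree rooted at rank 0, and the flag is combined with bitwise AND, which is how ULFM defines the agreed value.

```c
#include <stdbool.h>

/* Stand-in for the probe: real code would send an active message and
 * report MPIX_ERR_PROC_FAILED for a dead rank. */
static bool try_probe(const bool *dead, int rank)
{
    return !dead[rank];
}

/* Reduce phase over the subtree rooted at `rank`: AND together the
 * flags of live ranks and record dead ranks in failed[]. A dead rank's
 * children are still visited so the survivors below it are not lost. */
static int reduce_subtree(int rank, int size, const bool *dead,
                          const int *flags, bool *failed)
{
    int acc = ~0;
    if (rank >= size) {
        return acc;
    }
    if (try_probe(dead, rank)) {
        acc &= flags[rank];
    } else {
        failed[rank] = true;
    }
    acc &= reduce_subtree(2 * rank + 1, size, dead, flags, failed);
    acc &= reduce_subtree(2 * rank + 2, size, dead, flags, failed);
    return acc;
}

/* Reduce, then "broadcast": every live rank ends up with the same
 * agreed value in result[]; entries of dead ranks are left untouched. */
static void agree_sim(int size, const bool *dead, const int *flags,
                      int *result, bool *failed)
{
    for (int i = 0; i < size; i++) {
        failed[i] = false;
    }
    int agreed = reduce_subtree(0, size, dead, flags, failed);
    for (int i = 0; i < size; i++) {
        if (!dead[i]) {
            result[i] = agreed;
        }
    }
}
```

For four ranks with flags 7, 0, 5, 3 and rank 1 dead, the survivors agree on 7 & 5 & 3 = 1 and rank 1 is recorded as failed.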
MPID_Comm_agree will update a list of failed processes locally. This
provides an alternative implementation of ULFM that does not rely on
PMI_dead_processes.
If we assume the user always calls MPIX_Comm_shrink after
MPIX_Comm_agree, we can simply reduce MPIX_Comm_shrink to
MPI_Comm_create_group.

There is always a possibility that the user calls MPIX_Comm_shrink
without reaching a consensus on failed processes, or that new processes
fail during the call. However, I argue there is no robust way to handle
this situation anyway other than to return an error to the user and let
the user handle it.
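Once every survivor holds the same failed list, building the shrunken communicator is mostly rank bookkeeping. The sketch below shows only that bookkeeping in plain C; `shrink_ranks` is a hypothetical helper. In MPI terms the same set could be formed with MPI_Group_difference on the communicator's group and the failed group, then passed to MPI_Comm_create_group.

```c
#include <stdbool.h>

/* Hypothetical helper: given the agreed failed[] list for a communicator
 * of `size` ranks, write the surviving old ranks (ascending) into
 * survivors[] and the old-to-new rank mapping into new_rank[], where
 * failed ranks map to -1. Returns the size of the shrunken communicator. */
static int shrink_ranks(int size, const bool *failed,
                        int *survivors, int *new_rank)
{
    int n = 0;
    for (int old = 0; old < size; old++) {
        if (failed[old]) {
            new_rank[old] = -1;
        } else {
            new_rank[old] = n;
            survivors[n++] = old;
        }
    }
    return n;
}
```

Because the mapping depends only on the failed list, survivors that disagree on that list would compute different groups, which is the case the commit message argues should simply surface as an error.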
Even when our first probe of a process succeeds, the process may still
die before it enters MPIX_Comm_agree and sends us its probe. Regularly
retry the probe to avoid getting stuck waiting for a probe that never
arrives.

Potentially we may get stuck during the broadcast stage as well, but
that only means a process died *during* MPIX_Comm_agree. Hopefully, the
chance of that is low. In the previous case, by contrast, the process
may be doing an arbitrary amount of work before entering MPIX_Comm_agree
and may die before then.
A probe to a dead rank may succeed, and later we may receive a probe
from a substitute rank from the peer group. Since we never sent this
substitute rank a probe, it may hang.

Always verify the origin rank of received probes and send a make-up
probe if it is from an unexpected rank.
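The origin check can be sketched as a small decision function. The names are hypothetical and the `probed[]` bookkeeping is an assumption about how one might track which ranks have already been sent a probe; the PR's handler may be organized differently.

```c
#include <stdbool.h>

/* Hypothetical receive-side check. `expected` is the rank we originally
 * probed, `origin` is the rank the incoming probe actually came from,
 * and probed[r] records whether we have already sent rank r a probe.
 * Returns the rank that needs a make-up probe, or -1 if none is needed. */
static int on_probe_received(int origin, int expected, bool *probed)
{
    if (origin == expected) {
        return -1;          /* the peer we probed answered: nothing to do */
    }
    if (probed[origin]) {
        return -1;          /* substitute was already sent a probe */
    }
    probed[origin] = true;  /* remember so we send the make-up only once */
    return origin;          /* caller should now send a probe to origin */
}
```

Returning the rank instead of sending inside the function keeps the check free of communication calls, so it can be exercised without a network.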
@hzhou hzhou changed the title from "ulfm: implement ulfm that are independent of PMI" to "ulfm: implement ulfm that does not depend on PMI" Jan 21, 2026