ulfm: implement ulfm that does not depend on PMI by hzhou · Pull Request #7706 · pmodels/mpich

hzhou · 2026-01-20T02:27:33Z

Pull Request Description

The current ULFM implementation is based on PMI/hydra, in particular on the PMI_dead_processes key obtained through PMI_Get. This prevents ULFM from being used on systems whose PMI does not support that key.

In this new implementation, we start with a resilient implementation of MPIX_Comm_agree, which uses active messages to probe processes in a binary tree fashion. We rely on the fact that an am_send to a dead process will result in an immediate failure.

This resilient MPIX_Comm_agree can provide a consensus for MPIX_Comm_get_failed and allows a simple re-implementation of MPIX_Comm_shrink.

[skip warnings]

Test Program

if (rank == 1) {
    exit(-1);    /* simulate a failed process */
}
int flag = 0x1;
MPIX_Comm_agree(comm, &flag);    /* survivors agree and record the failed processes */

MPI_Group failed_group; int num_failed;
MPIX_Comm_get_failed(comm, &failed_group);
MPI_Group_size(failed_group, &num_failed);
if (num_failed > 0) {
    MPI_Comm new_comm;
    MPIX_Comm_shrink(comm, &new_comm);
    /* error checking */
    MPI_Comm_free(&comm);
    comm = new_comm;
}

Notes

We rely on a send_probe function that will fail if the target process is dead. In ch4, this relies on MPIDI_NM_am_isend. However, the send can behave in several ways:
- it may fail right away by returning an error
- it may succeed then set sreq->status.MPI_ERROR upon completion
- it may succeed all the way but the target process dies after the probe
- it may succeed with no feedback -- consider injection send
- it may infinitely give EAGAIN in progress
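
The outcomes listed above could be folded into a single classification step. The following is a minimal sketch, not code from this PR: `probe_result`, `probe_state`, and `classify_probe` are hypothetical names, and treating an exhausted EAGAIN budget as "dead" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical outcome of one probe attempt; not an MPICH type. */
enum probe_result { PROBE_ALIVE, PROBE_DEAD, PROBE_UNKNOWN };

/* Hypothetical snapshot of what the sender has observed so far. */
struct probe_state {
    int issue_err;      /* nonzero if the send call itself returned an error */
    bool completed;     /* true once the send request has completed */
    int status_err;     /* error recorded in the request status on completion */
    int eagain_count;   /* number of EAGAIN results seen while progressing */
    int max_eagain;     /* retry budget; <= 0 means unlimited */
    bool reply_seen;    /* true if the peer's own probe has arrived */
};

enum probe_result classify_probe(const struct probe_state *s)
{
    if (s->issue_err != 0)
        return PROBE_DEAD;      /* failed right away */
    if (s->completed && s->status_err != 0)
        return PROBE_DEAD;      /* failed upon completion */
    if (s->max_eagain > 0 && s->eagain_count >= s->max_eagain)
        return PROBE_DEAD;      /* EAGAIN budget exhausted */
    if (s->reply_seen)
        return PROBE_ALIVE;     /* the peer answered, so it was alive */
    /* The send looked fine (or gave no feedback) but nothing came back yet.
     * The peer may have died after the probe, so the caller has to keep
     * waiting or probe again. */
    return PROBE_UNKNOWN;
}
```

The third return value matters: a successful send alone is never taken as proof that the target is alive.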

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

We usually don't need to link against static dependencies when using dynamic libraries. Adding the pkgconfig dependency sometimes causes build issues when the dependency library paths are not in the default system library paths, so removing it causes fewer issues. We may need to add the dependency libraries when building static libraries. TODO: add it back for static builds only.

The process manager, e.g. hydra, may still kill all the processes after receiving the abort message, but applications will actually get a return before being killed. This can be confusing. Let's exit and never return in PMI_Abort. PMI2_Abort already does that.

When the user sets a finite MPIR_CVAR_CH4_OFI_MAX_EAGAIN_RETRY, we should check it so that MPIDI_OFI_CALL_RETRY_AM doesn't end up in an infinite loop. For the tcp provider at least, fi_send to a dead process results in infinite FI_EAGAIN.
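
A bounded retry loop of this kind might look like the following. This is a generic sketch under stated assumptions, not the actual MPIDI_OFI_CALL_RETRY_AM macro: `call_retry`, `send_fn`, `SKETCH_EAGAIN`, and `SKETCH_ETIMEOUT` are hypothetical, and a negative limit is assumed to mean "retry forever".

```c
#include <stddef.h>

#define SKETCH_EAGAIN   (-11)  /* stand-in for FI_EAGAIN; the value is arbitrary */
#define SKETCH_ETIMEOUT (-2)   /* hypothetical "gave up" error */

typedef int (*send_fn)(void *ctx);

/* Retry `send` while it reports EAGAIN. A negative max_retry means retry
 * forever; otherwise give up after max_retry retries and report an error
 * instead of spinning. */
int call_retry(send_fn send, void *ctx, int max_retry)
{
    int retries = 0;
    for (;;) {
        int rc = send(ctx);
        if (rc != SKETCH_EAGAIN)
            return rc;                  /* success or a real error */
        if (max_retry >= 0 && retries >= max_retry)
            return SKETCH_ETIMEOUT;     /* finite budget exhausted */
        retries++;
        /* a real implementation would poll progress here */
    }
}

/* Helper that models a dead peer: every attempt reports EAGAIN. */
int always_eagain(void *ctx)
{
    int *calls = (int *) ctx;
    (*calls)++;
    return SKETCH_EAGAIN;
}
```

With a limit of 3, the helper is called four times (one initial attempt plus three retries) before the loop gives up.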

To allow ULFM to work, we need to turn off fatal error codes.

Returning an error from progress aborts whichever call invoked progress. Instead, we should return the error in the request's status whenever we can, so that the error can be handled in the proper context. Do this for short am sends for now; it is needed in order to use short active messages as a way to probe for dead processes. Also add a FIXME noting that we need to apply this to other requests as well.

Add an algorithm that does not depend on PMI_Get "dead_processes". This algorithm relies on a try_probe(rank) method that can return MPIX_ERR_PROC_FAILED if rank is dead. The algorithm first performs a reduction and then a broadcast. Both use active messages in order to tolerate dead processes.
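
The idea of recording the failure on the request rather than returning it from progress can be sketched as follows. The types and functions here (`sketch_req`, `sketch_progress`, `sketch_wait`) are hypothetical stand-ins, not MPICH internals; `status_error` plays the role of sreq->status.MPI_ERROR.

```c
#include <stdbool.h>

#define SKETCH_SUCCESS 0
#define SKETCH_ERR_PROC_FAILED 101  /* stand-in for MPIX_ERR_PROC_FAILED */

struct sketch_req {
    bool completed;
    int status_error;   /* plays the role of sreq->status.MPI_ERROR */
    int transport_rc;   /* simulated result reported by the network */
};

/* Progress one request. A transport failure is recorded on the request and
 * the request is completed. Progress itself still returns success, so an
 * unrelated caller that happens to drive progress is not aborted. */
int sketch_progress(struct sketch_req *req)
{
    if (!req->completed) {
        req->status_error = (req->transport_rc == 0)
            ? SKETCH_SUCCESS : SKETCH_ERR_PROC_FAILED;
        req->completed = true;
    }
    return SKETCH_SUCCESS;
}

/* The owner of the request sees the error in its own context. */
int sketch_wait(struct sketch_req *req)
{
    while (!req->completed)
        sketch_progress(req);
    return req->status_error;
}
```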
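
To make the reduction step concrete, here is a single-process model of it. This is only a sketch under strong assumptions: it walks an implicit binary tree (children of r are 2r+1 and 2r+2) sequentially instead of exchanging active messages, `try_probe` here is a simulated stand-in that reads an `alive` array, and the names `agree_result`, `reduce_subtree`, and `sketch_agree` are hypothetical. It combines the flags with a bitwise AND and collects the ranks found dead.

```c
#include <stdbool.h>

#define SKETCH_MAX_RANKS 64   /* n must not exceed this in the sketch */

struct agree_result {
    int flag;                        /* bitwise AND of the live ranks' flags */
    bool failed[SKETCH_MAX_RANKS];   /* ranks that try_probe found dead */
};

/* Simulated stand-in for try_probe(rank): true if the rank is alive. */
bool try_probe(const bool *alive, int rank)
{
    return alive[rank];
}

/* Reduce the subtree rooted at `rank`. A dead node contributes no flag,
 * but its children are still visited so that their values are not lost. */
void reduce_subtree(int rank, int n, const bool *alive, const int *flags,
                    struct agree_result *out)
{
    if (rank >= n)
        return;
    if (try_probe(alive, rank))
        out->flag &= flags[rank];
    else
        out->failed[rank] = true;
    reduce_subtree(2 * rank + 1, n, alive, flags, out);
    reduce_subtree(2 * rank + 2, n, alive, flags, out);
}

void sketch_agree(int n, const bool *alive, const int *flags,
                  struct agree_result *out)
{
    out->flag = ~0;
    for (int i = 0; i < SKETCH_MAX_RANKS; i++)
        out->failed[i] = false;
    reduce_subtree(0, n, alive, flags, out);
    /* Broadcast stage: in this sequential model every live rank would
     * simply read *out. */
}
```

In a real distributed run each live rank performs its own part of the walk, which is where the substitute-rank and retry concerns described in the later commit messages come from.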

MPID_Comm_agree will update a list of failed processes locally. This provides an alternative implementation of ULFM that does not rely on PMI_dead_processes.

If we assume the user always calls MPIX_Comm_shrink after MPIX_Comm_agree, we can simply reduce MPIX_Comm_shrink to MPI_Comm_create_group. There is always a possibility that the user calls MPIX_Comm_shrink without first reaching a consensus on the failed processes, or that more processes fail during the call. However, I argue that there is no robust way to handle this situation anyway, other than returning an error and letting the user handle it.

Even when our first probe of a process succeeds, the process may still die before it enters MPIX_Comm_agree and sends us its probe. Retry the probe regularly to avoid getting stuck waiting for a probe that never arrives. We may also get stuck during the broadcast stage, but that only means a process died *during* MPIX_Comm_agree, and hopefully the chance of that is low. By comparison, in the previous case the process may do an arbitrary amount of work before entering MPIX_Comm_agree and may die at any point before then.

A probe to a dead rank may succeed, and we may later receive a probe from a substitute rank in the peer group. Since we never sent this substitute rank a probe, it may hang. Always verify the origin rank of received probes and send a make-up probe if it comes from an unexpected rank.
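
The group handed to MPI_Comm_create_group is the set of ranks not known to have failed. The sketch below computes that set in plain C so it can run without MPI; `sketch_survivors` is a hypothetical helper, and in MPI terms the same result could presumably be obtained with MPI_Group_difference on the communicator's group and the failed group.

```c
#include <stdbool.h>

/* Fill survivors[] with the old ranks that are not marked failed, in
 * ascending order, and return how many there are. The index within
 * survivors[] is the rank the process would have in the shrunken
 * communicator, assuming rank order is preserved. */
int sketch_survivors(int size, const bool *failed, int *survivors)
{
    int count = 0;
    for (int r = 0; r < size; r++) {
        if (!failed[r])
            survivors[count++] = r;
    }
    return count;
}
```

Every survivor must compute the same list for the collective creation to match up, which is why a prior consensus on the failed processes is assumed.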
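
The periodic retry can be illustrated with a small tick-based simulation. Everything here is hypothetical (`sim_peer`, `sim_probe_arrived`, `sim_reprobe_ok`, `sketch_wait_for_probe`); the real code would drive progress and re-send an active message rather than count ticks.

```c
#include <stdbool.h>

enum wait_result { WAIT_GOT_PROBE, WAIT_PEER_DEAD };

/* Simulated peer: its probe arrives at tick `arrive_at` unless it dies
 * first at tick `die_at`. A negative value means "never". */
struct sim_peer {
    int now;
    int arrive_at;
    int die_at;
};

bool sim_probe_arrived(const struct sim_peer *p)
{
    bool died_first = p->die_at >= 0 && p->die_at <= p->arrive_at;
    return !died_first && p->arrive_at >= 0 && p->now >= p->arrive_at;
}

/* Stand-in for re-sending our probe: it fails once the peer is dead. */
bool sim_reprobe_ok(const struct sim_peer *p)
{
    return !(p->die_at >= 0 && p->now >= p->die_at);
}

/* Wait for the peer's probe, re-probing every `reprobe_interval` ticks
 * (must be > 0). Without the re-probe, a peer that died before sending
 * would leave this loop spinning forever. */
enum wait_result sketch_wait_for_probe(struct sim_peer *p, int reprobe_interval)
{
    for (;;) {
        p->now++;
        if (sim_probe_arrived(p))
            return WAIT_GOT_PROBE;
        if (p->now % reprobe_interval == 0 && !sim_reprobe_ok(p))
            return WAIT_PEER_DEAD;
    }
}
```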
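
The origin check could be sketched as below. The names `probe_tracker` and `sketch_on_probe` are hypothetical, and setting a flag stands in for actually sending a probe.

```c
#include <stdbool.h>

#define SKETCH_MAX_RANKS 64   /* ranks must be below this in the sketch */

struct probe_tracker {
    bool sent_to[SKETCH_MAX_RANKS];   /* ranks we have already probed */
    int makeup_sent;                  /* number of make-up probes issued */
};

/* Handle a probe received from `origin` when we expected one from
 * `expected`. Returns the rank that should now be treated as the peer. */
int sketch_on_probe(struct probe_tracker *t, int expected, int origin)
{
    if (origin != expected && !t->sent_to[origin]) {
        /* A substitute answered for a dead peer. It never received a probe
         * from us, so send one now to keep it from hanging. */
        t->sent_to[origin] = true;    /* stands in for send_probe(origin) */
        t->makeup_sent++;
    }
    return origin;
}
```

Tracking `sent_to` keeps a repeated probe from the same substitute from triggering more than one make-up probe.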

hzhou force-pushed the 2601_ulfm branch from 2845a54 to 37bae72 Compare January 20, 2026 02:35

hzhou added 2 commits January 20, 2026 11:12

hzhou force-pushed the 2601_ulfm branch 2 times, most recently from 2e6e40c to 7157400 Compare January 20, 2026 18:39

hzhou added 3 commits January 20, 2026 13:21

ch4/ofi: check MAX RETRY in MPIDI_OFI_CALL_RETRY_AM

c62cc51

ch4/ofi: avoid using MPIR_ERR_SETFATALANDJUMP

193c97a

hzhou force-pushed the 2601_ulfm branch from 7157400 to 9fa3d11 Compare January 20, 2026 19:22

hzhou added 5 commits January 21, 2026 14:59

ch4/ulfm: add MPID_Comm_get_failed

6af3c84

hzhou force-pushed the 2601_ulfm branch from acff4c9 to cfae45c Compare January 21, 2026 21:04

hzhou changed the title ~~ulfm: implement ulfm that are independent of PMI~~ ulfm: implement ulfm that does not depend on PMI Jan 21, 2026

hzhou wants to merge 10 commits into pmodels:main from hzhou:2601_ulfm
