Description
As the title says, we've been seeing some mpi4py CI failures on main and v5.0.x recently.
C reproducer
I've managed to reproduce the spawn test failures locally on my Mac. The problem is that they're non-deterministic. 🙁
I've written a short C reproducer. It only seems to trip the error — sometimes! — when we run a bunch of Comm spawns in a single process.
// C version of an mpi4py test, blatantly stolen and converted to C
// from
// https://github.com/mpi4py/mpi4py/blob/master/test/test_spawn.py#L205-L217

#include <stdio.h>
#include <mpi.h>

// Spawned child: synchronize with the parent job, then disconnect.
void do_child(MPI_Comm parent)
{
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
}

// Parent: spawn "count" copies of this executable, sync, and disconnect.
void do_parent(char *argv[])
{
    // Must be an integer constant (not "const int") so that the arrays
    // below can be initialized; a VLA cannot have an initializer.
    enum { count = 3 };

    char *commands[count] = { argv[0], argv[0], argv[0] };
    int maxprocs[count] = { 1, 1, 1 };
    MPI_Comm child;
    int errcodes[count];
    MPI_Info infos[count] = { MPI_INFO_NULL, MPI_INFO_NULL, MPI_INFO_NULL };

    MPI_Comm_spawn_multiple(count, commands, MPI_ARGVS_NULL,
                            maxprocs, infos, 0,
                            MPI_COMM_SELF, &child,
                            errcodes);

    int local_size, remote_size;
    MPI_Comm_size(child, &local_size);
    MPI_Comm_remote_size(child, &remote_size);

    MPI_Barrier(child);
    MPI_Comm_disconnect(&child);
    MPI_Barrier(MPI_COMM_SELF);

    if (local_size != 1) {
        printf("WARNING: local_size == %d, expected 1\n", local_size);
    }
    if (remote_size != count) {
        printf("WARNING: remote_size == %d, expected %d\n",
               remote_size, count);
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    MPI_Barrier(MPI_COMM_SELF);

    // If there is no parent communicator, we are the original process and
    // we spawn children repeatedly; otherwise, we were spawned.
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        for (int i = 0; i < 32; ++i) {
            do_parent(argv);
        }
    } else {
        do_child(parent);
    }

    MPI_Barrier(MPI_COMM_SELF);
    MPI_Finalize();
    return 0;
}
Compile and run it with:
mpicc -g mpi4py-comm-spawn-defaults1.c -o mcsd
mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 mcsd
If I run this a few times, it will definitely fail at least once.
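For convenience, here's a tiny shell loop I use to re-run the reproducer until it trips (just a sketch, assuming bash and assuming the abort makes mpirun exit with a non-zero status):

i=0
# Re-run the reproducer until mpirun exits non-zero (i.e., until the
# spawn failure aborts the job).
while mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 mcsd; do
    i=$((i + 1))
    echo "iteration $i passed"
done
echo "failed after $i passing iterations"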
Supplemental detail
Sometimes the mpi4py tests all succeed (!). Sometimes one of the spawn tests randomly fails.
If you want to see the failure in the original mpi4py test suite, the good news is that there is a single command to rapidly re-run just the spawn tests. I find that this command fails once every several iterations:
mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 python3 test/main.py -v -f -k CommSpawn
The -k CommSpawn option is the selector: it runs any test that includes CommSpawn in the name (I think it's case sensitive...?). This ends up running only 16 tests (out of the entire mpi4py test suite), and when it succeeds, it only takes 2-3 seconds.
Here's a sample output from an mpi4py test that fails (it's not always this test):
testCommSpawnDefaults1 (test_spawn.TestSpawnMultipleSelfMany.testCommSpawnDefaults1) ... [JSQUYRES-M-4LRP:00000] *** An error occurred in Socket closed
[JSQUYRES-M-4LRP:00000] *** reported by process [1182269441,0]
[JSQUYRES-M-4LRP:00000] *** on a NULL communicator
[JSQUYRES-M-4LRP:00000] *** Unknown error
[JSQUYRES-M-4LRP:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[JSQUYRES-M-4LRP:00000] *** and MPI will try to terminate your MPI job as well)