v2.x: MPI singleton + PMIx dstore fails #2897

Closed
@kawashima-fj

@rhc54 As discussed in #2859, when I enable the PMIx dstore on the v2.x branch, an MPI process run as a singleton (launched directly, without mpiexec) fails with the following message.

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  
  orte_ess_init failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  
  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

The problem appears to be in the fork_hnp function of the singleton ESS. It checks that the PMIx parameter list contains exactly four entries, but the number of entries changes when the dstore is enabled, probably because PMIX_DSTORE_ESH_BASE_PATH is added.

https://github.com/open-mpi/ompi/blob/v2.x/orte/mca/ess/singleton/ess_singleton_module.c#L615

615             if (4 != opal_argv_count(argv)) {
(gdb) n
616                 opal_argv_free(argv);
(gdb) p cptr
$8 = 0x6533cc "PMIX_NAMESPACE=399310849,PMIX_RANK=0,PMIX_SERVER_URI=pmix-server:22133:/tmp/openmpi-sessions-1000@imtofu2_0/6093/pmix-22133,PMIX_SECURITY_MODE=native,PMIX_DSTORE_ESH_BASE_PATH=/tmp/openmpi-sessions-10"...
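
To illustrate why a fixed count is fragile, here is a minimal sketch in plain C of looking up the expected parameters by name, so that extra entries such as PMIX_DSTORE_ESH_BASE_PATH are ignored. This is not the code from a1e8e58 or from ORTE. The helpers find_param and check_required_params are hypothetical, and the list of required keys is an assumption taken from the first four entries in the gdb output above. It also assumes that values never contain a comma.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of ORTE/OPAL): search a comma-separated
 * "KEY=VALUE,KEY=VALUE" list for `key` and return a malloc'd copy of its
 * value, or NULL if the key is absent. The caller frees the result. */
char *find_param(const char *list, const char *key)
{
    assert(NULL != list && NULL != key);

    size_t klen = strlen(key);
    const char *p = list;

    while (NULL != p && '\0' != *p) {
        const char *end = strchr(p, ',');
        size_t len = (NULL != end) ? (size_t)(end - p) : strlen(p);

        /* Match only "KEY=" exactly, so PMIX_RANK does not match PMIX_RANKS. */
        if (len > klen && 0 == strncmp(p, key, klen) && '=' == p[klen]) {
            size_t vlen = len - klen - 1;
            char *val = malloc(vlen + 1);
            if (NULL == val) {
                return NULL;
            }
            memcpy(val, p + klen + 1, vlen);
            val[vlen] = '\0';
            return val;
        }
        p = (NULL != end) ? end + 1 : NULL;
    }
    return NULL;
}

/* Hypothetical check: return 0 if every required key is present, -1
 * otherwise. Unknown extra keys are tolerated instead of rejected. */
int check_required_params(const char *list)
{
    /* Assumed set of required keys, based on the gdb output in this issue. */
    static const char *required[] = {
        "PMIX_NAMESPACE", "PMIX_RANK", "PMIX_SERVER_URI", "PMIX_SECURITY_MODE"
    };
    size_t i;

    for (i = 0; i < sizeof(required) / sizeof(required[0]); i++) {
        char *val = find_param(list, required[i]);
        if (NULL == val) {
            return -1;
        }
        free(val);
    }
    return 0;
}
```

With this approach, a list with four entries and a list with a fifth PMIX_DSTORE_ESH_BASE_PATH entry both pass, while a list missing any of the assumed required keys is still rejected.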

The master branch seems to have a fix already, probably a1e8e58. Would cherry-picking this commit be sufficient?
