multiple fences issue in pmix #388

@elenash

If multiple fences run across the same processes (possibly across several jobs), the flags of an orte_proc_t object can hold stale values, because nothing ever unsets them.

I found this by running a simple test that performs two fences (the first inside MPI_Init, the second called explicitly) with the shared memory dstore enabled. It may also affect spawned jobs and, I guess, lead to problems with other flags (like DATA_RECVD).

The flow leading to problems with the shared memory dstore is the following:
During MPI_Init, the first fence with data exchange sets the DATA_IN_SM flag for every process participating in it. After that, new data is put explicitly and a new fence is launched. Since the DATA_IN_SM flag is already set for each of these processes, a pmi_get that tries to fetch the new data for such a process returns immediately, although the data is not in the shared memory dstore yet. If the fence release callback runs before the get command is handled, the data is not stored either, because of the same flag.
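The flow above can be reproduced in a small self-contained model. This is only a sketch: `demo_proc_t`, `DEMO_FLAG_DATA_IN_SM`, and all `demo_*` functions are hypothetical stand-ins, not the real ORTE or PMIx code, and the "generation" counter is an assumption used to tell old data from new.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real ORTE definitions. */
#define DEMO_FLAG_DATA_IN_SM 0x01u

typedef struct {
    unsigned flags;     /* sticky per-process flags; nothing clears them */
    int      published; /* generation of the data the process last put */
    int      in_sm;     /* generation held in the dstore, 0 means none */
} demo_proc_t;

/* The process publishes a new generation of data. */
void demo_put(demo_proc_t *p)
{
    assert(p != NULL);
    p->published++;
}

/* Copy the published data into the dstore unless the flag says it is
 * already there. Returns true if a store actually happened. */
bool demo_store(demo_proc_t *p)
{
    assert(p != NULL);
    if (p->flags & DEMO_FLAG_DATA_IN_SM) {
        return false; /* skipped: the flag claims the data is present */
    }
    p->in_sm = p->published;
    p->flags |= DEMO_FLAG_DATA_IN_SM;
    return true;
}

/* Fence release: try to store the data of every participant. */
void demo_fence(demo_proc_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        demo_store(&procs[i]);
    }
}

/* Get: trusts the flag and returns whatever generation the dstore holds. */
int demo_get(demo_proc_t *p)
{
    assert(p != NULL);
    if (!(p->flags & DEMO_FLAG_DATA_IN_SM)) {
        demo_store(p);
    }
    return p->in_sm;
}
```

Starting from a zero-initialized `demo_proc_t`, the sequence put, fence, get yields generation 1. A second put and fence leaves the dstore untouched, so `demo_get` still reports generation 1 while `published` is 2.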

The DATA_IN_SM flag is meant to prevent the same data from being stored multiple times for a process, as we discussed earlier, but it turns out that it doesn't work properly.

I tried to fix it by unsetting the DATA_IN_SM flag at the start of a fence for all processes participating in the signature or, if the signature is null, for all processes in the job. This helped at small scale, but at large scale there are other flows that this fix does not cover. Moreover, if several fences run in parallel for the same processes, this solution no longer works. It looks like we should come up with another solution for dealing with multiple fences.
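A rough sketch of why clearing the flag at fence start may break down with overlapping fences. The interleaving shown is an assumption (one plausible ordering, not one confirmed above), and `sketch_proc_t`, `SKETCH_FLAG_DATA_IN_SM`, and the `sketch_*` functions are hypothetical names, not ORTE code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real ORTE definitions. */
#define SKETCH_FLAG_DATA_IN_SM 0x01u

typedef struct {
    unsigned flags; /* single flag word shared by all fences on this process */
    int      in_sm; /* id of the fence whose data sits in the dstore, 0 = none */
} sketch_proc_t;

/* Attempted fix: clear the flag when a fence starts. */
void sketch_fence_start(sketch_proc_t *p)
{
    assert(p != NULL);
    p->flags &= ~SKETCH_FLAG_DATA_IN_SM;
}

/* Fence release: store this fence's data unless the flag is already set.
 * Returns true if the data was stored. */
bool sketch_fence_release(sketch_proc_t *p, int fence_id)
{
    assert(p != NULL);
    if (p->flags & SKETCH_FLAG_DATA_IN_SM) {
        return false; /* another fence set the flag first */
    }
    p->in_sm = fence_id;
    p->flags |= SKETCH_FLAG_DATA_IN_SM;
    return true;
}
```

When the fences run one after another (start 1, release 1, start 2, release 2), the dstore ends up with the data of fence 2. When they overlap (start 1, start 2, release 1, release 2), the release of fence 1 sets the flag and the release of fence 2 is skipped, because one flag bit cannot distinguish which fence the stored data belongs to.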
