Correctness failure when using BTL RDMA #3685

Closed
@vspetrov

Background information

Possibly related to the "patcher" memory framework

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v2.0.x
v2.x
master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source, cloned from GitHub

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  • Network type: Mellanox InfiniBand, CIB

Details of the problem

Multithreaded correctness test (attached: mt_stress.zip) fails with OMPI.
Reproduced on 2 nodes.

shell$ mpirun -np 3 --map-by node -mca pml ob1 -mca btl openib,self   `nif 50`    -mca coll ^hcoll ./mt_stress 1497004394

...
Splitting id 124
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 2, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 0, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 1, count 15045, comm_size 3, color 1
Splitting id 125
Splitting id 126
...

This is an allreduce failure. After some debugging I narrowed it down to a single p2p transfer inside the allreduce: one rank sends the data to the other side, but the data arrives corrupted for some reason.

The test passes with "-mca mpi_leave_pinned 0" OR if Open MPI is built without memory manager support (--without-memory-manager). This is why I suspect the "patcher" memory framework.

Additionally, the same issue is observed with pml yalla (Mellanox MXM-based p2p). Again, disabling memory notifications (MXM_MEM_ON_DEMAND_MAP=n) helps.
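
For anyone trying to reproduce this, the configurations compared above can be lined up side by side. The script below is only a sketch: it prints the three command lines (set RUN=1 to actually execute them). It assumes ./mt_stress is the binary from the attached archive, that the MXM variable is passed to the ranks with mpirun's -x option, and it leaves out the site-specific `nif 50` helper from the original command, so interface selection may need to be added back.

```shell
#!/bin/sh
# Sketch: print (or run) the configurations compared in this report.
# Assumptions: ./mt_stress and the seed are taken from the report; the
# `nif 50` helper from the original command line is site-specific and omitted.
SEED=1497004394
BASE="mpirun -np 3 --map-by node -mca coll ^hcoll"

# 1. Failing case: ob1 over openib with the default leave-pinned behaviour.
FAILING="$BASE -mca pml ob1 -mca btl openib,self ./mt_stress $SEED"

# 2. Reported as passing: same transport, leave-pinned disabled.
NO_PIN="$BASE -mca pml ob1 -mca btl openib,self -mca mpi_leave_pinned 0 ./mt_stress $SEED"

# 3. Reported as passing with yalla: MXM memory notifications disabled,
#    with the variable exported to all ranks via -x.
YALLA="$BASE -mca pml yalla -x MXM_MEM_ON_DEMAND_MAP=n ./mt_stress $SEED"

for cmd in "$FAILING" "$NO_PIN" "$YALLA"; do
    if [ "${RUN:-0}" = "1" ]; then
        $cmd            # unquoted on purpose so the string splits into arguments
    else
        echo "$cmd"     # default: dry run, just show the command line
    fi
done
```

If only the first command reports CORRECTNESS ERROR lines, that matches the behaviour described here and points at the memory hooks rather than at the transport itself.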

Since "patcher" was not present in Open MPI v1.10, I wanted to run the test with that version. btl openib wouldn't work, since it didn't support MPI_THREAD_MULTIPLE in 1.10; however, pml yalla works without errors with 1.10.
