Description
Background information
Possibly related to the "patcher" memory framework
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
- v2.0.x
- v2.x
- master
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source, cloned from GitHub
Please describe the system on which you are running
- Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
- Computer hardware: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
- Network type: Mellanox InfiniBand (Connect-IB)
Details of the problem
Multithreaded correctness test (attached: mt_stress.zip) fails with OMPI.
Reproduced on 2 nodes.
shell$ mpirun -np 3 --map-by node -mca pml ob1 -mca btl openib,self `nif 50` -mca coll ^hcoll ./mt_stress 1497004394
...
Splitting id 124
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 2, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 0, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 1, count 15045, comm_size 3, color 1
Splitting id 125
Splitting id 126
...
This is an allreduce failure. After some debugging I narrowed it down to a single point-to-point exchange inside the allreduce: one rank sends the data to the other side, but the data arrives corrupted for some reason.
The test passes with "-mca mpi_leave_pinned 0", or if Open MPI is built without memory manager support (--without-memory-manager). This is why my suspicion falls on the "patcher" memory framework.
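For reference, a sketch of the two workarounds (the mpirun line mirrors the one above, minus the local `nif 50` helper; the configure invocation is only an assumption about how the no-memory-manager build would be done):

shell$ mpirun -np 3 --map-by node -mca pml ob1 -mca btl openib,self -mca coll ^hcoll -mca mpi_leave_pinned 0 ./mt_stress 1497004394
# or rebuild Open MPI with the memory manager disabled (other configure options as in the original build):
shell$ ./configure --without-memory-manager && make && make install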
Additionally, the same issue is observed with pml yalla (Mellanox MXM-based p2p). Again, disabling memory notifications (MXM_MEM_ON_DEMAND_MAP=n) helps.
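For completeness, the yalla runs (these command lines are my assumption of the equivalent invocation; only the pml selection and the exported MXM variable differ from the ob1 run above):

shell$ mpirun -np 3 --map-by node -mca pml yalla -mca coll ^hcoll ./mt_stress 1497004394
# fails with the same kind of correctness errors; passes when MXM on-demand memory mapping is disabled:
shell$ mpirun -np 3 --map-by node -mca pml yalla -mca coll ^hcoll -x MXM_MEM_ON_DEMAND_MAP=n ./mt_stress 1497004394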
Since "patcher" was not present in ompi_v1.10 i wanted to try test with that version. btl openib wouldn't work since it didn't support mpi_thread_multiple in 1.10 however, pml yalla works w/o errors with 1.10.