@jladd-mlnx sent me mail a looooong time ago indicating that ummunotify code paths in OMPI are broken:
On Dec 11, 2013, at 10:51 AM, Joshua Ladd wrote:
Gentlemen,
ummunotify was recently added to MOFED; as a result, we are now observing rcache errors on our ConnectX-3 class of HCAs with both the OMPI 1.7.x and 1.6.x series. As far as I can tell, there is no way to disqualify ummunotify at configure time, and the default runtime behavior is "-1" (enable it if you have it). The error goes away when we pass "-mca memory_linux_ummunotify_enable 0". We would like to disable ummunotify by default until this can be resolved. Do you guys have any objections to this?
Thanks,
Josh
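For reference, a minimal sketch of the ways the workaround Josh mentions can be applied (the parameter name comes from his mail; the file locations are the standard MCA parameter files for a default install):

# per job, on the mpirun command line (what Josh reports works):
$ mpirun -np 16 -mca memory_linux_ummunotify_enable 0 ./IMB-MPI1 -npmin 16 Bcast

# per shell, via the standard OMPI_MCA_ environment variable prefix:
$ export OMPI_MCA_memory_linux_ummunotify_enable=0

# per user, via the MCA parameter file:
$ echo "memory_linux_ummunotify_enable = 0" >> $HOME/.openmpi/mca-params.conf

# installation-wide (assumes the default prefix layout):
$ echo "memory_linux_ummunotify_enable = 0" >> <prefix>/etc/openmpi-mca-params.conf

On the 1.6/1.7 series, "ompi_info --param memory linux" should show the parameter and whichever default/override is in effect.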
joshual@mir13 ~/ompi_1.6/openmpi-1.6.5 $mpirun -np 16 --display-map -bynode --bind-to-core -mca btl openib,sm,self -mca coll tuned,basic -mca btl_openib_warn_default_gid_prefix 0 -mca btl_openib_if_include mlx4_0:1 ~/IMB/src/IMB-MPI1 -npmin 16 Bcast
======================== JOB MAP ========================
Data for node: mir5 Num procs: 8
Process OMPI jobid: [16646,1] Process rank: 0
Process OMPI jobid: [16646,1] Process rank: 2
Process OMPI jobid: [16646,1] Process rank: 4
Process OMPI jobid: [16646,1] Process rank: 6
Process OMPI jobid: [16646,1] Process rank: 8
Process OMPI jobid: [16646,1] Process rank: 10
Process OMPI jobid: [16646,1] Process rank: 12
Process OMPI jobid: [16646,1] Process rank: 14
Data for node: mir6 Num procs: 8
Process OMPI jobid: [16646,1] Process rank: 1
Process OMPI jobid: [16646,1] Process rank: 3
Process OMPI jobid: [16646,1] Process rank: 5
Process OMPI jobid: [16646,1] Process rank: 7
Process OMPI jobid: [16646,1] Process rank: 9
Process OMPI jobid: [16646,1] Process rank: 11
Process OMPI jobid: [16646,1] Process rank: 13
Process OMPI jobid: [16646,1] Process rank: 15
=============================================================
benchmarks to run Bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
# Date : Tue Dec 10 19:28:39 2013
# Machine : x86_64
# System : Linux
# Release : 2.6.32-358.el6.x86_64
# Version : #1 SMP Tue Jan 29 11:47:41 EST 2013
# MPI Version : 2.1
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /hpc/home/USERS/joshual/IMB/src/IMB-MPI1 -npmin 16 Bcast
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 16
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1         1000         3.53         3.53         3.53
            2         1000         3.04         3.04         3.04
            4         1000         3.01         3.01         3.01
            8         1000         3.07         3.07         3.07
           16         1000         3.06         3.06         3.06
           32         1000         3.23         3.23         3.23
           64         1000         3.35         3.35         3.35
          128         1000         3.98         3.99         3.98
          256         1000         4.31         4.31         4.31
          512         1000         4.60         4.60         4.60
         1024         1000         5.31         5.31         5.31
         2048         1000         7.48         7.49         7.49
         4096         1000        11.55        11.57        11.56
         8192         1000        19.91        19.93        19.93
        16384         1000        36.06        36.11        36.09
        32768         1000        76.52        76.57        76.55
        65536          640       137.53       137.67       137.61
       131072          320       271.64       272.09       271.95
       262144          160       544.04       546.03       545.50
       524288           80      1701.00      1725.24      1720.52
      1048576           40      3385.70      3427.60      3419.10
      2097152           20      6866.00      6943.45      6924.95
-----------------------------------------------------------------
Open MPI intercepted a call to free memory that is still being used
by an ongoing MPI communication. This usually reflects an error in
the MPI application; it may signify memory corruption. Open MPI will
now abort your job.
Mpool name: rdma
Local host: mir6.vbench.com
Buffer address: 0x7ffff0200000
Buffer size: 16384
---------------------------------------------------------------
mpirun has exited due to process rank 5 with PID 3339 on node