
ummunotify code paths broken #429

Closed

@jsquyres

Description

@jladd-mlnx sent me mail a looooong time ago indicating that ummunotify code paths in OMPI are broken:

On Dec 11, 2013, at 10:51 AM, Joshua Ladd wrote:

Gentlemen,

ummunotify was recently added to MOFED; as a result, we are now observing rcache errors on our ConnectX-3 class of HCAs with both the OMPI 1.7.X and 1.6.X series. As far as I can tell, there is no way to disqualify ummunotify at configure time, and the default runtime behavior is "-1" (enable it if you have it). The error goes away when we pass "-mca memory_linux_ummunotify_enable 0". We would like to disable ummunotify by default until this can be resolved. Do you guys have any objections to this?

Thanks,

Josh

joshual@mir13 ~/ompi_1.6/openmpi-1.6.5 $mpirun -np 16 --display-map -bynode --bind-to-core -mca btl openib,sm,self -mca coll tuned,basic -mca btl_openib_warn_default_gid_prefix 0 -mca btl_openib_if_include  mlx4_0:1 ~/IMB/src/IMB-MPI1 -npmin 16 Bcast

========================   JOB MAP   ========================

 Data for node: mir5     Num procs: 8
     Process OMPI jobid: [16646,1] Process rank: 0
     Process OMPI jobid: [16646,1] Process rank: 2
     Process OMPI jobid: [16646,1] Process rank: 4
     Process OMPI jobid: [16646,1] Process rank: 6
     Process OMPI jobid: [16646,1] Process rank: 8
     Process OMPI jobid: [16646,1] Process rank: 10
     Process OMPI jobid: [16646,1] Process rank: 12
     Process OMPI jobid: [16646,1] Process rank: 14

 Data for node: mir6     Num procs: 8
     Process OMPI jobid: [16646,1] Process rank: 1
     Process OMPI jobid: [16646,1] Process rank: 3
     Process OMPI jobid: [16646,1] Process rank: 5
     Process OMPI jobid: [16646,1] Process rank: 7
     Process OMPI jobid: [16646,1] Process rank: 9
     Process OMPI jobid: [16646,1] Process rank: 11
     Process OMPI jobid: [16646,1] Process rank: 13
     Process OMPI jobid: [16646,1] Process rank: 15


=============================================================
 benchmarks to run Bcast
 #---------------------------------------------------
 #    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
 #---------------------------------------------------
 # Date                  : Tue Dec 10 19:28:39 2013
 # Machine               : x86_64
 # System                : Linux
 # Release               : 2.6.32-358.el6.x86_64
 # Version               : #1 SMP Tue Jan 29 11:47:41 EST 2013
 # MPI Version           : 2.1
 # MPI Thread Environment:

 # New default behavior from Version 3.2 on:

 # the number of iterations per message size is cut down
 # dynamically when a certain run time (per message size sample)
 # is expected to be exceeded. Time limit is defined by variable
 # "SECS_PER_SAMPLE" (=> IMB_settings.h) 
 # or through the flag => -time



 # Calling sequence was:

 # /hpc/home/USERS/joshual/IMB/src/IMB-MPI1 -npmin 16 Bcast

 # Minimum message length in bytes:   0
 # Maximum message length in bytes:   4194304
 #
 # MPI_Datatype                   :   MPI_BYTE
 # MPI_Datatype for reductions    :   MPI_FLOAT
 # MPI_Op                         :   MPI_SUM
 #
 #

 # List of Benchmarks to run:

 # Bcast

 #----------------------------------------------------------------
 # Benchmarking Bcast
 # #processes = 16
 #----------------------------------------------------------------
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           0         1000         0.02         0.02         0.02
           1         1000         3.53         3.53         3.53
           2         1000         3.04         3.04         3.04
           4         1000         3.01         3.01         3.01
           8         1000         3.07         3.07         3.07
          16         1000         3.06         3.06         3.06
          32         1000         3.23         3.23         3.23
          64         1000         3.35         3.35         3.35
         128         1000         3.98         3.99         3.98
         256         1000         4.31         4.31         4.31
         512         1000         4.60         4.60         4.60
        1024         1000         5.31         5.31         5.31
        2048         1000         7.48         7.49         7.49
        4096         1000        11.55        11.57        11.56
        8192         1000        19.91        19.93        19.93
       16384         1000        36.06        36.11        36.09
       32768         1000        76.52        76.57        76.55
       65536          640       137.53       137.67       137.61
      131072          320       271.64       272.09       271.95
      262144          160       544.04       546.03       545.50
      524288           80      1701.00      1725.24      1720.52
     1048576           40      3385.70      3427.60      3419.10
     2097152           20      6866.00      6943.45      6924.95
 -----------------------------------------------------------------
 Open MPI intercepted a call to free memory that is still being used
 by an ongoing MPI communication.  This usually reflects an error in
 the MPI application; it may signify memory corruption.  Open MPI will
 now abort your job.

 Mpool name:     rdma
 Local host:     mir6.vbench.com
 Buffer address: 0x7ffff0200000
 Buffer size:    16384
 ---------------------------------------------------------------
 mpirun has exited due to process rank 5 with PID 3339 on node
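For context, the abort comes from the rdma mpool's "free of memory that is still being used" check. The kind of application error that check is meant to catch looks roughly like the hypothetical sketch below (this is NOT from IMB or the Open MPI sources, just an illustration):

```c
/* Hypothetical illustration -- not taken from IMB or Open MPI.
 * Roughly the usage error the "free of in-use memory" check targets:
 * releasing a buffer while a nonblocking operation still references it. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(16384);   /* same size as the buffer flagged above */
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(buf, 16384, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
        free(buf);               /* ERROR: MPI still owns this buffer; the
                                    memory hooks can intercept this free and
                                    abort, as in the output above */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, 16384, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        free(buf);
    } else {
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

IMB-MPI1 Bcast does not do anything like this (buffers are only released after the blocking collective returns), which supports Josh's diagnosis that the ummunotify-driven registration-cache invalidation is misfiring rather than the benchmark being at fault.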

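Until the code path is fixed, the workaround Josh describes can be applied in the usual ways an MCA parameter is set. A sketch is below; the MCA parameter name is taken verbatim from the report, while the environment-variable and file spellings assume the standard Open MPI MCA conventions, so double-check them against your install:

```sh
# Per-run, on the mpirun command line (as in Josh's report):
mpirun -np 16 -mca memory_linux_ummunotify_enable 0 ./IMB-MPI1 -npmin 16 Bcast

# Per-shell, via the MCA environment-variable convention (assumed spelling):
export OMPI_MCA_memory_linux_ummunotify_enable=0

# Site-wide default (assumed location), e.g. in
#   $prefix/etc/openmpi-mca-params.conf:
# memory_linux_ummunotify_enable = 0
```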