I'm seeing odd behavior when trying to launch small MPI jobs on master (as of Sun 13 Dec 2015, after @rhc54's update to pmix 1.1.2).
Here are the specs:
- RHEL 7.2
- TCP BTL
- ssh launcher (no SLURM or any other scheduler)
- (mostly) Default master build:
./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-vt --disable-mpi-fortran
- Yes, I built with libfabric/usnic, but I'm intentionally testing with the TCP BTL just to ensure something isn't wrong with the usnic BTL -- but I'm seeing the same behavior regardless of BTL selection
Here's what I'm launching:
$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c
The hostfile contains a bunch of lines like this: hostname slots=16
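As a quick sanity check that the hostfile really offers at least the 40 slots requested with `-np 40`, the `slots=` values can be summed. This is only a sketch: `total_slots` is a hypothetical helper (not part of Open MPI), and it counts only explicit `slots=N` entries, ignoring comment lines.

```shell
# total_slots FILE: sum the explicit slots=N values in a hostfile.
# Hypothetical helper for illustration; lines starting with '#' are skipped.
total_slots() {
    awk '
        /^[ \t]*#/ { next }
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) {
                    split($i, kv, "=")
                    total += kv[2]
                }
        }
        END { print total + 0 }
    ' "$1"
}
```

Running `total_slots hosts` prints a single number that can be compared against the `-np` value.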
Sometimes that runs fine, sometimes it results in the following:
$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 241
malloc debug: Request for 4 zeroed elements of size -1 failed (grpcomm_brks.c, 92)
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 170
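Because the failure is intermittent, it may help to measure how often it occurs. The following is a minimal sketch, assuming a POSIX shell; `count_failures` is a hypothetical helper (not part of Open MPI) that runs a command repeatedly and counts the runs whose output contains `ORTE_ERROR_LOG`.

```shell
# count_failures N command [args...]
# Hypothetical helper: run the command N times and report how many runs
# printed ORTE_ERROR_LOG.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Capture stdout and stderr together so the match does not depend
        # on which stream the message is written to.
        out=$("$@" 2>&1)
        case $out in
            *ORTE_ERROR_LOG*) failures=$((failures + 1)) ;;
        esac
        i=$((i + 1))
    done
    echo "$failures of $runs runs failed"
}
```

For example, `count_failures 20 mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c` would repeat the launch above 20 times and print a one-line summary.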
FWIW, I observed this same behavior this past Thursday (i.e., before the pmix 1.1.2 update), but didn't have the time to file a proper bug report. This suggests that the problem might be unrelated to the old-vs.-new PMIX...?
Here's a gist of a failed run with lots of verbosity enabled, in case it helps. Here's the command line used to launch that run:
$ mpirun \
    --mca ess_base_verbose 100 \
    --mca grpcomm_base_verbose 100 \
    --mca pmix_base_verbose 100 \
    --mca pml ob1 \
    --mca btl tcp,vader,self \
    --hostfile hosts \
    -np 40 \
    ring_c