grpcomm errors when launching on RHEL 7.2/ssh #1215

Closed
@jsquyres

I'm seeing odd behavior when trying to launch small MPI jobs on master (as of Sun 13 Dec 2015, after @rhc54's update to pmix 1.1.2).

Here are the specs:

  • RHEL 7.2
  • TCP BTL
  • ssh launcher (no SLURM or any other scheduler)
  • (mostly) Default master build: ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-vt --disable-mpi-fortran
    • Yes, I built with libfabric/usnic, but I'm intentionally testing with the TCP BTL just to ensure something isn't wrong with the usnic BTL -- but I'm seeing the same behavior regardless of BTL selection

Here's what I'm launching:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c

The hostfile contains a bunch of lines like this: hostname slots=16

Sometimes that runs fine, sometimes it results in the following:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254  
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 241  
malloc debug: Request for 4 zeroed elements of size -1 failed (grpcomm_brks.c, 92)
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 170
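
Since the failure is intermittent, it may help to measure how often it happens. The following is a minimal sketch, not something from the original report: the function name `count_failures` is hypothetical, and it assumes a "failed" run is any run whose combined stdout/stderr contains the string `ORTE_ERROR_LOG`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): run a command N times and
# report how many of those runs printed ORTE_ERROR_LOG.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Capture stdout and stderr together so the error text is seen
        # regardless of which stream it was written to.
        output=$("$@" 2>&1)
        case $output in
            *ORTE_ERROR_LOG*) failures=$((failures + 1)) ;;
        esac
        i=$((i + 1))
    done
    echo "$failures/$runs runs reported ORTE_ERROR_LOG"
}

# Example invocation with the command line from this report:
#   count_failures 20 mpirun --mca pml ob1 --mca btl tcp,vader,self \
#       --hostfile hosts -np 40 ring_c
```

The run count of 20 in the example is arbitrary; a larger count gives a better estimate of the failure rate.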

FWIW, I observed this same behavior this past Thursday (i.e., before the pmix 1.1.2 update), but didn't have time to file a proper bug report. This suggests that the problem might be unrelated to old-vs.-new pmix...?

Here's a gist of a failed run with lots of verbosity enabled, in case it helps. This is the command line used to launch that run:

$ mpirun \
    --mca ess_base_verbose 100 \
    --mca grpcomm_base_verbose 100 \
    --mca pmix_base_verbose 100 \
    --mca pml ob1 \
    --mca btl tcp,vader,self \
    --hostfile hosts \
    -np 40 \
    ring_c
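
With that much verbosity, the `ORTE_ERROR_LOG` lines are easy to lose in the output. As a rough sketch (the function name `summarize_orte_errors` is hypothetical, and it assumes the error lines keep the `ORTE_ERROR_LOG: <message> in file <file> at line <N>` shape shown above), something like this tallies the distinct source locations:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): read launcher output on stdin
# and print "<count> <file>:<line>" for each distinct source location that
# appears in an ORTE_ERROR_LOG message.
summarize_orte_errors() {
    # Keep only ORTE_ERROR_LOG lines and reduce each one to "file:line".
    sed -n 's/.*ORTE_ERROR_LOG: .* in file \(.*\) at line \([0-9][0-9]*\).*/\1:\2/p' |
        sort |
        uniq -c |
        # Normalize the leading padding that uniq -c adds to the count.
        awk '{ print $1, $2 }'
}

# Example invocation with the verbose command line from this report:
#   mpirun --mca ess_base_verbose 100 --mca grpcomm_base_verbose 100 \
#       --mca pmix_base_verbose 100 --mca pml ob1 --mca btl tcp,vader,self \
#       --hostfile hosts -np 40 ring_c 2>&1 | summarize_orte_errors
```

Lines that don't match the assumed shape (e.g., the `malloc debug` message) are simply dropped from the summary.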
