error launching/attaching LaunchMON debugger with OpenMPI 2.1.1 #3660

Closed
@lee218llnl

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v2.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source tarball

Please describe the system on which you are running

  • Operating system/version: RHEL7
  • Computer hardware: x86-64
  • Network type: infiniband

Details of the problem

I am having trouble launching and attaching with LaunchMON when using Open MPI 2.1.1. The launch test fails with the following output:

[LMON_FE] launching the job/daemons via /usr/workspace/wsrzd/lee218/install/toss_3_x86_64_ib/ompi-2.1.1/bin/orterun

[LMON FE] 6 RM types are supported
[warn] Epoll ADD(4) on fd 38 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 35 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[rzoz1:132008] [[21665,0],0] usock_peer_send_blocking: send() to socket 36 failed: Broken pipe (32)
[rzoz1:132008] [[21665,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[rzoz1:132008] [[21665,0],0]-[[21665,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
[rzoz1:132008] [[21665,0],0] usock_peer_send_blocking: send() to socket 40 failed: Broken pipe (32)
[rzoz1:132008] [[21665,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[rzoz1:132008] [[21665,0],0]-[[21665,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
--------------------------------------------------------------------------
orterun was unable to start the specified application as it encountered an
error:

Error name: Not supported
Node: rzoz1

when attempting to start process rank 0.
--------------------------------------------------------------------------
<Jun 06 11:51:50> <LMON FE API> (INFO): FE-ENGINE connection timed out: 120
[LMON FE] FAILED

Here's how you can reproduce (modify your PATH and the path to mpirun):

git clone https://github.com/llnl/launchmon.git
cd launchmon/
export PATH=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.1/bin:$PATH
CFLAGS="-g -O0" CXXFLAGS="-g -O0" ./configure \
    --prefix=/nfs/tmp2/lee218/prefix/launchmon-1.0.3b \
    --with-test-rm=orte \
    --with-test-rm-launcher=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.1/bin/mpirun \
    --with-test-installed \
    --with-test-nnodes=1 \
  && make clean \
  && make -j 8 install \
  && make -j 8 check
cd test/src
./test.launch_1
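
When adapting the steps above to another system, it may help to confirm that the mpirun found on PATH is the same one passed to --with-test-rm-launcher. A minimal sketch, assuming a POSIX shell; `launcher_path` is a hypothetical helper for illustration, not part of LaunchMON or Open MPI:

```shell
# Hypothetical helper: print what the shell resolves a command name to,
# or "not found" if the name cannot be resolved on PATH.
launcher_path() {
    command -v "$1" 2>/dev/null || echo "not found"
}

# Example usage:
#   launcher_path mpirun    # compare with the --with-test-rm-launcher path
#   mpirun --version        # reports the Open MPI release being picked up
```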

In addition, the LaunchMON "test.attach_1" test hangs when trying to attach. @rhc54 previously helped me with various debugger attach issues, and we ended up with a working commit. I don't know whether that commit made it into the release or whether this is a new issue. It would be nice if the LaunchMON tests could be integrated into the release testing.
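
Because test.attach_1 hangs rather than failing outright, any automated run of these tests would need a time limit. One possible sketch, assuming GNU coreutils `timeout` is available; `run_with_limit` and the 300-second limit are hypothetical and not part of LaunchMON:

```shell
# Hypothetical helper: run a command under a time limit and print
# PASS, FAIL, or HANG on stdout. The command's own stdout is sent to
# stderr so that only the verdict appears on stdout.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@" >&2
    status=$?
    case "$status" in
        0)   echo "PASS" ;;
        124) echo "HANG" ;;   # GNU timeout exits with 124 when the limit expires
        *)   echo "FAIL" ;;
    esac
}

# Example usage from launchmon/test/src (the limit is an assumption):
#   run_with_limit 300 ./test.launch_1
#   run_with_limit 300 ./test.attach_1
```

A test that itself exits with status 124 would be misreported as a hang, so this is only a rough classification.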
