Closed
Description
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v2.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from source tarball
Please describe the system on which you are running
- Operating system/version: RHEL7
- Computer hardware: x86-64
- Network type: infiniband
Details of the problem
I am having trouble attaching LaunchMON when using Open MPI 2.1.1. The LaunchMON "test.launch_1" test (reproduction steps are further down) fails with this output:
[LMON_FE] launching the job/daemons via /usr/workspace/wsrzd/lee218/install/toss_3_x86_64_ib/ompi-2.1.1/bin/orterun
[LMON FE] 6 RM types are supported
[warn] Epoll ADD(4) on fd 38 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 35 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[rzoz1:132008] [[21665,0],0] usock_peer_send_blocking: send() to socket 36 failed: Broken pipe (32)
[rzoz1:132008] [[21665,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[rzoz1:132008] [[21665,0],0]-[[21665,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
[rzoz1:132008] [[21665,0],0] usock_peer_send_blocking: send() to socket 40 failed: Broken pipe (32)
[rzoz1:132008] [[21665,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[rzoz1:132008] [[21665,0],0]-[[21665,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
--------------------------------------------------------------------------
orterun was unable to start the specified application as it encountered an
error:
Error name: Not supported
Node: rzoz1
when attempting to start process rank 0.
--------------------------------------------------------------------------
<Jun 06 11:51:50> <LMON FE API> (INFO): FE-ENGINE connection timed out: 120
[LMON FE] FAILED
Here's how you can reproduce (modify your PATH and the path to mpirun):
git clone https://github.com/llnl/launchmon.git
cd launchmon/
export PATH=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.1/bin:$PATH
CFLAGS="-g -O0" CXXFLAGS="-g -O0" ./configure \
    --prefix=/nfs/tmp2/lee218/prefix/launchmon-1.0.3b \
    --with-test-rm=orte \
    --with-test-rm-launcher=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.1/bin/mpirun \
    --with-test-installed \
    --with-test-nnodes=1 \
  && make clean && make -j 8 install && make -j 8 check
cd test/src
./test.launch_1
In addition, the LaunchMON "test.attach_1" test hangs when trying to attach. @rhc54 previously helped me with various debugger attach issues, and we ended up with a working commit. I don't know whether that fix made it into the 2.1.1 release or whether this is a new issue. It would be nice if the LaunchMON tests could be integrated into the release testing.
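For anyone reproducing this, a hang such as the one in "test.attach_1" can be bounded so that it is reported instead of blocking the shell. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available; the helper name `run_bounded` and any time limit passed to it are my own illustrative choices, not part of LaunchMON or Open MPI.

```shell
#!/bin/sh
# run_bounded is a hypothetical helper (not part of LaunchMON or Open MPI).
# It runs a command under a wall-clock limit and prints a one-word verdict.
run_bounded() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # GNU coreutils timeout exits with 124 when the limit is reached
        echo "HANG"
    elif [ "$status" -eq 0 ]; then
        echo "PASS"
    else
        echo "FAIL($status)"
    fi
}
```

From test/src this could be used as `run_bounded 300 ./test.launch_1` and `run_bounded 300 ./test.attach_1`, where 300 seconds is an arbitrary limit chosen to exceed the 120-second FE-ENGINE timeout seen in the log. Note that `timeout` only signals the command it started, so daemons spawned by the test may still need to be cleaned up by hand.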