Background information
Silent Mellanox Jenkins failures were observed recently.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
The failures seem to be observed for the GitHub v2.x branch only.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Regular Mellanox CI build
Please describe the system on which you are running
- Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
- Computer hardware: x86_64
- Network type: Mellanox mlx5 adapters
Details of the problem
The following command silently fails:
20:54:55 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 \
-bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
--get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
-x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
-mca pml ob1 -mca btl self,openib \
-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
20:54:55 [1499968495.528199] [jenkins03:1355 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.535609] [jenkins03:1354 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.534361] [jenkins03:1359 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.541761] [jenkins03:1356 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.552215] [jenkins03:1360 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.560606] [jenkins03:1361 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.562930] [jenkins03:1353 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.567548] [jenkins03:1363 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:56 + jenkins_cleanup
20:54:56 + echo 'Script exited with code = 1'
20:54:56 Script exited with code = 1
20:54:56 + rm -rf /tmp/tmp.8mj45mghXh
20:54:56 + echo 'rm -rf ... returned 0'
20:54:56 rm -rf ... returned 0
The expected output is:
21:43:05 Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
The same command with btl/tcp works fine:
$ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun --debug-daemons -np 8 \
    -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
    --get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
    -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
    -mca pml ob1 -mca btl self,tcp \
    -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
    /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[jenkins03:01400] [[15875,0],0] orted_cmd: received add_local_procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 8
MPIR_proctable:
(i, host, exe, pid) = (0, jenkins03, /usr/bin/taskset, 1416)
(i, host, exe, pid) = (1, jenkins03, /usr/bin/taskset, 1417)
(i, host, exe, pid) = (2, jenkins03, /usr/bin/taskset, 1419)
(i, host, exe, pid) = (3, jenkins03, /usr/bin/taskset, 1420)
(i, host, exe, pid) = (4, jenkins03, /usr/bin/taskset, 1421)
(i, host, exe, pid) = (5, jenkins03, /usr/bin/taskset, 1423)
(i, host, exe, pid) = (6, jenkins03, /usr/bin/taskset, 1428)
(i, host, exe, pid) = (7, jenkins03, /usr/bin/taskset, 1431)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
[jenkins03:01400] [[15875,0],0] orted_cmd: received exit cmd
[jenkins03:01400] [[15875,0],0] orted_cmd: all routes and children gone - exiting
Here is a more detailed log (with BTL verbosity enabled):
openib_failure.txt
The Mellanox Jenkins script has been updated to print the mpirun exit status so that in the future this behavior will not cause such confusion.
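For reference, a minimal sketch of how such exit-status reporting could look in a Jenkins shell script. This is illustrative only; the function name, the OMPI_INSTALL variable, and the trimmed mpirun options are assumptions and not the actual Mellanox CI code:

# Hypothetical sketch -- not the actual Mellanox Jenkins script.
# Run a test command, capture mpirun's exit status, and report it
# explicitly instead of letting the failure pass silently.
run_and_report() {
    "$@"
    local rc=$?
    echo "Command '$*' exited with code = $rc"
    if [ $rc -ne 0 ]; then
        echo "FAILURE: command returned non-zero exit status $rc"
        exit $rc
    fi
}

# OMPI_INSTALL is a placeholder for the per-build install prefix.
run_and_report "$OMPI_INSTALL/bin/mpirun" -np 8 -mca pml ob1 -mca btl self,openib \
    "$OMPI_INSTALL/examples/hello_c"

With this wrapper, a silent failure like the one above would still show the non-zero exit status in the Jenkins console instead of only the generic "Script exited with code = 1" line.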