
v2.x: XRC UCDM openib failures while running Mellanox CI #3890

@artpol84

Description

Background information

Silent Mellanox Jenkins failures were observed recently.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Failures seem to occur on the GitHub v2.x branch only.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Regular Mellanox CI build

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  • Computer hardware: x86_64
  • Network type: Mellanox mlx5 adapters

Details of the problem

The following command silently fails:

20:54:55 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 \
-bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
--get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
-x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
-mca pml ob1 -mca btl self,openib \
-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
20:54:55 [1499968495.528199] [jenkins03:1355 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.535609] [jenkins03:1354 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.534361] [jenkins03:1359 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.541761] [jenkins03:1356 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.552215] [jenkins03:1360 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.560606] [jenkins03:1361 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.562930] [jenkins03:1353 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.567548] [jenkins03:1363 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:56 + jenkins_cleanup
20:54:56 + echo 'Script exited with code = 1'
20:54:56 Script exited with code = 1
20:54:56 + rm -rf /tmp/tmp.8mj45mghXh
20:54:56 + echo 'rm -rf ... returned 0'
20:54:56 rm -rf ... returned 0

The expected output is:

21:43:05 Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)

The same command with btl/tcp works fine:

$ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun --debug-daemons -np 8 \
  -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
  --get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
  -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
  -mca pml ob1 -mca btl self,tcp \
  -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
  /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[jenkins03:01400] [[15875,0],0] orted_cmd: received add_local_procs
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
    (i, host, exe, pid) = (0, jenkins03, /usr/bin/taskset, 1416)
    (i, host, exe, pid) = (1, jenkins03, /usr/bin/taskset, 1417)
    (i, host, exe, pid) = (2, jenkins03, /usr/bin/taskset, 1419)
    (i, host, exe, pid) = (3, jenkins03, /usr/bin/taskset, 1420)
    (i, host, exe, pid) = (4, jenkins03, /usr/bin/taskset, 1421)
    (i, host, exe, pid) = (5, jenkins03, /usr/bin/taskset, 1423)
    (i, host, exe, pid) = (6, jenkins03, /usr/bin/taskset, 1428)
    (i, host, exe, pid) = (7, jenkins03, /usr/bin/taskset, 1431)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
[jenkins03:01400] [[15875,0],0] orted_cmd: received exit cmd
[jenkins03:01400] [[15875,0],0] orted_cmd: all routes and children gone - exiting

Here is a more detailed log (with btl verbosity enabled):
openib_failure.txt

The Mellanox Jenkins script has been updated to output the exit status, so that in the future this behavior will not cause such confusion.
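A minimal sketch of what such an exit-status-reporting wrapper could look like (the function name `run_and_report` and the messages are illustrative, not taken from the actual Mellanox Jenkins script):

```shell
#!/bin/sh
# Run a command, capture its exit status, and always print it,
# so that a silent failure (non-zero exit with no output, as in
# the failing mpirun above) is still visible in the CI log.
run_and_report() {
    "$@"
    rc=$?
    echo "Command '$*' exited with code = $rc"
    return $rc
}

# Example: a command that produces no output but exits non-zero.
run_and_report sh -c 'exit 1' || echo "CI step FAILED (rc != 0)"
```

Without the explicit echo, a step like this one leaves no trace in the log when it fails, which is exactly why the original failure went unnoticed.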
