Thank you for taking the time to submit an issue!
Background information
Running the OSU micro-benchmarks on a 10-node cluster, over Ethernet through a Cisco switch. We use UCX but have disabled RoCE because we can't enable PFC, so the goal is simply to get TCP working over UCX with HCOLL.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2rc2-1.55103
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from Mellanox OFED MLNX_OFED_LINUX-5.5-1.0.3.2 (OFED-5.5-1.0.3).
Please describe the system on which you are running
- Operating system/version: Linux seren-01 4.19.0-22-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
- Computer hardware: Dual core Intel with Mellanox Technologies MT27800 Family [ConnectX-5]
- Network type: 100G Ethernet
Details of the problem
Running the OSU benchmark as follows, I occasionally get a hang after the benchmark itself has finished. It appears to be stuck in the MPI finalize stage.
$ ./runalltoallbench.sh
/data/seren-01/fast/craco//openmpi-4.1.2rc2//tests/osu-micro-benchmarks-5.6.2//osu_alltoall
UCX_TLS=self,tcp,mm,cma
UCX_IB_GID_INDEX=2
UCX_IB_SL=2
UCX_NET_DEVICES=enp216s0
mpirun -v -map-by ppr:1:node --mca pml ucx -x UCX_TLS -x UCX_IB_GID_INDEX -x UCX_NET_DEVICES --mca oob_tcp_if_include enp216s0 --mca oob_base_verbose 0 --mca coll_hcoll_enable 1 -x HCOLL_VERBOSE -hostfile mpi_seren.txt /data/seren-01/fast/craco//openmpi-4.1.2rc2//tests/osu-micro-benchmarks-5.6.2//osu_alltoall -m 2:200000 -f
[1668732944.988683] [seren-01:39621:0] parser.c:1909 UCX WARN unused env variable: UCX_IB_SL (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.2
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
2 122.91 97.43 135.87 1000
4 118.29 92.21 134.38 1000
8 116.04 91.81 145.19 1000
16 124.69 98.19 140.96 1000
32 118.63 93.41 144.63 1000
64 117.68 93.65 149.82 1000
128 98.16 86.31 114.54 1000
256 107.25 93.40 123.33 1000
512 122.67 103.50 150.80 1000
1024 90.04 78.93 97.40 1000
2048 113.82 91.15 127.34 1000
4096 119.14 95.94 139.45 1000
8192 121.98 104.87 131.63 1000
16384 144.52 116.22 161.71 100
32768 192.41 159.97 207.95 100
65536 387.51 328.33 423.11 100
131072 699.19 646.84 755.95 100
...
<hang>
<ctrl-C>
$
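Since the hang is intermittent, it may help to measure how often it occurs. The following is a minimal sketch, not part of the original report: `repeat_until_hang` is a hypothetical helper name, and it assumes coreutils `timeout` is available (it exits with status 124 when the time limit is hit). Note that `timeout` terminates the job, so this only counts hangs; to attach a debugger, the hung run has to be left alive instead.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until one run exceeds the time limit
# (a suspected hang) or the attempt budget is used up.
# Usage: repeat_until_hang LIMIT_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
repeat_until_hang() {
    limit=$1
    attempts=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        status=0
        # coreutils timeout exits with 124 if COMMAND is still running at the limit
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            echo "attempt $i: no exit within ${limit}s (possible hang)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $attempts attempts"
    return 0
}
```

For example, `repeat_until_hang 120 20 ./runalltoallbench.sh` would run the benchmark script up to 20 times and stop at the first run that takes longer than 120 seconds. Both numbers are placeholders and would need to be chosen to comfortably exceed a normal run.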
If I attach GDB to the process (which is using 100% CPU), I get the following backtrace:
$ ps -ef | grep osu
ban115 39487 39484 0 08:51 pts/0 00:00:00 mpirun -v -map-by ppr:1:node --mca pml ucx -x UCX_TLS -x UCX_IB_GID_INDEX -x UCX_NET_DEVICES --mca oob_tcp_if_include enp216s0 --mca oob_base_verbose 0 --mca coll_hcoll_enable 1 -x HCOLL_VERBOSE -hostfile mpi_seren.txt /data/seren-01/fast/craco//openmpi-4.1.2rc2//tests/osu-micro-benchmarks-5.6.2//osu_alltoall -m 2:200000 -f
ban115 39500 39487 99 08:51 pts/0 00:00:39 /data/seren-01/fast/craco//openmpi-4.1.2rc2//tests/osu-micro-benchmarks-5.6.2//osu_alltoall -m 2 200000 -f
ban115 39519 7197 0 08:52 pts/9 00:00:00 grep osu
(venv) ban115@seren-01:/data/seren-01/fast/ban115/build/craco-python/mpitests$ gdb -p 39500
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 39500
[New LWP 39501]
[New LWP 39502]
[New LWP 39505]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f74a92d538f in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f74a92d538f in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f74a6ef8830 in ucs_event_set_wait () from /lib/libucs.so.0
#2 0x00007f74a6f42d1b in uct_tcp_iface_progress () from /lib/libuct.so.0
#3 0x00007f74a6f97c12 in ucp_worker_progress () from /lib/libucp.so.0
#4 0x00007f74a7027f10 in opal_common_ucx_mca_pmix_fence (worker=0x558a27d119d0) at common_ucx.c:394
#5 0x00007f74a70280c2 in opal_common_ucx_del_procs (procs=procs@entry=0x558a27e3d9c0, count=count@entry=10, my_rank=<optimized out>, max_disconnect=<optimized out>,
worker=<optimized out>) at common_ucx.c:470
#6 0x00007f74a7030889 in mca_pml_ucx_del_procs (procs=0x558a27eaa2b0, nprocs=10) at pml_ucx.c:499
#7 0x00007f74a940d857 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:338
#8 0x0000558a264f1fd6 in ?? ()
#9 0x00007f74a920009b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000558a264f12ea in ?? ()
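The backtrace above covers only the thread GDB stopped in, while the attach output lists three further LWPs, and the other ranks may be waiting somewhere else. A minimal sketch for collecting backtraces of every thread non-interactively, assuming `gdb` is installed on each node: `dump_backtraces` and the `bt.<pid>.txt` output names are hypothetical, while `-p`, `-batch`, `-ex`, and `thread apply all bt` are standard GDB options and commands.

```shell
#!/bin/sh
# Hypothetical helper: write an all-thread backtrace for each given PID
# to bt.<pid>.txt in the current directory.
# Usage: dump_backtraces PID [PID...]
dump_backtraces() {
    for pid in "$@"; do
        # -batch makes gdb run the -ex command, detach, and exit on its own
        gdb -p "$pid" -batch -ex "thread apply all bt" > "bt.$pid.txt" 2>&1
    done
}
```

On a node with a hung rank this could be invoked as `dump_backtraces $(pgrep -x osu_alltoall)`, which matches the benchmark processes by exact name and skips `mpirun`. Comparing the files from several nodes should show whether every rank is spinning in the same `opal_common_ucx_mca_pmix_fence` loop or whether one rank never reached finalize.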