Description
Hi, I've been seeing semi-rare hangs, crashes, and assertion failures with OpenMPI v5.0.0rc7. The culprit looks to me like ob1+uct (one of the two, or their combination).
I'm running the OSU collective benchmarks, and the problem shows up at either the 8 KB or the 16 KB message size, for what that's worth.
With a non-debug build I've observed hangs, crashes, or an abrupt performance drop; with a debug build, an assertion in ob1 fails.
I'm on OpenMPI v5.0.0rc7 with UCX v1.12.1.
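The two builds are configured the same way apart from the debug flag, roughly along these lines (prefixes/paths here are placeholders, not my exact configure lines):
$ ./configure --prefix=$PREFIX/openmpi-5x    --with-ucx=$UCX_DIR                  # release build
$ ./configure --prefix=$PREFIX/openmpi-5xdbg --with-ucx=$UCX_DIR --enable-debug   # debug build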
- I'm testing on two different clusters with IB/mlx5, and have hit the problem at several node counts (2, 4, 5, 8, 10)
- It has happened with osu_{allreduce,reduce,bcast} (non-exhaustive)
- It has happened with more than one collective component -- tuned (standalone), and various up/low module combinations under HAN
- I haven't seen it with pml/ucx, though I haven't tested that path as rigorously (see the commands right after this list for how I switch between the two)
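For concreteness, this is roughly how I select the two paths (a sketch; the coll-related flags vary per run, as described above):
$ mpirun --mca pml ob1 --mca btl sm,self,uct ... osu_allreduce   # ob1 + btl/uct: problematic
$ mpirun --mca pml ucx ... osu_allreduce                         # pml/ucx: no issues seen so far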
(1) Assert fail, tuned
# HAN simple tuned+tuned, debug build, ob1+uct, osu_allreduce
osu_allreduce: pml_ob1_sendreq.h:234: mca_pml_ob1_send_request_fini: Assertion `NULL == sendreq->rdma_frag' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib64/libpthread.so.0(+0x12c20)[0x1490a8538c20]
[ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1490a819837f]
[ 2] /lib64/libc.so.6(abort+0x127)[0x1490a8182db5]
[ 3] /lib64/libc.so.6(+0x2fa76)[0x1490a8190a76]
[ 4] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(+0x22d238)[0x1490a8f1a238]
[ 5] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(mca_pml_ob1_send+0x6aa)[0x1490a8f1bf80]
[ 6] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(ompi_coll_base_sendrecv_actual+0xd9)[0x1490a8e7aa14]
[ 7] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(+0x190216)[0x1490a8e7d216]
[ 8] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(ompi_coll_base_allreduce_intra_redscat_allgather+0x810)[0x1490a8e7f799]
[ 9] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x485)[0x1490a63da17a]
[10] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_do_this+0x1d5)[0x14fec20cbb4b]
[11] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(PMPI_Allreduce+0x425)[0x1490a8db8957]
[12] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(mca_coll_han_allreduce_intra_simple+0x181d)[0x14fec20f6b12]
[13] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(mca_coll_han_allreduce_intra_dynamic+0x375)[0x14fec2100bb5]
[14] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1490a8184493]
[15] osu_allreduce[0x401a65]
[16] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14fec3e6e493]
[17] osu_allreduce[0x401e6e]
*** End of error message ***
Multiple processes abort at the same assertion, though I only show the trace of one.
(2) Assert fail, libnbc
# HAN libnbc+tuned, debug build, ob1+uct, osu_allreduce
osu_allreduce: pml_ob1_sendreq.h:234: mca_pml_ob1_send_request_fini: Assertion `NULL == sendreq->rdma_frag' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib64/libpthread.so.0(+0x12c20)[0x151b718f0c20]
[ 1] /lib64/libc.so.6(gsignal+0x10f)[0x151b7155037f]
[ 2] /lib64/libc.so.6(abort+0x127)[0x151b7153adb5]
[ 3] /lib64/libc.so.6(+0x21c89)[0x151b7153ac89]
[ 4] /lib64/libc.so.6(+0x2fa76)[0x151b71548a76]
[ 5] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(+0x23ecb9)[0x151b722e3cb9]
[ 6] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(+0x23f329)[0x151b722e4329]
[ 7] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_libnbc.so(+0x8956)[0x151b7009f956]
[ 8] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1a3)[0x151b700a0d3b]
[ 9] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0xcc)[0x151b7009e74d]
[10] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libopen-pal.so.80(opal_progress+0x30)[0x151b72508179]
[11] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(+0xa1957)[0x151b72146957]
[12] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(ompi_request_default_wait_all+0x226)[0x151b72147704]
[13] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(+0x19cc5)[0x151b70077cc5]
[14] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(+0x17475)[0x151b70075475]
[15] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(mca_coll_han_allreduce_intra+0x1a11)[0x151b70076f73]
[16] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/openmpi/mca_coll_han.so(mca_coll_han_allreduce_intra_dynamic+0x375)[0x151b70083bb5]
[17] /p/project/deepsea/katevenis1/openmpi-5xdbg/lib/libmpi.so.80(PMPI_Allreduce+0x425)[0x151b72170957]
[18] osu_allreduce[0x401b0c]
[19] /lib64/libc.so.6(__libc_start_main+0xf3)[0x151b7153c493]
[20] osu_allreduce[0x40261e]
*** End of error message ***
(3) Segfault
# HAN simple tuned+tuned, release build, ob1+uct, osu_allreduce
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
==== backtrace (tid: 10862) ====
0 0x0000000000012c20 __funlockfile() :0
1 0x000000000018740f mca_pml_ob1_send_request_copy_in_out() ???:0
2 0x000000000018077e mca_pml_ob1_recv_frag_callback_ack() ???:0
3 0x00000000000acb14 mca_btl_uct_am_handler() ???:0
4 0x000000000004e0e5 uct_iface_invoke_am() /p/project/deepsea/katevenis1/ucx-git/src/uct/base/uct_iface.h:773
5 0x000000000004e0e5 uct_rc_mlx5_iface_common_am_handler() /p/project/deepsea/katevenis1/ucx-git/src/uct/ib/rc/accel/rc_mlx5.inl:427
6 0x000000000004e0e5 uct_rc_mlx5_iface_common_poll_rx() /p/project/deepsea/katevenis1/ucx-git/src/uct/ib/rc/accel/rc_mlx5.inl:1455
7 0x000000000004e0e5 uct_dc_mlx5_iface_progress() /p/project/deepsea/katevenis1/ucx-git/src/uct/ib/dc/dc_mlx5.c:270
8 0x000000000004e0e5 uct_dc_mlx5_iface_progress_ll() /p/project/deepsea/katevenis1/ucx-git/src/uct/ib/dc/dc_mlx5.c:285
9 0x00000000000ab462 mca_btl_uct_tl_progress.part.0() btl_uct_component.c:0
10 0x00000000000ab887 mca_btl_uct_component_progress() btl_uct_component.c:0
11 0x0000000000031744 opal_progress() ???:0
12 0x000000000017886c ompi_request_wait_completion() pml_ob1_isend.c:0
13 0x000000000017a919 mca_pml_ob1_send() ???:0
14 0x00000000000f9f5e ompi_coll_base_sendrecv_actual() ???:0
15 0x00000000000ff021 ompi_coll_base_allreduce_intra_redscat_allgather() ???:0
16 0x0000000000005b9b ompi_coll_tuned_allreduce_intra_dec_fixed() ???:0
17 0x000000000000bf3c mca_coll_han_allreduce_intra_simple() ???:0
18 0x000000000009ea14 MPI_Allreduce() ???:0
19 0x0000000000401a65 main() /p/project/deepsea/katevenis1/osu-micro-benchmarks/mpi/collective/osu_allreduce.c:114
20 0x0000000000023493 __libc_start_main() ???:0
21 0x0000000000401e6e _start() ???:0
=================================
Reproducibility is a bit hit-and-miss; I usually experiment with different combinations until it starts happening. My most reliable reproducer at the moment is:
$ while true; do mpirun --mca pml ob1 --mca btl sm,self,uct --mca coll basic,libnbc,han,tuned --mca coll_han_priority 97 --mca coll_han_use_simple_allreduce true osu_allreduce -m 32K; done
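To get more than the built-in signal-handler traces above, one thing I've been considering is wrapping each rank in a small gdb script, along these lines (untested sketch; the wrapper name is arbitrary):
$ cat > gdb-wrap.sh <<'EOF'
#!/bin/bash
# Run the real binary under gdb; dump a full backtrace when it stops (e.g. on the SIGABRT from the assert)
exec gdb -q -batch -ex run -ex 'bt full' --args "$@"
EOF
$ chmod +x gdb-wrap.sh
$ mpirun --mca pml ob1 --mca btl sm,self,uct --mca coll basic,libnbc,han,tuned --mca coll_han_priority 97 --mca coll_han_use_simple_allreduce true ./gdb-wrap.sh osu_allreduce -m 32K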
Any ideas about what might be wrong? Can someone else reproduce this? I'm not sure how to dig deeper, so any approaches that might yield more useful debugging output are welcome.
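For what it's worth, I can also re-run the reproducer with the framework verbosity turned up if that would help, e.g. (same flags as above plus the verbose knobs):
$ mpirun --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca pml ob1 --mca btl sm,self,uct --mca coll basic,libnbc,han,tuned --mca coll_han_priority 97 --mca coll_han_use_simple_allreduce true osu_allreduce -m 32K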