Description
Two Intel tests, MPI_Keyval1_c and MPI_Keyval1_f, hang in MPI_COMM_DUP, and the backtrace from one of the hung processes looks a bit strange:
$ mpirun -np 4 --mca btl vader,self ./MPI_Keyval1_c
[...hangs...]
A snapshot backtrace from a hung process is:
(gdb) bt
#0 0x00002aaaaaafecb2 in opal_list_remove_first (list=0x773508) at ../opal/class/opal_list.h:670
#1 0x00002aaaaaaff885 in ompi_comm_request_progress () at communicator/comm_request.c:114
#2 0x00002aaaab1ad39e in opal_progress () at runtime/opal_progress.c:221
#3 0x00002aaaaab215a1 in ompi_request_default_test_all (count=1, requests=0x779d10, completed=0x7fffffffc00c, statuses=0x0) at request/req_test.c:214
#4 0x00002aaabbfa5095 in NBC_Progress (handle=0x7741d8) at nbc.c:326
#5 0x00002aaabbfa30ec in ompi_coll_libnbc_progress () at coll_libnbc_component.c:242
#6 0x00002aaaab1ad39e in opal_progress () at runtime/opal_progress.c:221
#7 0x00002aaaaab21d64 in ompi_request_wait_completion (req=0x773580) at ../ompi/request/request.h:397
#8 0x00002aaaaab21da2 in ompi_request_default_wait (req_ptr=0x7fffffffc180, status=0x0) at request/req_wait.c:40
#9 0x00002aaaaaaf636a in ompi_comm_set (ncomm=0x7fffffffc1f8, oldcomm=0x78ef60, local_size=0, local_ranks=0x0, remote_size=0, remote_ranks=0x0, attr=0x76d6f0, errh=0x619040 <ompi_mpi_errors_return>, copy_topocomponent=true, local_group=0x76cd10, remote_group=0x778820) at communicator/comm.c:122
#10 0x00002aaaaaaf84b2 in ompi_comm_dup_with_info (comm=0x78ef60, info=0x0, newcomm=0x7fffffffc3c8) at communicator/comm.c:988
#11 0x00002aaaaaaf83f6 in ompi_comm_dup (comm=0x78ef60, newcomm=0x7fffffffc3c8) at communicator/comm.c:969
#12 0x00002aaaaab4d2cf in PMPI_Comm_dup (comm=0x78ef60, newcomm=0x7fffffffc3c8) at pcomm_dup.c:63
#13 0x0000000000402ba0 in main (argc=1, argv=0x7fffffffc528) at MPI_Keyval1_c.c:454
Notes:
- ompi_comm_dup_with_info() is essentially waiting on a request that never completes. This seems to be the real issue. The only communication this test does is duping communicators.
- A secondary issue (only noticed because of this snapshot backtrace): is opal_progress() allowed to call opal_progress()?
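To make the secondary question concrete, here is a minimal sketch of the nesting pattern visible in frames #2 through #6, where a progress callback tests a request and thereby re-enters the progress loop. It is not Open MPI code: every name in it (sketch_progress, register_callback, nested_callback, leaf_callback, MAX_DEPTH) is hypothetical, and the depth guard is an assumption about how such recursion could be bounded, not a description of what opal_progress() does.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only; none of these names
 * or limits come from Open MPI. */
#define MAX_CALLBACKS 4
#define MAX_DEPTH     2

typedef int (*progress_cb_t)(void);

static progress_cb_t callbacks[MAX_CALLBACKS];
static int num_callbacks  = 0;
static int progress_depth = 0;  /* current nesting level of sketch_progress() */
static int max_depth_seen = 0;  /* deepest nesting level observed so far */

/* Adds a callback to the progress loop; returns 0 on success, -1 if full. */
static int register_callback(progress_cb_t cb)
{
    if (num_callbacks >= MAX_CALLBACKS) {
        return -1;
    }
    callbacks[num_callbacks++] = cb;
    return 0;
}

/* Runs every registered callback once and returns the summed event count. */
static int sketch_progress(void)
{
    int events = 0;

    progress_depth++;
    assert(progress_depth <= MAX_DEPTH);  /* trips on unbounded re-entry */
    if (progress_depth > max_depth_seen) {
        max_depth_seen = progress_depth;
    }
    for (int i = 0; i < num_callbacks; i++) {
        events += callbacks[i]();
    }
    progress_depth--;
    return events;
}

/* A callback with no nested progress: always reports one event. */
static int leaf_callback(void)
{
    return 1;
}

/* A callback that re-enters the progress loop, as a request test inside a
 * progress callback would. It refuses to nest once already inside a nested
 * call (assumed guard). */
static int nested_callback(void)
{
    if (progress_depth > 1) {
        return 0;
    }
    return sketch_progress();
}
```

With leaf_callback and nested_callback registered in that order, one call to sketch_progress() nests exactly one level deep: the outer pass counts one event from the leaf and one from the nested pass. Without the guard in nested_callback, the recursion would not terminate.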
These two tests (the C and Fortran versions) do not hang on the v2.x branch.