Periodically, the Cisco MTT sees odd TCP BTL addressing failures.
For example, the one-sided TCP BTL failures from last night's MTT run on master (there are ORTE errors there, too -- ignore those): https://mtt.open-mpi.org/index.php?do_redir=2399
It looks like the TCP BTL rejects the incoming connection, which then causes a fragment size mismatch in OB1. Here's a stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 0x0000003370632925 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2aaab959a700 (LWP 1394))]
(gdb) bt
#0 0x0000003370632925 in raise () from /lib64/libc.so.6
#1 0x0000003370634105 in abort () from /lib64/libc.so.6
#2 0x000000337062ba4e in __assert_fail_base () from /lib64/libc.so.6
#3 0x000000337062bb10 in __assert_fail () from /lib64/libc.so.6
#4 0x00002aaaaacdbaec in mca_pml_ob1_put_completion (frag=0x781380,
rdma_size=26666) at pml_ob1_recvreq.c:197
#5 0x00002aaaaacd80b6 in mca_pml_ob1_recv_frag_callback_fin (btl=0x6dd500,
tag=73 'I', des=0x2aaab9299080, cbdata=0x0) at pml_ob1_recvfrag.c:434
#6 0x00002aaaab4d0108 in mca_btl_tcp_endpoint_recv_handler (sd=24, flags=2,
user=0x739220) at btl_tcp_endpoint.c:893
#7 0x00002aaaab512393 in event_persist_closure (base=0x65c4e0, ev=0x739510)
at event.c:1321
#8 0x00002aaaab5124a2 in event_process_active_single_queue (base=0x65c4e0,
activeq=0x65c9d0) at event.c:1365
#9 0x00002aaaab51276f in event_process_active (base=0x65c4e0) at event.c:1440
#10 0x00002aaaab512dc2 in opal_libevent2022_event_base_loop (base=0x65c4e0,
flags=1) at event.c:1644
#11 0x00002aaaab4cbe89 in mca_btl_tcp_progress_thread_engine (
obj=0x2aaaab8fc1c0 <mca_btl_tcp_progress_thread>)
at btl_tcp_component.c:781
#12 0x0000003370a079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00000033706e8b6d in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb)
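For reference, the assertion that fires in frame #4 checks that the RDMA size carried in the incoming FIN matches the length recorded in the fragment when the PUT was scheduled. Here is a minimal, self-contained sketch of that check -- the struct and function names are stand-ins of my own, not the actual OB1 source; only the asserted expression is taken from the failure output below:

#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the OB1 RDMA fragment (the real structure lives
   under ompi/mca/pml/ob1/); only rdma_length is shown because that is the
   field named in the failed assertion. */
typedef struct {
    uint64_t rdma_length;   /* length recorded when the RDMA PUT was set up */
} rdma_frag_sketch_t;

/* Sketch of the check at pml_ob1_recvreq.c:197: when the FIN for an RDMA PUT
   arrives, the size reported in the FIN must match the length recorded in
   the fragment; in a debug build a mismatch aborts. */
static void put_completion_sketch(rdma_frag_sketch_t *frag, uint64_t rdma_size)
{
    assert(rdma_size == frag->rdma_length);
    /* ...on success the completion would be accounted against the receive
       request and the fragment returned to its free list... */
}

In the trace above, rdma_size is 26666, so frag->rdma_length evidently held some other value.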
The configuration of this run was:
"CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --without-memory-manager
The corresponding output from the failed run was:
================ test_put6 ========== Sat Feb 25 03:08:56 2017
[mpi012:12152] btl: tcp: Incoming connection from 10.2.0.7 does not match known addresses for peer [[28385,1],0]. Drop !
[mpi007:01397] btl: tcp: Incoming connection from 10.2.0.12 does not match known addresses for peer [[28385,1],27]. Drop !
test_put6: pml_ob1_recvreq.c:197: mca_pml_ob1_put_completion: Assertion `(uint64_t) rdma_size == frag->rdma_length' failed.
[mpi007:01372] *** Process received signal ***
[mpi007:01372] Signal: Aborted (6)
[mpi007:01372] Signal code: (-6)
[mpi007:01372] [ 0] /lib64/libpthread.so.0[0x3370a0f710]
[mpi007:01372] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3370632925]
[mpi007:01372] [ 2] /lib64/libc.so.6(abort+0x175)[0x3370634105]
[mpi007:01372] [ 3] /lib64/libc.so.6[0x337062ba4e]
[mpi007:01372] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x337062bb10]
[mpi007:01372] [ 5] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libmpi.so.0(+0x22eaec)[0x2aaaaacdbaec]
[mpi007:01372] [ 6] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaacd80b6]
[mpi007:01372] [ 7] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0xca108)[0x2aaaab4d0108]
[mpi007:01372] [ 8] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c393)[0x2aaaab512393]
[mpi007:01372] [ 9] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c4a2)[0x2aaaab5124a2]
[mpi007:01372] [10] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c76f)[0x2aaaab51276f]
[mpi007:01372] [11] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x298)[0x2aaaab512dc2]
[mpi007:01372] [12] /home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0xc5e89)[0x2aaaab4cbe89]
[mpi007:01372] [13] /lib64/libpthread.so.0[0x3370a079d1]
[mpi007:01372] [14] /lib64/libc.so.6(clone+0x6d)[0x33706e8b6d]
[mpi007:01372] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: mpi012
Local PID: 12152
Peer host: mpi007
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi007 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[mpi007:01352] 47 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[mpi007:01352] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Note that 10.2.0.7 and 10.2.0.12 are valid IP addresses for MPI processes in this cluster, and actually correspond to the nodes that this SLURM MTT job was running on (mpi007 and mpi012). Here's an ifconfig from mpi002:
eth6 Link encap:Ethernet HWaddr 24:57:20:02:50:00
inet addr:10.3.0.2 Bcast:10.3.255.255 Mask:255.255.0.0
inet6 addr: fe80::2657:20ff:fe02:5000/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:543911395 errors:0 dropped:2341 overruns:0 frame:0
TX packets:558728146 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:707847811000 (659.2 GiB) TX bytes:779662168466 (726.1 GiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1198187857 errors:0 dropped:0 overruns:0 frame:0
TX packets:1198187857 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2382222734575 (2.1 TiB) TX bytes:2382222734575 (2.1 TiB)
lom0 Link encap:Ethernet HWaddr A4:4C:11:2A:72:68
inet addr:10.0.8.2 Bcast:10.0.255.255 Mask:255.255.0.0
inet6 addr: fe80::a64c:11ff:fe2a:7268/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:32785162 errors:0 dropped:0 overruns:0 frame:0
TX packets:4297768 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7981061130 (7.4 GiB) TX bytes:682224306 (650.6 MiB)
Memory:cad00000-cae00000
vic20 Link encap:Ethernet HWaddr FC:99:47:25:2C:13
inet addr:10.10.0.2 Bcast:10.10.255.255 Mask:255.255.0.0
inet6 addr: fe80::fe99:47ff:fe25:2c13/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:595480496 errors:0 dropped:0 overruns:0 frame:0
TX packets:628872360 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1160229624365 (1.0 TiB) TX bytes:1379139113686 (1.2 TiB)
vic21 Link encap:Ethernet HWaddr FC:99:47:25:2C:14
inet addr:10.2.0.2 Bcast:10.2.255.255 Mask:255.255.0.0
inet6 addr: fe80::fe99:47ff:fe25:2c14/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:508473520 errors:0 dropped:0 overruns:0 frame:0
TX packets:528095836 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:526753524743 (490.5 GiB) TX bytes:607706247980 (565.9 GiB)
Hence, 10.2.0.7 and 10.2.0.12 should both be valid IP addresses for MPI processes in this job.
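For context, the "Drop !" message means the TCP BTL compared the source address of an accepted connection against the addresses the peer published at startup and found no match. The following is only a conceptual sketch of that comparison, with hypothetical names -- it is not the actual btl_tcp code:

#include <arpa/inet.h>
#include <stdbool.h>
#include <stddef.h>

/* Conceptual sketch: accept an incoming connection only if its source
   address appears in the list of addresses the peer exported at startup.
   When nothing matches, the BTL logs the "does not match known addresses
   for peer ... Drop !" message and closes the socket. */
static bool peer_address_matches(const struct in_addr *incoming,
                                 const struct in_addr *published,
                                 size_t num_published)
{
    for (size_t i = 0; i < num_published; ++i) {
        if (published[i].s_addr == incoming->s_addr) {
            return true;    /* known address: accept the connection */
        }
    }
    return false;           /* unknown address: caller drops the connection */
}

Given the ifconfig above, both 10.2.0.7 and 10.2.0.12 look like they should have been on the published lists for mpi007 and mpi012, so whichever side of that comparison is failing, the drop looks wrong.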
I'm a little confused about what is happening at the abort point in the code: it's in the CONNECTED case, so the complaint must be about some other fragment -- not the "bad" inbound connection.
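To spell that out: the recv handler in frame #6 dispatches on the endpoint's connection state, and the assertion fires from the already-connected branch -- i.e., while draining a fragment from an endpoint whose connection was accepted earlier, not while rejecting the new incoming connection. A purely illustrative sketch of that shape (hypothetical names, not the actual btl_tcp_endpoint.c code):

/* Purely illustrative shape of the endpoint recv handler in frame #6. */
typedef enum { SKETCH_NOT_CONNECTED, SKETCH_CONNECTED } sketch_state_t;

static void recv_handler_sketch(sketch_state_t endpoint_state)
{
    switch (endpoint_state) {
    case SKETCH_NOT_CONNECTED:
        /* connection setup / handshake handling for this endpoint */
        break;
    case SKETCH_CONNECTED:
        /* a complete fragment is read from the socket and handed to the
           registered PML callback (mca_pml_ob1_recv_frag_callback_fin in
           the trace), which is where the rdma_size assertion fails */
        break;
    }
}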
@bosilca Can you have a look?