master: TCP BTL addressing fail #3035

@jsquyres

Description

Periodically, the Cisco MTT sees odd TCP BTL addressing failures.

For example, see the one-sided TCP BTL failures from last night's MTT run on master (there are ORTE errors there, too -- ignore those): https://mtt.open-mpi.org/index.php?do_redir=2399

It looks like the TCP BTL rejects the incoming connection, which then causes a frag size mismatch in OB1. Here's a stack trace:

Program terminated with signal SIGABRT, Aborted.
#0  0x0000003370632925 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2aaab959a700 (LWP 1394))]
(gdb) bt
#0  0x0000003370632925 in raise () from /lib64/libc.so.6
#1  0x0000003370634105 in abort () from /lib64/libc.so.6
#2  0x000000337062ba4e in __assert_fail_base () from /lib64/libc.so.6
#3  0x000000337062bb10 in __assert_fail () from /lib64/libc.so.6
#4  0x00002aaaaacdbaec in mca_pml_ob1_put_completion (frag=0x781380, 
    rdma_size=26666) at pml_ob1_recvreq.c:197
#5  0x00002aaaaacd80b6 in mca_pml_ob1_recv_frag_callback_fin (btl=0x6dd500, 
    tag=73 'I', des=0x2aaab9299080, cbdata=0x0) at pml_ob1_recvfrag.c:434
#6  0x00002aaaab4d0108 in mca_btl_tcp_endpoint_recv_handler (sd=24, flags=2, 
    user=0x739220) at btl_tcp_endpoint.c:893
#7  0x00002aaaab512393 in event_persist_closure (base=0x65c4e0, ev=0x739510)
    at event.c:1321
#8  0x00002aaaab5124a2 in event_process_active_single_queue (base=0x65c4e0, 
    activeq=0x65c9d0) at event.c:1365
#9  0x00002aaaab51276f in event_process_active (base=0x65c4e0) at event.c:1440
#10 0x00002aaaab512dc2 in opal_libevent2022_event_base_loop (base=0x65c4e0, 
    flags=1) at event.c:1644
#11 0x00002aaaab4cbe89 in mca_btl_tcp_progress_thread_engine (
    obj=0x2aaaab8fc1c0 <mca_btl_tcp_progress_thread>)
    at btl_tcp_component.c:781
#12 0x0000003370a079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00000033706e8b6d in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb) 

The configuration of this run was:

"CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --without-memory-manager

The corresponding output from the failed run was:

================ test_put6 ========== Sat Feb 25 03:08:56 2017
[mpi012:12152] btl: tcp: Incoming connection from 10.2.0.7 does not match known addresses for peer
[[28385,1],0]. Drop !
[mpi007:01397] btl: tcp: Incoming connection from 10.2.0.12 does not match known addresses for peer
[[28385,1],27]. Drop !
test_put6: pml_ob1_recvreq.c:197: mca_pml_ob1_put_completion: Assertion `(uint64_t) rdma_size ==
frag->rdma_length' failed.
[mpi007:01372] *** Process received signal ***
[mpi007:01372] Signal: Aborted (6)
[mpi007:01372] Signal code:  (-6)
[mpi007:01372] [ 0] /lib64/libpthread.so.0[0x3370a0f710]
[mpi007:01372] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3370632925]
[mpi007:01372] [ 2] /lib64/libc.so.6(abort+0x175)[0x3370634105]
[mpi007:01372] [ 3] /lib64/libc.so.6[0x337062ba4e]
[mpi007:01372] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x337062bb10]
[mpi007:01372] [ 5]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libmpi.so.0(+0x22eaec)[0x2aaaaacdbaec]
[mpi007:01372] [ 6]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaacd80b6]
[mpi007:01372] [ 7]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0xca108)[0x2aaaab4d0108]
[mpi007:01372] [ 8]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c393)[0x2aaaab512393]
[mpi007:01372] [ 9]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c4a2)[0x2aaaab5124a2]
[mpi007:01372] [10]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0x10c76f)[0x2aaaab51276f]
[mpi007:01372] [11]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x298)[0x2aaaab512dc2]
[mpi007:01372] [12]
/home/mpiteam/scratches/community/2017-02-24cron/KnEw/installs/iidk/install/lib/libopen-pal.so.0(+0xc5e89)[0x2aaaab4cbe89]
[mpi007:01372] [13] /lib64/libpthread.so.0[0x3370a079d1]
[mpi007:01372] [14] /lib64/libc.so.6(clone+0x6d)[0x33706e8b6d]
[mpi007:01372] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: mpi012
  Local PID:  12152
  Peer host:  mpi007
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi007 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[mpi007:01352] 47 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[mpi007:01352] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
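
For reference, the assertion that fires is the sanity check in mca_pml_ob1_put_completion(): the size reported by the incoming FIN has to match the length the receiver recorded when it scheduled that RDMA fragment. A simplified, self-contained sketch of that check (hypothetical struct and function names, not the actual pml_ob1_recvreq.c source):

#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the frag descriptor; the real assertion at
 * pml_ob1_recvreq.c:197 compares the FIN's rdma_size against the
 * frag->rdma_length recorded when the PUT was scheduled. */
typedef struct {
    uint64_t rdma_length;
} rdma_frag_t;

static void put_completion_check(rdma_frag_t *frag, uint64_t rdma_size)
{
    /* This is the check that aborts: the two lengths disagree. */
    assert(rdma_size == frag->rdma_length);
}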

Note that 10.2.0.7 and 10.2.0.12 are valid IP addresses for MPI processes in this cluster, and actually correspond to the nodes that this SLURM MTT job was running on (mpi007 and mpi012). Here's an ifconfig from mpi002:

eth6      Link encap:Ethernet  HWaddr 24:57:20:02:50:00  
          inet addr:10.3.0.2  Bcast:10.3.255.255  Mask:255.255.0.0
          inet6 addr: fe80::2657:20ff:fe02:5000/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:543911395 errors:0 dropped:2341 overruns:0 frame:0
          TX packets:558728146 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:707847811000 (659.2 GiB)  TX bytes:779662168466 (726.1 GiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1198187857 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1198187857 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2382222734575 (2.1 TiB)  TX bytes:2382222734575 (2.1 TiB)

lom0      Link encap:Ethernet  HWaddr A4:4C:11:2A:72:68  
          inet addr:10.0.8.2  Bcast:10.0.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a64c:11ff:fe2a:7268/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:32785162 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4297768 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7981061130 (7.4 GiB)  TX bytes:682224306 (650.6 MiB)
          Memory:cad00000-cae00000 

vic20     Link encap:Ethernet  HWaddr FC:99:47:25:2C:13  
          inet addr:10.10.0.2  Bcast:10.10.255.255  Mask:255.255.0.0
          inet6 addr: fe80::fe99:47ff:fe25:2c13/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:595480496 errors:0 dropped:0 overruns:0 frame:0
          TX packets:628872360 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1160229624365 (1.0 TiB)  TX bytes:1379139113686 (1.2 TiB)

vic21     Link encap:Ethernet  HWaddr FC:99:47:25:2C:14  
          inet addr:10.2.0.2  Bcast:10.2.255.255  Mask:255.255.0.0
          inet6 addr: fe80::fe99:47ff:fe25:2c14/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:508473520 errors:0 dropped:0 overruns:0 frame:0
          TX packets:528095836 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:526753524743 (490.5 GiB)  TX bytes:607706247980 (565.9 GiB)

Hence, 10.2.0.7 and 10.2.0.12 should both be valid IP addresses for MPI processes in this job.
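
For context, the "Drop !" messages come from the TCP BTL's connection-acceptance path, which compares the source address of an incoming connection against the addresses the peer published during the modex and rejects the socket if none match. A rough sketch of that matching step (hypothetical names; the real logic lives in the btl/tcp component):

#include <arpa/inet.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of the check behind
 * "Incoming connection from X does not match known addresses for peer":
 * walk the peer's exported address list and drop the connection if the
 * source address is not in it. */
typedef struct {
    struct in_addr addr;
} peer_addr_t;

static bool addr_is_known(const struct in_addr *incoming,
                          const peer_addr_t *known, size_t n_known)
{
    for (size_t i = 0; i < n_known; ++i) {
        if (0 == memcmp(incoming, &known[i].addr, sizeof(*incoming))) {
            return true;   /* address was published by the peer: accept */
        }
    }
    return false;          /* unknown address: log the warning and drop */
}

Given the ifconfig above, 10.2.0.7 and 10.2.0.12 ought to appear in the peers' published address lists, which is what makes the drop surprising.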

I'm a little confused about what is happening at the abort point in the code: it's in the CONNECTED case, so it must be complaining about some other fragment -- not the "bad" inbound connection.
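
For reference on why the CONNECTED case matters: in the TCP endpoint recv handler, the registered PML callback -- and hence the assertion above -- is only invoked once the endpoint has finished its handshake. A rough, self-contained sketch of that dispatch (hypothetical types and names, not the actual btl_tcp_endpoint.c code):

/* Hypothetical sketch of the state dispatch in the TCP endpoint recv
 * handler: the peer address/identity check and drop happen during
 * connection setup, while the PML FIN callback (tag 'I' ->
 * mca_pml_ob1_recv_frag_callback_fin) only runs for endpoints that are
 * already CONNECTED.  So the frag that trips the assertion arrived over
 * an established connection, not over the connection that was dropped. */
typedef enum { EP_CONNECT_ACK, EP_CONNECTED, EP_CLOSED } ep_state_t;

typedef struct {
    ep_state_t state;
} tcp_endpoint_t;

static void recv_handler(tcp_endpoint_t *endpoint)
{
    switch (endpoint->state) {
    case EP_CONNECT_ACK:
        /* handshake: validate the peer before accepting traffic */
        break;
    case EP_CONNECTED:
        /* normal data path: read the frag and hand it to the callback */
        break;
    default:
        break;
    }
}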

@bosilca Can you have a look?
