
Why do endpoint tcp connections close? #501

Closed
@daire-byrne

Description

Hi,

You may remember me from such great issues as #430 and a few others. We are doing large rsync transfers using MPTCP over two ISP connections between our WAN offices.

Our setup is pretty much still as described in #430 (comment)
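
For context, the endpoint and limit configuration on each host can be dumped with the usual iproute2 commands; the actual addresses and flags are as per #430, so I won't repeat them here:

    # list the configured MPTCP endpoints (addresses plus their signal/subflow/backup flags)
    ip mptcp endpoint show
    # list the per-connection limits on additional subflows and accepted ADD_ADDRs
    ip mptcp limits show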

We have just started noticing that some long-running transfers are dropping one of the TCP subflow endpoints (cause as yet unknown), and I'm also wondering if there is a way to "repair" such an MPTCP connection and get it to reconnect the lost subflow endpoints.

So here is an example of a transfer that started out, and is still, fully connected (output from ss -neimMtp):

tcp   ESTAB      191605181 0        10.11.22.251:41255 10.29.22.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:1c cgroup:unreachable:1 <->
	 skmem:(r274369408,rb536870912,t0,tb14155776,f1152,w0,o0,bl0,d1030909) ts sack bbr wscale:14,14 rto:269 rtt:68.917/2.734 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_sent:232290716 bytes_retrans:126782 bytes_acked:232163935 bytes_received:1224254311597 segs_out:310662440 segs_in:908838936 data_segs_out:174058 data_segs_in:908751378 bbr:(bw:47704152bps,mrtt:67.229,pacing_gain:2.77344,cwnd_gain:2) send 329501052bps lastsnd:66634828 lastrcv:5659 lastack:5659 pacing_rate 833177824bps delivery_rate 10445664bps delivered:173967 app_limited busy:522243ms sndbuf_limited:198785ms(38.1%) retrans:0/99 dsack_dups:7 reordering:30 reord_seen:75 rcv_rtt:85.39 rcv_space:14480 rcv_ssthresh:268435456 minrtt:67.229 rcv_ooopack:317426596 snd_wnd:1409024 tcp-ulp-mptcp flags:Jjecv token:c9d86869(id:3)/43fdc29d(id:3) seq:57f1b2d182c5fbd3 sfseq:ffd1e3f9 ssnoff:f24128df maplen:fea0                                        
--
tcp   ESTAB      4671288   0        10.11.21.251:43925 10.29.21.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:3a cgroup:unreachable:1 <->
	 skmem:(r6750208,rb536870912,t0,tb14155776,f0,w0,o0,bl0,d1129860) ts sack bbr wscale:14,14 rto:280 rtt:79.634/18.357 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2053 bytes_sent:4903092302 bytes_retrans:650433 bytes_acked:4902441870 bytes_received:636988155127 segs_out:162347207 segs_in:479160311 data_segs_out:3790272 data_segs_in:477257848 bbr:(bw:53856432bps,mrtt:66.925,pacing_gain:2.77344,cwnd_gain:2) send 285853580bps lastsnd:3540438 lastrcv:67 lastack:67 pacing_rate 929224112bps delivery_rate 4297736bps delivered:3789831 app_limited busy:23571270ms sndbuf_limited:555262ms(2.4%) retrans:0/615 dsack_dups:182 reordering:46 reord_seen:1241 rcv_rtt:118.339 rcv_space:14480 rcv_ssthresh:268435456 minrtt:66 rcv_ooopack:181820223 snd_wnd:1409024 tcp-ulp-mptcp flags:Jjecv token:c9d86869(id:2)/43fdc29d(id:2) seq:57f1b2d191d7c4cb sfseq:4f2bdec8 ssnoff:a9a5a32c maplen:fea0                                           
--
tcp   ESTAB      0         0        10.11.20.251:41220 10.29.20.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:5e cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb69120,f0,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:273 rtt:72.312/20.944 ato:40 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:15 bytes_sent:188 bytes_acked:189 bytes_received:26 segs_out:13 segs_in:11 data_segs_out:5 data_segs_in:2 bbr:(bw:330448bps,mrtt:66.991,pacing_gain:2.77344,cwnd_gain:2) send 2300033bps lastsnd:138872225 lastrcv:138872292 lastack:138872117 pacing_rate 4534744bps delivery_rate 330344bps delivered:6 app_limited busy:310ms rcv_space:14480 rcv_ssthresh:65535 minrtt:66.991 snd_wnd:131072 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/43fdc29d(id:0) seq:57f1b121cf01829c sfseq:f ssnoff:223caa90 maplen:c                                                                                                                                                                                                                                                                                     
--
mptcp ESTAB      536900736 0        10.11.20.251:41220 10.29.20.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:1016 cgroup:unreachable:1 <->
	 skmem:(r536900736,rb536870912,t0,tb28384768,f2944,w0,o0,bl0,d223999) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:3 add_addr_signal_max:2 add_addr_accepted_max:3 remote_key token:43fdc29d write_seq:6c88e3ed89a23ea0 snd_una:6c88e3ed89a23ea0 rcv_nxt:57f1b2d182c6bacb

And then, by comparison, here's an rsync transfer that started out like the one above, but at some point something happened that dropped some of the endpoints:

tcp   ESTAB      19087084  0        10.11.21.251:52205 10.21.21.251:873   users:(("rsync",pid=389561,fd=4),("rsync",pid=160553,fd=4)) ino:1010241 sk:34 cgroup:unreachable:1 <->
	 skmem:(r27906560,rb536870912,t0,tb14155776,f3584,w0,o0,bl0,d2350429) ts sack bbr wscale:14,14 rto:211 rtt:10.438/13.39 ato:80 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:271 ssthresh:222 bytes_sent:6580283182 bytes_retrans:1793620 bytes_acked:6578489563 bytes_received:1023019222134 segs_out:368384397 segs_in:776421314 data_segs_out:5112046 data_segs_in:773862276 bbr:(bw:1654946720bps,mrtt:0.759,pacing_gain:1.25,cwnd_gain:2.25) send 284137191bps lastsnd:362328 lastrcv:106 lastack:106 pacing_rate 1638397248bps delivery_rate 170334624bps delivered:5111185 app_limited busy:3606336ms sndbuf_limited:253022ms(7.0%) retrans:0/1958 dsack_dups:1096 reordering:47 reord_seen:559 rcv_rtt:119.856 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.724 rcv_ooopack:78628765 snd_wnd:393216 tcp-ulp-mptcp flags:Jjecv token:76276e2c(id:2)/ddef0e1a(id:2) seq:703fe0255718ed27 sfseq:2f8e891b ssnoff:6a8c1542 maplen:fb40                  
--
mptcp ESTAB      536868096 0        10.11.20.251:40532 10.21.20.251:873   users:(("rsync",pid=389561,fd=4),("rsync",pid=160553,fd=4)) ino:1010241 sk:1001 cgroup:unreachable:1 <->
	 skmem:(r536868096,rb536870912,t0,tb28384768,f2816,w0,o0,bl0,d90599) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:3 add_addr_signal_max:2 add_addr_accepted_max:3 remote_key token:ddef0e1a write_seq:b1ac0b4172ae433c snd_una:b1ac0b4172ae433c rcv_nxt:703fe0255719a997 

Namely, the "backup" endpoint on ens192 and the second signal (bulk transfer) endpoint on ens256 have disconnected. The transfer continues on the remaining ens224 endpoint, but at much reduced performance.
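
When I catch a connection in this degraded state the underlying paths themselves already look healthy again, so a per-path sanity check doesn't show anything; something like the below (the peer address is just taken from the surviving subflow above, substitute the address of whichever subflow actually dropped):

    # confirm both ISP-facing links are up and the remote subflow addresses are still routable/reachable
    ip -br link show
    ip route get 10.21.21.251
    ping -c 3 10.21.21.251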

Now, I have also seen some cases where the two signal (bulk transfer) subflows are still there but the backup endpoint is gone. In the case of the disconnected backup connection, is it possible that it drops out because no data traverses it and the keepalive timeout kills it? Whatever the blip might have been, all routing and connectivity seem to be working again by the time I get to it.
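
If keepalive is a plausible culprit, these would presumably be the knobs that matter (they only apply to sockets that actually have SO_KEEPALIVE set, and the defaults add up to well over two hours of idle time before a connection is killed):

    # defaults: 7200s idle before the first probe, then up to 9 probes sent 75s apart
    sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes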

One thing that does seem to happen is that all the running connections between the two hosts suffer this problem at the same time, so a blip in connectivity, or maybe TCP memory exhaustion on one host (we use large buffers), could be the underlying root cause?
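
If TCP memory pressure on one host is the trigger, it should at least leave a trace in the global counters, so something along these lines might confirm or rule it out:

    # current TCP memory usage (in pages) versus the configured low/pressure/high thresholds
    cat /proc/net/sockstat
    sysctl net.ipv4.tcp_mem
    # how many times the stack has entered memory pressure (absolute count since boot)
    nstat -az TcpExtTCPMemoryPressures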

I'm not sure how best to debug the drops, so any tips around that would be great. I'm also wondering if it's possible to "repair" such connections and force them to re-add the dropped endpoint connections? My quick searching did not turn up anything useful.
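
One thing that might help catch the moment of a drop is watching the in-kernel path-manager events, which should show subflows being established and closed (and ADD_ADDR/RM_ADDR signalling) as it happens:

    # stream MPTCP path-manager events live
    ip mptcp monitor
    # or timestamp them into a log so the exact time of a subflow close is captured
    ip mptcp monitor | while read -r line; do echo "$(date -Is) $line"; done >> /tmp/mptcp-events.log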

We are currently running kernel v6.9.3.

Many thanks.
