Skip to content

Conversation

@matttbe
Copy link
Owner

@matttbe matttbe commented Jun 3, 2024

Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP enables seamless handovers between cellular and Wi-Fi networks while preserving established connections. This use-case is what pushed Apple to use MPTCP since 2013 in multiple applications [2]. On dual-stack hosts, Multipath TCP enables the TCP connection to automatically use the best performing path, either IPv4 or IPv6. If one path fails, MPTCP automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support it. Multipath TCP is a backward-compatible TCP extension that is enabled by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...). Multipath TCP is included in the Linux kernel since version 5.6 [3]. To use it on Linux, an application must explicitly enable it when creating the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to TCP, similar to what has been done with QUIC. By doing that, the size of __protocol_by_family has not been increased, and it behaves like TCP.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]

matttbe pushed a commit that referenced this pull request Jun 3, 2024
At least 3.9.0 version of libressl TLS stack does not behave as others stacks like quictls which
make SSL_do_handshake() return an error when no cipher could be negotiated
in addition to emit a TLS alert(0x28). This is the case when TLS_AES_128_CCM_SHA256
is forced as TLS1.3 cipher from the client side. This make haproxy enter a code
path which leads to a crash as follows:

[Switching to Thread 0x7ffff76b9640 (LWP 23902)]
0x0000000000487627 in quic_tls_key_update (qc=qc@entry=0x7ffff00371f0) at src/quic_tls.c:910
910             struct quic_kp_trace kp_trace = {
(gdb) list
905     {
906             struct quic_tls_ctx *tls_ctx = &qc->ael->tls_ctx;
907             struct quic_tls_secrets *rx = &tls_ctx->rx;
908             struct quic_tls_secrets *tx = &tls_ctx->tx;
909             /* Used only for the traces */
910             struct quic_kp_trace kp_trace = {
911                     .rx_sec = rx->secret,
912                     .rx_seclen = rx->secretlen,
913                     .tx_sec = tx->secret,
914                     .tx_seclen = tx->secretlen,
(gdb) p qc
$1 = (struct quic_conn *) 0x7ffff00371f0
(gdb) p qc->ael
$2 = (struct quic_enc_level *) 0x0
(gdb) bt
 #0  0x0000000000487627 in quic_tls_key_update (qc=qc@entry=0x7ffff00371f0) at src/quic_tls.c:910
 #1  0x000000000049bca9 in qc_ssl_provide_quic_data (len=268, data=<optimized out>, ctx=0x7ffff0047f80, level=<optimized out>, ncbuf=<optimized out>) at src/quic_ssl.c:617
 #2  qc_ssl_provide_all_quic_data (qc=qc@entry=0x7ffff00371f0, ctx=0x7ffff0047f80) at src/quic_ssl.c:688
 haproxy#3  0x00000000004683a7 in quic_conn_io_cb (t=0x7ffff0047f30, context=0x7ffff00371f0, state=<optimized out>) at src/quic_conn.c:760
 haproxy#4  0x000000000063cd9c in run_tasks_from_lists (budgets=budgets@entry=0x7ffff76961f0) at src/task.c:596
 haproxy#5  0x000000000063d934 in process_runnable_tasks () at src/task.c:876
 haproxy#6  0x0000000000600508 in run_poll_loop () at src/haproxy.c:3073
 haproxy#7  0x0000000000600b67 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
 haproxy#8  0x00007ffff7f6ae45 in start_thread () from /lib64/libpthread.so.0
 haproxy#9  0x00007ffff78254af in clone () from /lib64/libc.so.6

When a TLS alert is emitted, haproxy calls quic_set_connection_close() which sets
QUIC_FL_CONN_IMMEDIATE_CLOSE connection flag. This is this flag which is tested
by this patch to make the handshake fail even if SSL_do_handshake() does not
return an error. This test is specific to libressl and never run with
others TLS stack.

Thank you to @lgv5 and @botovq for having reported this issue in GH haproxy#2569.

Must be backported as far as 2.6.
matttbe pushed a commit that referenced this pull request Aug 20, 2024
This is the second attempt at importing the updated mt_list code (commit
59459ea3). The previous one was attempted with commit c618ed5 ("MAJOR:
import: update mt_list to support exponential back-off") but revealed
problems with QUIC connections and was reverted.

The problem that was faced was that elements deleted inside an iterator
were no longer reset, and that if they were to be recycled in this form,
they could appear as busy to the next user. This was trivially reproduced
with this:

  $ cat quic-repro.cfg
  global
          stats socket /tmp/sock1 level admin
          stats timeout 1h
          limited-quic

  frontend stats
          mode http
          bind quic4@:8443 ssl crt rsa+dh2048.pem alpn h3
          timeout client 5s
          stats uri /

  $ ./haproxy -db -f quic-repro.cfg  &

  $ h2load -c 10 -n 100000 --npn h3 https://127.0.0.1:8443/
  => hang

This was purely an API issue caused by the simplified usage of the macros
for the iterator. The original version had two backups (one full element
and one pointer) that the user had to take care of, while the new one only
uses one that is transparent for the user. But during removal, the element
still has to be unlocked if it's going to be reused.

All of this sparked discussions with Fred and Aurélien regarding the still
unclear state of locking. It was found that the lock API does too much at
once and is lacking granularity. The new version offers a much more fine-
grained control allowing to selectively lock/unlock an element, a link,
the rest of the list etc.

It was also found that plenty of places just want to free the current
element, or delete it to do anything with it, hence don't need to reset
its pointers (e.g. event_hdl). Finally it appeared obvious that the
root cause of the problem was the unclear usage of the list iterators
themselves because one does not necessarily expect the element to be
presented locked when not needed, which makes the unlock easy to overlook
during reviews.

The updated version of the list presents explicit lock status in the
macro name (_LOCKED or _UNLOCKED suffixes). When using the _LOCKED
suffix, the caller is expected to unlock the element if it intends to
reuse it. At least the status is advertised. The _UNLOCKED variant,
instead, always unlocks it before starting the loop block. This means
it's not necessary to think about unlocking it, though it's obviously
not usable with everything. A few _UNLOCKED were used at obvious places
(i.e. where the element is deleted and freed without any prior check).

Interestingly, the tests performed last year on QUIC forwarding, that
resulted in limited traffic for the original version and higher bit
rate for the new one couldn't be reproduced because since then the QUIC
stack has gaind in efficiency, and the 100 Gbps barrier is now reached
with or without the mt_list update. However the unit tests definitely
show a huge difference, particularly on EPYC platforms where the EBO
provides tremendous CPU savings.

Overall, the following changes are visible from the application code:

  - mt_list_for_each_entry_safe() + 1 back elem + 1 back ptr
    => MT_LIST_FOR_EACH_ENTRY_LOCKED() or MT_LIST_FOR_EACH_ENTRY_UNLOCKED()
       + 1 back elem

  - MT_LIST_DELETE_SAFE() no longer needed in MT_LIST_FOR_EACH_ENTRY_UNLOCKED()
      => just manually set iterator to NULL however.
    For MT_LIST_FOR_EACH_ENTRY_LOCKED()
      => mt_list_unlock_self() (if element going to be reused) + NULL

  - MT_LIST_LOCK_ELT => mt_list_lock_full()
  - MT_LIST_UNLOCK_ELT => mt_list_unlock_full()

  - l = MT_LIST_APPEND_LOCKED(h, e);  MT_LIST_UNLOCK_ELT();
    => l=mt_list_lock_prev(h); mt_list_lock_elem(e); mt_list_unlock_full(e, l)
matttbe pushed a commit that referenced this pull request Aug 20, 2024
Released version 3.1-dev3 with the following main changes :
    - BUG/MINOR: quic: Wrong datagram building when probing.
    - BUG/MEDIUM: quic: fix possible exit from qc_check_dcid() without unlocking
    - BUG/MINOR: promex: Remove Help prefix repeated twice for each metric
    - DOC: configuration: add details about crt-store in bind "crt" keyword
    - BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work with applet's buffers
    - DOC: configuration: more details about the master-worker mode
    - BUG/MEDIUM: server: fix race on server_atomic_sync()
    - BUG/MINOR: jwt: don't try to load files with HMAC algorithm
    - CLEANUP: quic: cleanup prototypes related to CIDs handling
    - CLEANUP: quic: remove non-existing quic_cid_tree definition
    - MINOR: quic: remove access to CID global tree outside of quic_cid module
    - REORG: quic: remove quic_cid_trees reference from proto_quic
    - MINOR: quic: add 2 BUG_ON() on datagram dispatch
    - MINOR: quic: ensure quic_conn is never removed on thread affinity rebind
    - MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
    - DOC: configuration: update maxconn description
    - MINOR: proto: extend connection thread rebind API
    - BUG/MEDIUM: quic: prevent crash on accept queue full
    - BUG/MEDIUM: peers: Fix crash when syncing learn state of a peer without appctx
    - CI: add weekly QUIC Interop regression against LibreSSL
    - DEV: flags/quic: decode quic_conn flags
    - MINOR: quic: rename "ssl error" trace
    - BUG/MEDIUM: init: fix fd_hard_limit default in compute_ideal_maxconn
    - BUG/MINOR: jwt: fix variable initialisation
    - MINOR: ssl/sample: ssl_c_san returns a comma separated list of SAN
    - OPTIM: pool: improve needed_avg cache line access pattern
    - MAJOR: import: update mt_list to support exponential back-off (try #2)
    - CI: weekly QUIC Interop: try to fix private image
    - BUG/MINOR: h1: Fail to parse empty transfer coding names
    - BUG/MINOR: h1: Reject empty coding name as last transfer-encoding value
    - BUG/MEDIUM: h1: Reject empty Transfer-encoding header
    - BUG/MEDIUM: spoe: Be sure to create a SPOE applet if none on the current thread
    - BUILD: listener: silence a build warning about unused value without threads
    - DOC: architecture: remove the totally outdated architecture manual
    - SCRIPTS: create-release: no more need to skip architecture.txt
matttbe pushed a commit that referenced this pull request Aug 20, 2024
…handling)

Now that mt_list v2 api was merged into haproxy's codebase in 4e65fc6 ("
MAJOR: import: update mt_list to support exponential back-off (try #2)"),
let's fix a hack in cli_parse_delete_server() which abused from mt_list
api to migrate an element from one list to another: there used to be a
tiny race there between the pop and the append operations, race that was
compensated by the fact that it was performed under full thread isolation.

However that was a bad example of the mt_list API which could have
resulted in actual bug if the code was duplicated elsewhere without thread
isolation. To fix this, we now make use of the
MT_LIST_FOR_EACH_ENTRY_LOCKED() macro which allows us to simply migrate
the current element to another list since the element is appended into
another one while still in busy state and then unlinked from the original
list.
@Aperence Aperence force-pushed the mptcp-addr branch 10 times, most recently from 84616a2 to 29979bc Compare August 23, 2024 07:38
Copy link
Owner Author

@matttbe matttbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aperence thanks! once last thing related to the code style. Then I think you are ready to send that :)

@Aperence Aperence force-pushed the mptcp-addr branch 9 times, most recently from 1196925 to 0b699c0 Compare August 26, 2024 07:43
Copy link
Owner Author

@matttbe matttbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aperence I'm not sure why there are only 2 commits left, but here is a quick preview. I guess you don't use a size of 8 char for one tab.

sk = str2sa_range(args[1], NULL, &port1, &port2, NULL, NULL, NULL,
&errmsg, NULL, NULL, PA_O_RESOLVE | PA_O_PORT_OK | PA_O_STREAM | PA_O_CONNECT);
&errmsg, NULL, NULL, NULL,
PA_O_RESOLVE | PA_O_PORT_OK | PA_O_STREAM | PA_O_CONNECT);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alignment doesn't look OK

@Aperence
Copy link
Collaborator

@Aperence I'm not sure why there are only 2 commits left, but here is a quick preview. I guess you don't use a size of 8 char for one tab.

The tests are now failing (all of them), so I'm trying to spot which commit inserted the error...

Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.
Add a new field alt_proto to the server structures that
specify if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.
Use the protocol configured for a connection when creating the socket,
instead of always using 0.
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: haproxy#1028 [4]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants