Maintaining session consistency in the presence of forwarding #4401
eddyashton started this conversation in Design
Closed in #4595
CCF attempts to provide both session consistency and forwarding, but the interplay of these features leads to a long-standing bug: #3952
This discussion summarizes why both features are desired, the issues with their interaction, some previous attempts to fix this (and the reasons those failed), and the approach we plan to take at the time of writing.
Session consistency
We try to ensure that the history of responses that any user sees is non-contradictory. Specifically, over a single TLS session (which defines a clear ordering of requests and responses), we want to guarantee that every response logically follows the response that came before - it does not describe an earlier state, nor does it describe a parallel state where the earlier response has been rolled back. We do this at the TLS layer, and for every connection, so that it is available as a global framework property. A similar result could be achieved in application space (ETag preconditions, monotonic request IDs, ...), but that would only apply to application endpoints which opted in, and may not generalise to new communication protocols. It should be both simpler to reason about, and faster to execute, if we can provide it within the framework.
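As a rough sketch (hypothetical types and names, not CCF's actual API), the framework-level guarantee amounts to a per-session check against the last TxID reported to that caller:

```cpp
#include <cstdint>
#include <optional>

// A transaction ID: the Term (view) in which a transaction was executed,
// and its sequence number in the ledger.
struct TxID
{
  uint64_t view;
  uint64_t seqno;
};

struct Session
{
  std::optional<TxID> last_seen = std::nullopt;

  // Returns false if a response at next would contradict what this session
  // has already been told. Comparing seqnos catches responses describing
  // an earlier state; detecting a parallel, rolled-back branch needs the
  // validity check discussed later in this post.
  bool is_consistent(const TxID& next)
  {
    if (last_seen.has_value() && next.seqno < last_seen->seqno)
    {
      return false;
    }
    last_seen = next;
    return true;
  }
};
```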
Forwarding
CCF supports a limited form of forwarding, where write requests sent to a backup are transmitted by that backup to the primary for execution. This transmission happens over the custom node-to-node channels also used for consensus traffic. The primary executes the request, produces a full response, and transmits that over the same channel in reverse to the original backup, who then sends it over the original TLS session to the caller. This is currently completely opaque to the caller - they do not know that forwarding has occurred. This allows users to speak to any node in the service, rather than needing to speak directly to the primary for writes. This simplifies user logic (speak to a single node, no knowledge of the primary/backup distinction), and removes the requirement that all nodes are directly accessible by clients (allowing the service to be hosted behind a load balancer with a single public name). Forwarding decisions are based on the metadata associated with each endpoint, provided by the app developer - any request which is marked as "may write" will be forwarded (and any attempted write on a backup results in an error).
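A minimal sketch (hypothetical names, not CCF's actual dispatch code) of that forwarding decision:

```cpp
enum class NodeRole
{
  Primary,
  Backup
};

// Per-endpoint metadata provided by the app developer.
struct EndpointMetadata
{
  bool may_write;
};

enum class Disposition
{
  ExecuteLocally,
  ForwardToPrimary
};

// Any request marked as potentially writing is forwarded when it arrives
// at a backup; everything else executes on the node which received it.
Disposition dispatch(NodeRole role, const EndpointMetadata& endpoint)
{
  if (role == NodeRole::Backup && endpoint.may_write)
  {
    // Sent to the primary over the node-to-node channels also used for
    // consensus traffic; the full response returns over the same channel,
    // so the caller never learns that forwarding happened.
    return Disposition::ForwardToPrimary;
  }
  return Disposition::ExecuteLocally;
}
```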
Maintaining consistency while forwarding
To maintain session consistency with forwarding, we implement a form of sticky forwarding. Once any request in a TLS session has been forwarded (because it may have written), all future requests on the same session will be forwarded, even when they are pure reads that could be handled locally. The common case that necessitates this is a write followed immediately by a read of the same key. To ensure the read returns the previous write, and not a stale value from before it, the read must also be forwarded to the primary who executed the write. The backup who first received the session cannot serve that read itself until it has received the write's results over consensus traffic.
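Extending the earlier dispatch sketch (same hypothetical types, not CCF's actual code), stickiness is a one-way flag on the session:

```cpp
// Reuses NodeRole, EndpointMetadata and Disposition from the sketch above.
struct StickySession
{
  bool has_forwarded = false;
};

Disposition dispatch_sticky(
  NodeRole role, const EndpointMetadata& endpoint, StickySession& session)
{
  if (
    role == NodeRole::Backup &&
    (endpoint.may_write || session.has_forwarded))
  {
    // Once anything on this session has gone to the primary, later reads
    // must too: the primary has certainly executed the earlier write,
    // while this backup may not yet have received it over consensus.
    session.has_forwarded = true;
    return Disposition::ForwardToPrimary;
  }
  return Disposition::ExecuteLocally;
}
```

Note the flag is never cleared: any locally served response after a forwarded write risks describing pre-write state.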
Points of interest
The forwarding decision depends on each node's local state - the installed endpoint metadata (including the dynamically modifiable endpoint tables which `jsgeneric` uses), and its current opinion of the primary. Both of these things may differ between nodes, including being different on the node who receives this forwarded message.
Bugs and Previous Attempts
The current implementation has a known issue: session consistency is not maintained across elections. When a node rolls back its local state, it will continue to respond to new requests, potentially returning results which are inconsistent with the previous responses on the same session. We have made a few attempts to fix this, each with its own problems.
Errors if most recent TxID is INVALID
We could track every response's dependencies automatically by looking at the `x-ms-ccf-transaction-id` header. If we store this per-session, it precisely tracks the last thing we told that caller. After executing each transaction, we can confirm that the previous TxID is still valid. If it has transitioned to `INVALID`, we have seen a rollback that invalidated their previous state. We would also need to tell this TxID to the executing node when forwarding, so it could do the validity check (before it applies non-serial writes), and receive the new TxID in the forwarding response envelope (ideally without parsing it out of the pre-serialised HTTP response). This precisely detects inconsistencies, allowing sessions to span elections and perform long-lived operations while preventing them from seeing inconsistent state. When an inconsistency is detected, we could return an HTTP error and/or kill the current session. If we keep the session alive, we must continue to report errors for all future transactions - they are still following rolled-back state. Keeping the session alive biases towards smart clients, which are able to parse and handle the error appropriately, likely by starting a new session. However it has poor behaviour for pooled clients, where arbitrary future requests may receive errors because they were unlucky enough to reuse an inconsistent session. For this reason, we should always kill a session on a consistency violation - preceded by an error response if possible, but ultimately fatal.
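A minimal sketch of this scheme (hypothetical names, with a stubbed status lookup standing in for the node's real view of history; `TxID` is the type from the first sketch):

```cpp
#include <optional>

enum class TxStatus
{
  Unknown,
  Pending,
  Committed,
  Invalid
};

// Stand-in for a lookup against this node's current history; a real
// implementation would consult the local consensus/ledger state.
TxStatus get_status(const TxID& txid)
{
  (void)txid;
  return TxStatus::Committed;
}

struct TrackedSession
{
  std::optional<TxID> last_reported = std::nullopt;

  // Called after executing each request, before sending its response.
  // Returns false if the state this caller last saw has been rolled back,
  // in which case we respond with an error and then kill the session.
  bool check_and_update(const TxID& next)
  {
    if (
      last_reported.has_value() &&
      get_status(*last_reported) == TxStatus::Invalid)
    {
      return false;
    }
    last_reported = next;
    return true;
  }
};
```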
Kill all sessions on rollback
We could kill all open TLS sessions whenever we roll back local state. This avoids any per-request or even per-stream tracking, as we can kill sessions immediately during processing of an election. This is a pessimistic approach, killing all sessions which may exhibit inconsistency in advance of them actually doing so (including sessions with no previous request, and so no ability to be inconsistent). It also contains a potential failure when a forwarding response races with a rollback on the primary: the primary may roll back before the backup does, yet still serve reads (or writes, in a future term where it is re-elected) for forwarded requests from that backup.
For example: the primary executes a forwarded request at `TxID=2.10`, responds to B who responds to C. If the primary then rolls back to before 2.10, but B has not yet seen the election, C has received a response over a session which neither node will kill in time.
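A minimal sketch (hypothetical session-manager shape, not CCF's actual host/enclave session handling) of this pessimistic approach:

```cpp
#include <memory>
#include <vector>

struct TlsSession
{
  void close()
  {
    // Tear down the underlying TLS connection.
  }
};

struct SessionManager
{
  std::vector<std::shared_ptr<TlsSession>> open_sessions;

  // Called during processing of an election, after rolling back local
  // state. Kills every session which may later exhibit inconsistency,
  // before it does so - including sessions with no previous request.
  void on_rollback()
  {
    for (auto& session : open_sessions)
    {
      session->close();
    }
    open_sessions.clear();
  }
};
```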
Serve responses from a single Term
This is the current plan. Rather than tracking each TxID, a session records the Term in which it was first created. If the node's Term ever changes, the session returns an error to the next request and then closes. We include this state when forwarding - each forwarded request carries the Term the backup expects the primary to be in, and the response carries a boolean indicating an error if the primary was not in that Term. This retains each session until the next request, so we can report an HTTP error before killing the session - we believe this plays nicely with pooling client implementations. It is still pessimistic - the Term is grabbed extremely early, before the first request has even executed, so we will report errors and kill sessions even when there is no consistency break. Bias in this direction is considered acceptable - we are aiming to safely guarantee that session consistency is always provided, and can look at providing it more efficiently in future.
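A minimal sketch of the planned scheme (hypothetical names, following the description above rather than CCF's actual types):

```cpp
#include <cstdint>

using Term = uint64_t;

// A session remembers the Term in which it was opened, and refuses to
// serve anything once the node has moved past it.
struct TermBoundSession
{
  Term created_in; // grabbed when the TLS session is first opened

  // Returns false if an election has happened since this session opened;
  // the node then sends one error response and closes the session.
  bool may_serve(Term current_term) const
  {
    return current_term == created_in;
  }
};

// When forwarding, the expected Term travels with the request, and the
// response reports whether the primary matched it.
struct ForwardedRequest
{
  Term expected_primary_term;
  // ... serialised HTTP request ...
};

struct ForwardedResponse
{
  bool term_mismatch;
  // ... pre-serialised HTTP response ...
};
```

Grabbing the Term at session creation is exactly what makes this pessimistic: a session opened moments before an election is killed even if it never issued a request.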