Maintaining session consistency in the presence of forwarding #4401
eddyashton started this conversation in Design
Closed in #4595
CCF attempts to provide both session consistency and forwarding, but the interplay of these features leads to a long-standing bug: #3952
This discussion summarizes why both features are desired, the issues with their interaction, some previous attempts to fix this (and the reasons those failed), and the approach we plan to take at the time of writing.
Session consistency
We try to ensure that the history of responses that any user sees is non-contradictory. Specifically, over a single TLS session (which defines a clear ordering of requests and responses), we want to guarantee that every response logically follows the response that came before - it does not describe an earlier state, nor does it describe a parallel state where the earlier response has been rolled back. We do this at the TLS layer, and for every connection, so that it is available as a global framework property. A similar result could be achieved in application space (ETag preconditions, monotonic request IDs, ...), but that would only apply to application endpoints which opted in, and may not generalise to new communication protocols. It should be both simpler to reason about, and faster to execute, if we can provide it within the framework.
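As a rough sketch (hypothetical types and names, not CCF's actual API), the framework-level guarantee amounts to a per-session check against the last TxID reported to that caller:

```cpp
#include <cstdint>
#include <optional>

// A transaction ID: the Term (view) in which a transaction was executed,
// and its sequence number in the ledger.
struct TxID
{
  uint64_t view;
  uint64_t seqno;
};

struct Session
{
  std::optional<TxID> last_seen = std::nullopt;

  // Returns false if a response at next would contradict what this session
  // has already been told. Comparing seqnos catches responses describing
  // an earlier state; detecting a parallel, rolled-back branch needs the
  // validity check discussed later in this post.
  bool is_consistent(const TxID& next)
  {
    if (last_seen.has_value() && next.seqno < last_seen->seqno)
    {
      return false;
    }
    last_seen = next;
    return true;
  }
};
```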
Forwarding
CCF supports a limited form of forwarding, where write requests sent to a backup are transmitted by that backup to the primary for execution. This transmission happens over the custom node-to-node channels also used for consensus traffic. The primary executes the request, produces a full response, and transmits that over the same channel in reverse to the original backup, who then sends it over the original TLS session to the caller. This is currently completely opaque to the caller - they do not know that forwarding has occurred. This allows users to speak to any node in the service, rather than needing to speak directly to the primary for writes. This simplifies user logic (speak to a single node, no knowledge of the primary/backup distinction), and removes the requirement that all nodes are directly accessible by clients (allowing the service to be hosted behind a load balancer with a single public name). Forwarding decisions are based on the metadata associated with each endpoint, provided by the app developer - any request which is marked as "may write" will be forwarded (and any attempted write on a backup results in an error).
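A minimal sketch (hypothetical names, not CCF's actual dispatch code) of that forwarding decision:

```cpp
enum class NodeRole
{
  Primary,
  Backup
};

// Per-endpoint metadata provided by the app developer.
struct EndpointMetadata
{
  bool may_write;
};

enum class Disposition
{
  ExecuteLocally,
  ForwardToPrimary
};

// Any request marked as potentially writing is forwarded when it arrives
// at a backup; everything else executes on the node which received it.
Disposition dispatch(NodeRole role, const EndpointMetadata& endpoint)
{
  if (role == NodeRole::Backup && endpoint.may_write)
  {
    // Sent to the primary over the node-to-node channels also used for
    // consensus traffic; the full response returns over the same channel,
    // so the caller never learns that forwarding happened.
    return Disposition::ForwardToPrimary;
  }
  return Disposition::ExecuteLocally;
}
```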
Maintaining consistency while forwarding
To maintain session consistency with forwarding, we implement a form of sticky forwarding. Once any request in a TLS session has been forwarded (because it may have written), all future requests on the same session will be forwarded, even when they are pure reads that could be handled locally. The common case that necessitates this is a write followed immediately by a read of the same key. To ensure the read returns the previous write, and not a stale value from before it, the read must also be forwarded to the primary who executed the write. The backup who first received the session cannot serve that read itself until it has received the write's results over consensus traffic.
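Extending the earlier dispatch sketch (same hypothetical types, not CCF's actual code), stickiness is a one-way flag on the session:

```cpp
// Reuses NodeRole, EndpointMetadata and Disposition from the sketch above.
struct StickySession
{
  bool has_forwarded = false;
};

Disposition dispatch_sticky(
  NodeRole role, const EndpointMetadata& endpoint, StickySession& session)
{
  if (
    role == NodeRole::Backup &&
    (endpoint.may_write || session.has_forwarded))
  {
    // Once anything on this session has gone to the primary, later reads
    // must too: the primary has certainly executed the earlier write,
    // while this backup may not yet have received it over consensus.
    session.has_forwarded = true;
    return Disposition::ForwardToPrimary;
  }
  return Disposition::ExecuteLocally;
}
```

Note the flag is never cleared: any locally served response after a forwarded write risks describing pre-write state.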
Points of interest
The forwarding decision depends on each node's local state - the installed endpoint metadata (including the dynamically modifiable endpoint tables which `jsgeneric` uses), and its current opinion of the primary. Both of these things may differ between nodes, including being different on the node who receives this forwarded message.
Bugs and Previous Attempts
The current implementation has a known issue: session consistency is not maintained across elections. When a node rolls back its local state, it will continue to respond to new requests, potentially returning results which are inconsistent with the previous responses on the same session. We have made a few attempts to fix this, each with its own problems.
Errors if most recent TxID is INVALID
We could track every response's dependencies automatically by looking at the `x-ms-ccf-transaction-id` header. If we store this per-session, it precisely tracks the last thing we told that caller. After executing each transaction, we can confirm that the previous TxID is still valid. If it has transitioned to `INVALID`, we have seen a rollback that invalidated their previous state. We would also need to tell this TxID to the executing node when forwarding, so it could do the validity check (before it applies non-serial writes), and receive the new TxID in the forwarding response envelope (ideally without parsing it out of the pre-serialised HTTP response). This precisely detects inconsistencies, allowing sessions to span elections and perform long-lived operations while preventing them from seeing inconsistent state. When an inconsistency is detected, we could return an HTTP error and/or kill the current session. If we keep the session alive, we must continue to report errors for all future transactions - they are still following rolled-back state. Keeping the session alive biases towards smart clients, which are able to parse and handle the error appropriately, likely by starting a new session. However it has poor behaviour for pooled clients, where arbitrary future requests may receive errors because they were unlucky enough to reuse an inconsistent session. For this reason, we should always kill a session on a consistency violation - preceded by an error response if possible, but ultimately fatal.
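A minimal sketch of this scheme (hypothetical names, with a stubbed status lookup standing in for the node's real view of history; `TxID` is the type from the first sketch):

```cpp
#include <optional>

enum class TxStatus
{
  Unknown,
  Pending,
  Committed,
  Invalid
};

// Stand-in for a lookup against this node's current history; a real
// implementation would consult the local consensus/ledger state.
TxStatus get_status(const TxID& txid)
{
  (void)txid;
  return TxStatus::Committed;
}

struct TrackedSession
{
  std::optional<TxID> last_reported = std::nullopt;

  // Called after executing each request, before sending its response.
  // Returns false if the state this caller last saw has been rolled back,
  // in which case we respond with an error and then kill the session.
  bool check_and_update(const TxID& next)
  {
    if (
      last_reported.has_value() &&
      get_status(*last_reported) == TxStatus::Invalid)
    {
      return false;
    }
    last_reported = next;
    return true;
  }
};
```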
Kill all sessions on rollback
We could kill all open TLS sessions whenever we roll back local state. This avoids any per-request or even per-stream tracking, as we can kill sessions immediately during processing of an election. This is a pessimistic approach, killing all sessions which may exhibit inconsistency in advance of them actually doing so (including sessions with no previous request, and so no ability to be inconsistent). It also contains a potential failure when a forwarding response races with a rollback on the primary: the primary may roll back before the backup does, yet still serve reads (or writes, in a future term where it is re-elected) for forwarded requests from that backup.
For example: the primary executes a forwarded request at `TxID=2.10`, responds to B who responds to C. If the primary then rolls back to before 2.10, but B has not yet seen the election, C has received a response over a session which neither node will kill in time.
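A minimal sketch (hypothetical session-manager shape, not CCF's actual host/enclave session handling) of this pessimistic approach:

```cpp
#include <memory>
#include <vector>

struct TlsSession
{
  void close()
  {
    // Tear down the underlying TLS connection.
  }
};

struct SessionManager
{
  std::vector<std::shared_ptr<TlsSession>> open_sessions;

  // Called during processing of an election, after rolling back local
  // state. Kills every session which may later exhibit inconsistency,
  // before it does so - including sessions with no previous request.
  void on_rollback()
  {
    for (auto& session : open_sessions)
    {
      session->close();
    }
    open_sessions.clear();
  }
};
```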
Serve responses from a single Term
This is the current plan. Rather than tracking each TxID, a session records the Term in which it was first created. If the node's Term ever changes, the session returns an error to the next request and then closes. We include this state when forwarding - each forwarded request carries the Term the backup expects the primary to be in, and the response carries a boolean indicating an error if the primary was not in that Term. This retains each session until the next request, so we can report an HTTP error before killing the session - we believe this plays nicely with pooling client implementations. It is still pessimistic - the Term is grabbed extremely early, before the first request has even executed, so we will report errors and kill sessions even when there is no consistency break. Bias in this direction is considered acceptable - we are aiming to safely guarantee that session consistency is always provided, and can look at providing it more efficiently in future.
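A minimal sketch of the planned scheme (hypothetical names, following the description above rather than CCF's actual types):

```cpp
#include <cstdint>

using Term = uint64_t;

// A session remembers the Term in which it was opened, and refuses to
// serve anything once the node has moved past it.
struct TermBoundSession
{
  Term created_in; // grabbed when the TLS session is first opened

  // Returns false if an election has happened since this session opened;
  // the node then sends one error response and closes the session.
  bool may_serve(Term current_term) const
  {
    return current_term == created_in;
  }
};

// When forwarding, the expected Term travels with the request, and the
// response reports whether the primary matched it.
struct ForwardedRequest
{
  Term expected_primary_term;
  // ... serialised HTTP request ...
};

struct ForwardedResponse
{
  bool term_mismatch;
  // ... pre-serialised HTTP response ...
};
```

Grabbing the Term at session creation is exactly what makes this pessimistic: a session opened moments before an election is killed even if it never issued a request.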