Description
Loading an entry term is expensive. Worst case it boils down to loading a full entry (which can be large), only to read its term.
Term checks are ubiquitous:
- Every commit on the leader does one.
- Every log append does a bunch.
- Every
StateProbe
matching between the leader and follower does a bunch.
We are hopefully saved by the raft log cache, but it doesn’t save us when some followers get behind. Also we’re interfering with other access patterns to this cache, and it would be nice to not use it for term checks since the latter can be done more efficiently.
The terms in the log are ordered, and typically have long runs like: t1 t1 t1 t1 t1 t2 t2 t2 t2
. We could maintain a short list / “index” next to the log: a sorted slice of term change positions. Then getting a term of a log entry boils down to a search in this list. This list can be in memory only, or backed by storage so that we don’t start cold. It will typically be short (<= 2 term flips) because a) the log is compacted quickly, b) leader changes are rare. In memory, we could also limit it to like <= 4 latest term flips.
Once this is done, another thing to do with it is reducing StateProbe
latency, e.g. post-election. Right now, StateProbe
handshake can be rejected with a hint that only communicates a term of a single raft log position. So, potentially, StateProbe
needs to do a few roundtrips before it converges to a fork point of the follower’s log and starts actually replicating. We can extend the hint: send a suffix of the term cache, so that this matching has a better chance to succeed in one go.
Thirdly, we can reduce post-election tail latency. Right now, when a leader is elected, it assumes all followers in StateProbe
. It appends a dummy entry to the log and broadcasts a single append to each follower. Only when a follower replies to this initial append, the leader exits StateProbe
and starts replicating actively. However, we can pipeline this probing into election: when voting, a voter can attach a short suffix of its term cache; the elected leader will likely immediately know the fork point of the follower's log and can enter StateReplicate
immediately (at least for the quorum whose votes it received).
Finally, this allows us to make StateProbe
messages empty. Today, we include a bunch of log entries into probe messages (optimistically, to reduce replication latency by 1 RTT). The problem is that StateProbe
messages are not subject to admission control, so this can lead to overload/OOM (when there are many ranges probing simultaneously). With faster/pipelined probing, it no longer needs to be done.
Jira issue: CRDB-44961
Activity