Skip to content

raft: introduce term cache #136296

Open
Open
@pav-kv

Description

Loading an entry term is expensive. Worst case it boils down to loading a full entry (which can be large), only to read its term.

Term checks are ubiquitous:

  • Every commit on the leader does one.
  • Every log append does a bunch.
  • Every StateProbe matching between the leader and follower does a bunch.

We are hopefully saved by the raft log cache, but it doesn’t save us when some followers get behind. Also we’re interfering with other access patterns to this cache, and it would be nice to not use it for term checks since the latter can be done more efficiently.

The terms in the log are ordered, and typically have long runs like: t1 t1 t1 t1 t1 t2 t2 t2 t2. We could maintain a short list / “index” next to the log: a sorted slice of term change positions. Then getting a term of a log entry boils down to a search in this list. This list can be in memory only, or backed by storage so that we don’t start cold. It will typically be short (<= 2 term flips) because a) the log is compacted quickly, b) leader changes are rare. In memory, we could also limit it to like <= 4 latest term flips.

Once this is done, another thing to do with it is reducing StateProbe latency, e.g. post-election. Right now, StateProbe handshake can be rejected with a hint that only communicates a term of a single raft log position. So, potentially, StateProbe needs to do a few roundtrips before it converges to a fork point of the follower’s log and starts actually replicating. We can extend the hint: send a suffix of the term cache, so that this matching has a better chance to succeed in one go.

Thirdly, we can reduce post-election tail latency. Right now, when a leader is elected, it assumes all followers in StateProbe. It appends a dummy entry to the log and broadcasts a single append to each follower. Only when a follower replies to this initial append, the leader exits StateProbe and starts replicating actively. However, we can pipeline this probing into election: when voting, a voter can attach a short suffix of its term cache; the elected leader will likely immediately know the fork point of the follower's log and can enter StateReplicate immediately (at least for the quorum whose votes it received).

Finally, this allows us to make StateProbe messages empty. Today, we include a bunch of log entries into probe messages (optimistically, to reduce replication latency by 1 RTT). The problem is that StateProbe messages are not subject to admission control, so this can lead to overload/OOM (when there are many ranges probing simultaneously). With faster/pipelined probing, it no longer needs to be done.

Jira issue: CRDB-44961

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

A-kv-replicationRelating to Raft, consensus, and coordination.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)C-performancePerf of queries or internals. Solution not expected to change functional behavior.T-kvKV Team

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions