raft: introduce term cache

[Loading](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/kv/kvserver/logstore/logstore.go#L494-L563) an entry term is expensive. Worst case it boils down to loading a full entry (which can be large), only to read its term.

Term checks are ubiquitous:
- Every commit on the leader [does](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/raft/raft.go#L1031-L1039) one.
- Every log append [does](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/raft/log.go#L186-L225) a bunch.
- Every `StateProbe` [matching](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/raft/log.go#L227-L252) between the leader and follower does a bunch.

We are hopefully saved by the raft log [cache](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/kv/kvserver/raftentry/cache.go#L22-L78), but it doesn’t save us when some followers get behind. Also we’re interfering with other access patterns to this cache, and it would be nice to not use it for term checks since the latter can be done more efficiently.

The terms in the log are ordered, and typically have long runs like: `t1 t1 t1 t1 t1 t2 t2 t2 t2`. We could maintain a short list / “index” next to the log: a sorted slice of term **change** positions. Then getting a term of a log entry boils down to a search in this list. This list can be in memory only, or backed by storage so that we don’t start cold. It will typically be short (<= 2 term flips) because a) the log is compacted quickly, b) leader changes are rare. In memory, we could also limit it to like <= 4 latest term flips.

Once this is done, another thing to do with it is reducing `StateProbe` latency, e.g. post-election. Right now, `StateProbe` handshake can be [rejected](https://github.com/cockroachdb/cockroach/blob/a86b1274ef81e493f4cbdabe8b20d8977ab099f6/pkg/raft/raft.go#L2376-L2386) with a hint that only communicates a term of a single raft log position. So, potentially, `StateProbe` needs to do a few roundtrips before it converges to a fork point of the follower’s log and starts actually replicating. We can extend the hint: send a suffix of the term cache, so that this matching has a better chance to succeed in one go.

Thirdly, we can reduce post-election tail latency. Right now, when a leader is elected, it assumes all followers in `StateProbe`. It appends a dummy entry to the log and broadcasts a single append to each follower. Only when a follower replies to this initial append, the leader exits `StateProbe` and starts replicating actively. However, we can pipeline this probing into election: when voting, a voter can attach a short suffix of its term cache; the elected leader will likely immediately know the fork point of the follower's log and can enter `StateReplicate` immediately (at least for the quorum whose votes it received).

Finally, this allows us to make `StateProbe` messages empty. Today, we include a bunch of log entries into probe messages (optimistically, to reduce replication latency by 1 RTT). The problem is that `StateProbe` messages are not subject to admission control, so this can lead to overload/OOM (when there are many ranges probing simultaneously). With faster/pipelined probing, it no longer needs to be done.

Jira issue: CRDB-44961

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raft: introduce term cache #136296

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development