Description
Problem
Currently, a raft
node has a Term
field (and its storage counterpart in HardState
) which is the latest term seen by this node.
- This term may or may not have a vote from this/other nodes, and may or may not have a leader.
- The local
raftLog
, correspondingly, may or may not be consistent with the leader's log at this term, and the leader's log might not exist in the first place (if the election at this term does not win).
This uncertainty is a source of workarounds and potential safety bugs, and is generally confusing. This also leads to unnecessary message rejects/drops in periods of leader change. To combat this, let's try to bring some structure, and then improve the codebase.
TODO: add action items.
TODO: add code links.
TODO: keep writing, this is a WIP text.
TODO: convert this to a tech note / design doc.
Background
Every leader in raft
has a unique term. During election, a prospective leader (candidate) negotiates a starting point (prevIndex, prevTerm)
, and then appends a contiguous run of entries at index prevIndex+1
and higher. The set of all entries in the system thus forms a tree. Raft implements a bunch of safety rules that govern how the (prevIndex, prevTerm)
point is chosen (it has to be in the subtree of the "committed" point in this tree), and when entries are considered committed (a quorum of replicas are in the same subtree).
Consider an example of such a "tree" representing the log evolution:
term
▲
│
│ ┌→5 5 5
│ │
│ ┌→4 4
│ │
│ ┌→3 3 3 3 3 3 3 3 3 3 3 3
│ │
│ │ ┌→2 2 2 2 2 2
│ │ │
│ 1 1 1 1 1 1
└──────────────────────────────────────────► index
Every node in the raft
group is at some "point" in this tree. A node's log is the unique path in this tree from (0, 0)
to entry (term, index)
at the "tip" of the log.
The "leader log" is the path from (0, 0)
to the latest entry that this leader appended. For the picture above, here are the "leader logs" corresponding to the leader term:
t1: 1 1 1 1 1 1
t2: 1 1 1 2 2 2 2 2 2
t3: 1 1 3 3 3 3 3 3 3 3 3 3 3 3
t4: 1 1 3 3 3 4 4
t5: 1 1 3 3 3 4 4 5 5 5
Note that:
- Two leaders can observe different entries at the same index. For example, the 3rd entry in
t1
has term 1, while the leadert4
observes an entry with term 3. - The same entry can be observed by multiple leaders. For example, the 4th entry is the same in
t3
,t4
, andt5
logs.
We have now established that there are two ways to identify entries in the system:
- The canonical unique ID is
(term, index)
, whereterm
identifies the leader who created this entry. - The alternative non-unique ID is
(term, index)
whereterm
identifies the leader who observes this entry in its log.
Different parts of raft
codebase use one of these two ways, and there is no clear distinction between these 2 semantics. Some code uses just index
and implicitly assumes the term
part to be either the original entry term, or the leader log term. Such implicit assumptions complicate understanding, and may lead to correctness bugs. We may need some type safety to communicate these semantics in code.
Notably, the raft.Term
field by default has no straightforward relation to term
in (1) and (2). In the Leader
state, it can be used with either definition; in the Candidate
state, raft.Term
is a number unrelated to any entries in the log; in the Follower
state, raft.Term
first has no relation to the log, but eventually becomes in sync with the log (when the first append message from the leader succeeds) and can be used with (2).
This suggests that raftLog
data structure (which is agnostic to the node state and election) should track a more straightforward Term
that can be used in coordinate system (2) unconditionally. A sensible approach is to say that raftLog.Term
is the term of the leader whose append we accepted last. This is equivalent to behaviour of Paxos acceptors in phase one - the acceptor keeps track of the term for the latest accepted proposal. In raft
world, the raftLog
is the equivalent of a Paxos acceptor.
Currently, the raft.Term
field plays two roles:
- The term for which an election is underway or already completed.
- The upper bound for the last accepted append term. I.e.
raft.Term >= raftLog.Term
at all times.
(2) makes it safe to use raft.Term
as a substitute for the raftLog.Term
for some safety checks. For example: the raftLog
must reject appends coming from leaders < raftLog.Term
. However, we currently reject terms < raft.Term
, which is a superset of < raftLog.Term
. We may reject appends unnecessarily (when raft.Term > raftLog.Term
), but we will never accept an append incorrectly, so this is safe.
TODO: keep writing