Skip to content

Commit

Permalink
raft: style edits of README.md.
Browse files Browse the repository at this point in the history
  • Loading branch information
kostja committed Sep 27, 2021
1 parent de2beac commit 0e63e99
Showing 1 changed file with 58 additions and 28 deletions.
86 changes: 58 additions & 28 deletions raft/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,41 +7,68 @@ This library provides an efficient, extensible, implementation of
Raft consensus algorithm for Seastar.
For more details about Raft see https://raft.github.io/

## Terminology

Raft PhD is using a set of terms which are widely adopted in the
industry, including this library and its documentation. The
library provides replication facilities for **state machines**.
Thus the **state machine** here and in the source is a user
application, distributed by means of Raft.
The library is implemented in a way which allows to replace/
plug in its key components:
- communication between peers by implementing **rpc** API
- persisting the library's private state on disk,
via **persistence** class
- shared failure detection by supplying a custom
**failure detector** class
- user state machine, by passing an instance of **state
machine** class.

Please note that the library internally implements its own
finite state machine for protocol state - class fsm. This class
shouldn't be confused with the user state machine.

## Implementation status
---------------------
- log replication, including throttling for unresponsive
servers
- leader election
- managing of the user's state machine snapshots
- leader election, including the pre-voting algorithm
- non-voting members (learners) support
- configuration changes using joint consensus
- read barriers
- forwarding commands to the leader

## Usage
-----

In order to use the library the application has to provide implementations
for RPC, persistence and state machine APIs, defined in raft/raft.hh. The
purpose of these interfaces is:
- provide a way to communicate between Raft protocol instances
- persist the required protocol state on disk,
a pre-requisite of the protocol correctness,
- apply committed entries to the state machine.

While comments for these classes provide an insight into
expected guarantees they should provide, in order to provide a complying
implementation it's necessary to understand the expectations
of the Raft consistency protocol on its environment:
In order to use the library, the application has to provide implementations
for RPC, persistence and state machine APIs, defined in `raft/raft.hh`,
namely:
- `class rpc`, provides a way to communicate between Raft protocol instances,
- `class persistence` persists the required protocol state on disk,
- class `state_machine` is the actual state machine being replicated.

A complete description of expected semantics and guarantees
is maintained in the comments for these classes and in sample
implementations. Let's list here key aspects the implementor should
bear in mind:
- RPC should implement a model of asynchronous, unreliable network,
in which messages can be lost, reordered, retransmitted more than
once, but not corrupted. Specifically, it's an error to
deliver a message to a Raft server which was not sent to it.
deliver a message to a wrong Raft server.
- persistence should provide a durable persistent storage, which
survives between state machine restarts and does not corrupt
its state.
its state. The storage should contain an efficient mostly-appended-to
part containing Raft log, thousands and hundreds of thousands of entries,
and a small register-like memory are to contain Raft
term, vote and the most recent snapshot descriptor.
- Raft library calls `state_machine::apply_entry()` for entries
reliably committed to the replication log on the majority of
servers. While `apply_entry()` is called in the order
entries are serialized in the distributed log, there is
the entries were serialized in the distributed log, there is
no guarantee that `apply_entry()` is called exactly once.
E.g. when a protocol instance restarts from the persistent state,
E.g. when a server restarts from the persistent state,
it may re-apply some already applied log entries.

Seastar's execution model is that every object is safe to use
Expand All @@ -58,17 +85,19 @@ In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of Raft server - the facade to the Raft cluster
on this node
- call server::start() to start the server
- repeat the above for every node in the cluster
- use `server::add_entry()` to submit new entries
on a leader, `state_machine::apply_entries()` is called after the added
`state_machine::apply_entries()` is called after the added
entry is committed by the cluster.

### Subsequent usages

Similar to the first usage, but `persistence::load_term_and_vote()`
`persistence::load_log()`, `persistence::load_snapshot()` are expected to
return valid protocol state as persisted by the previous incarnation
of an instance of class server.
Similar to the first usage, but internally `start()` calls
`persistence::load_term_and_vote()` `persistence::load_log()`,
`persistence::load_snapshot()` to load the protocol and state
machine state, persisted by the previous incarnation of this
server instance.

## Architecture bits

Expand All @@ -79,18 +108,19 @@ changes: it is possible to add and remove one or multiple
nodes in a single transition, or even move Raft group to an
entirely different set of servers. The implementation adopts
the two-step algorithm described in the original Raft paper:
- first, an entry in the Raft log with joint configuration is
committed. The joint configuration contains both old and
- first, a log entry with joint configuration is
committed. The "joint" configuration contains both old and
new sets of servers. Once a server learns about a new
configuration, it immediately adopts it.
configuration, it immediately adopts it, so as soon as
the joint configuration is committed, the leader will require two
majorities - the old one and the new one - to commit new entries.
- once a majority of servers persists the joint
entry, a final entry with new configuration is appended
to the log.

If a leader is deposed during a configuration change,
the new leader carries out the transition from joint
to final configuration for it.
it carries out the transition for the prevoius leader.
a new leader carries out the transition from joint
to the final configuration.

No two configuration changes could happen concurrently. The leader
refuses a new change if the previous one is still in progress.
Expand Down

0 comments on commit 0e63e99

Please sign in to comment.