diff --git a/raft/README.md b/raft/README.md index 5f1f81327970..160bd134bcb2 100644 --- a/raft/README.md +++ b/raft/README.md @@ -7,41 +7,68 @@ This library provides an efficient, extensible, implementation of Raft consensus algorithm for Seastar. For more details about Raft see https://raft.github.io/ +## Terminology + +Raft PhD is using a set of terms which are widely adopted in the +industry, including this library and its documentation. The +library provides replication facilities for **state machines**. +Thus the **state machine** here and in the source is a user +application, distributed by means of Raft. +The library is implemented in a way which allows to replace/ +plug in its key components: +- communication between peers by implementing **rpc** API +- persisting the library's private state on disk, + via **persistence** class +- shared failure detection by supplying a custom + **failure detector** class +- user state machine, by passing an instance of **state + machine** class. + +Please note that the library internally implements its own +finite state machine for protocol state - class fsm. This class +shouldn't be confused with the user state machine. + ## Implementation status --------------------- - log replication, including throttling for unresponsive servers -- leader election +- managing of the user's state machine snapshots +- leader election, including the pre-voting algorithm +- non-voting members (learners) support - configuration changes using joint consensus +- read barriers +- forwarding commands to the leader ## Usage ----- -In order to use the library the application has to provide implementations -for RPC, persistence and state machine APIs, defined in raft/raft.hh. The -purpose of these interfaces is: -- provide a way to communicate between Raft protocol instances -- persist the required protocol state on disk, -a pre-requisite of the protocol correctness, -- apply committed entries to the state machine. - -While comments for these classes provide an insight into -expected guarantees they should provide, in order to provide a complying -implementation it's necessary to understand the expectations -of the Raft consistency protocol on its environment: +In order to use the library, the application has to provide implementations +for RPC, persistence and state machine APIs, defined in `raft/raft.hh`, +namely: +- `class rpc`, provides a way to communicate between Raft protocol instances, +- `class persistence` persists the required protocol state on disk, +- class `state_machine` is the actual state machine being replicated. + +A complete description of expected semantics and guarantees +is maintained in the comments for these classes and in sample +implementations. Let's list here key aspects the implementor should +bear in mind: - RPC should implement a model of asynchronous, unreliable network, in which messages can be lost, reordered, retransmitted more than once, but not corrupted. Specifically, it's an error to - deliver a message to a Raft server which was not sent to it. + deliver a message to a wrong Raft server. - persistence should provide a durable persistent storage, which survives between state machine restarts and does not corrupt - its state. + its state. The storage should contain an efficient mostly-appended-to + part containing Raft log, thousands and hundreds of thousands of entries, + and a small register-like memory are to contain Raft + term, vote and the most recent snapshot descriptor. - Raft library calls `state_machine::apply_entry()` for entries reliably committed to the replication log on the majority of servers. While `apply_entry()` is called in the order - entries are serialized in the distributed log, there is + the entries were serialized in the distributed log, there is no guarantee that `apply_entry()` is called exactly once. - E.g. when a protocol instance restarts from the persistent state, + E.g. when a server restarts from the persistent state, it may re-apply some already applied log entries. Seastar's execution model is that every object is safe to use @@ -58,17 +85,19 @@ In a nutshell: - create instances of RPC, persistence, and state machine - pass them to an instance of Raft server - the facade to the Raft cluster on this node +- call server::start() to start the server - repeat the above for every node in the cluster - use `server::add_entry()` to submit new entries - on a leader, `state_machine::apply_entries()` is called after the added + `state_machine::apply_entries()` is called after the added entry is committed by the cluster. ### Subsequent usages -Similar to the first usage, but `persistence::load_term_and_vote()` -`persistence::load_log()`, `persistence::load_snapshot()` are expected to -return valid protocol state as persisted by the previous incarnation -of an instance of class server. +Similar to the first usage, but internally `start()` calls +`persistence::load_term_and_vote()` `persistence::load_log()`, +`persistence::load_snapshot()` to load the protocol and state +machine state, persisted by the previous incarnation of this +server instance. ## Architecture bits @@ -79,18 +108,19 @@ changes: it is possible to add and remove one or multiple nodes in a single transition, or even move Raft group to an entirely different set of servers. The implementation adopts the two-step algorithm described in the original Raft paper: -- first, an entry in the Raft log with joint configuration is -committed. The joint configuration contains both old and +- first, a log entry with joint configuration is +committed. The "joint" configuration contains both old and new sets of servers. Once a server learns about a new -configuration, it immediately adopts it. +configuration, it immediately adopts it, so as soon as +the joint configuration is committed, the leader will require two +majorities - the old one and the new one - to commit new entries. - once a majority of servers persists the joint entry, a final entry with new configuration is appended to the log. If a leader is deposed during a configuration change, -the new leader carries out the transition from joint -to final configuration for it. -it carries out the transition for the prevoius leader. +a new leader carries out the transition from joint +to the final configuration. No two configuration changes could happen concurrently. The leader refuses a new change if the previous one is still in progress.