Description
The cluster state contains important metadata about the cluster, including what the mappings look like, what settings the indices have, which shards are allocated to which nodes, etc. Inconsistencies in the cluster state can have the most horrid consequences including inconsistent search results and data loss, and the job of the cluster state coordination subsystem is to prevent any such inconsistencies. Ideally this subsystem should also be easy to configure correctly and it should perform well in a variety of situations.
The goal of this project is to rebuild the cluster state coordination subsystem, making it more reliable, performant and user-friendly. Better reliability will be achieved by basing the core algorithm on strong theoretical underpinnings and extensive testing. Misconfiguration of the minimum_master_nodes
setting, one of the most common causes for cluster state inconsistencies, will be addressed by having this property fully managed by the system itself.
We've built a prototype to validate the approach and, based on our experience with this, present the following development roadmap for this new cluster coordination and consensus layer, targeting ES 7.0:
- core algorithm: Adds term and voting configuration to cluster state (Add term and config to cluster state #32100) and directly implements the transition rules from the spec (Add core coordination algorithm for cluster state publishing #32171)
- node discovery: builds list of peers based on UnicastHostProviders, establishes permanent connections and identifies active master nodes ([Zen2] Introduce gossip-like discovery of master nodes #32246, [Zen2] Add UnicastConfiguredHostsResolver #32642, [Zen2] Add HandshakingTransportAddressConnector #32643, [Zen2] Add PeerFinder#onFoundPeersUpdated #32939, Allow extension of CapturingTransport by subclasses #33012, Fix racy use of ConcurrentHashMap #36603)
- cluster state publication: Pipeline to publish a single cluster state to the other nodes in the cluster (Zen2: Cluster state publication pipeline #32584)
- deterministic / unit-testable MasterService (& async Discovery.publish method): increases testability of MasterService and the discovery layer (Zen2: Deterministic MasterService #32493)
- election scheduling ([Zen2] Introduce ElectionScheduler #32846) and prevoting ([Zen2] Introduce PreVoteCollector #32847), avoiding duelling elections
- leader election & lifecycle modes (Candidate, Leader, Follower): introduces the basic lifecycle states that a node can go through (Zen2: Add leader-side join handling logic #33013, [Zen2] Implement basic cluster formation #33668)
- node joining (voting and non-voting join), adding joining node to cluster state (Zen2: Add leader-side join handling logic #33013)
- leader & follower failure detector ([Zen2] Introduce LeaderChecker #33024, [Zen2] Introduce FollowersChecker #33917, Integrate LeaderChecker with Coordinator #34049, Zen2: Move NodeRemovalClusterStateTaskExecutor out of ZenDiscovery #34147)
- deterministic testing of cluster formation ([Zen2] Implement basic cluster formation #33668, Zen2: Add DisruptableMockTransport #33713, [Zen2] Logging improvements in CoordinatorTests #33991, [Zen2] Fix CoordinatorTests #34002, Fix CoordinatorTests some more #34039, [Zen2] Simulate scheduling delays #34181, [Zen2] Add safety phase to CoordinatorTests #34241)
- term bumping, ensuring all followers have voted for the leader ([Zen2] Gather votes from all nodes #34335, [Zen2] Fix bugs in fixLag() #34346)
- node leaving cluster on disconnect or failure (Zen2: Fail fast on disconnects #34503)
- auto-reconfiguration rules: provides the basis for the cluster to stay highly available by shifting votes from unavailable to available nodes. ([Zen2] Calculate optimal cluster configuration #33924, [Zen2] Reconfigure cluster as its membership changes #34592, [Zen2] Introduce auto_shrink_voting_configuration setting #35217)
- transport layer: transport actions & mock transport for unit testing (Zen2: Add DisruptableMockTransport #33713)
- storage layer: diff-based storage for the cluster state and current term (Zen2 ClusterState storage #33958)
- cluster state application (& acking): Apply a committed cluster state on each node and acknowledge that it has been applied (Zen2: Add Cluster State Applier #34257, [Zen2] Minor housekeeping of tests #34315)
- cluster bootstrap method to inject an initial state + configuration ([Zen2] Add low-level bootstrap implementation #34345, Introduce transport API for cluster bootstrapping #34961, [Zen2] Allow Setting a List of Bootstrap Nodes to Wait for #35847)
- voting configuration exclusions API, which allows to safely scale down a cluster from 2 nodes to 1 ([Zen2] Introduce vote withdrawal #35446, [Zen2] Implement Tombstone REST APIs #36007, Zen2: Rename voting tombstones to exclusions #36226)
- auto-bootstrapping and auto-scaling in integration tests based on bootstrap / voting configuration exclusions API (Zen2: Add infrastructure for integration tests #34365, [Zen2] Introduce vote withdrawal #35446, [Zen2] Introduce ClusterBootstrapService #35488, Zen2: Move most integration tests to Zen2 #35678, Zen2: Move disruption tests to Zen2 #35724)
- diff-based cluster state publishing (Zen2: Add diff-based publishing #35290, [Zen2] Fix test failures in diff-based publishing #35684)
- lag detection: remove nodes from cluster when they fall too far behind the master ([Zen2] Add lag detector #35685)
- state recovery / recover_after* settings ([Zen2] Implement state recovery #36013)
- support for (rolling) upgrades from 6.x (Zen2: Add basic Zen1 transport-level BWC #35443, [Zen2] Support rolling upgrades from Zen1 #35737)
- best-effort auto-bootstrapping on unconfigured discovery to provide good OOTB experience [Zen2] Best-effort cluster formation if unconfigured #36215
- correctly respect the
no_master_block
setting ([Zen2] Respect the no_master_block setting #36478) - Introduce deterministic task queue [Zen2] Introduce deterministic task queue #32197
- Randomized testing of CoordinationState [Zen2] Randomized testing of CoordinationState #32242
- Fix JoinTaskExecutor identity issue Zen2: Extract JoinTaskExecutor #32911
- Fix node logging issue Zen2: Add node id to log output of CoordinatorTests #33929
- Output voting tombstones in XContent representation of cluster state ([Zen2] VotingTombstone class #35832)
- Add
zen2
discovery type (Introducezen2
discovery type #36298) - integration with master-ineligible nodes ([Zen2] Only elect master-eligible nodes #35996, Zen2: Persist cluster states the old way on non-master-eligible nodes #36247)
- Full cluster restart upgrade, initial election does not use proper CS version (-> use metadata version instead) ([Zen2] Elect freshest master in upgrade #37122)
- Add restarts to CoordinatorTests (Mock connections more accurately in DisruptableMockTransport #37296)
- node join validation (Zen2: Add join validation #37203)
- Migrate Zen2 unit tests from InMemoryPersistedState to GatewayMetaState (Replace InMemoryPersistedState with GatewayMetaState in CoordinatorTests #36897)
- Relax bootstrapping to work on discovery of a quorum of the nodes, rather than on all of them. Use a placeholder ID for the unknown nodes. (Bootstrap a Zen2 cluster once quorum is discovered #37463)
- unsafe recover API / command line tool: To be used when a quorum of master-eligible nodes has been permanently lost ( Add tool elasticsearch-node unsafe-bootstrap #37696, Add elasticsearch-node detach-cluster tool #37979, Add elasticsearch-node tool docs #37812)
- handling dangling indices and nodes that were previously part of another cluster (Enforce cluster UUIDs #37775)
- Only have node as master that have active vote (Only bootstrap and elect node in current voting configuration #37712, Step down as master when configured out of voting configuration #37802)
- Additional bootstrapping methods? Check whether we have a good story for all typical deployment systems (docker, kubernetes, ...)
- Security model (voting exclusions API associated with
manage
cluster privilege) - Remove the need for
minimum_master_nodes
in a rolling upgrade, instead using theminimum_master_nodes
from the previous master for bootstrapping. ( Use m_m_nodes from Zen1 master for Zen2 bootstrap #37701) - Bubble exceptions up in ClusterApplierService (Bubble exceptions up in ClusterApplierService #37729)
- Prioritize publishing cluster state to master-eligible nodes (Publish cluster state to masters first #37673)
After 7.0 FF:
- Deprecate any Zen1-specific settings and rename any others that mention
zen
but which are still in use. (Deprecate unused Zen1 settings #38289,Rename static Zen1 settings #38333,Rename no-master-block setting #38350) - Make
discovery.type
non-configurable/internal-only / move Zen1 to tests only (Remove Zen1 #39466) - Scaling tests (e.g. election clashes when having large cluster states)
- Do not close bad indices on state recovery (Do not close bad indices on startup #39500)
- Add stats (e.g. expose stuff like node term, or discovery information while the node has troubles forming / joining a cluster) ([Zen2] Add warning if cluster fails to form fast enough #35993)
- Contemplate timeouts, retries, etc. and consider improvements to default values (Decrease leader and follower check timeout #38298)
- Check logged messages are useful and at the appropriate levels (Do not log unsuccessful join attempt each time #39756, Reduce logging noise when stepping down as master before state recovery #39950).
- Docs 📜 ([Zen2] Update documentation for Zen2 #34714, Move 'lost cluster state updates' issue to DONE #36959, [DOCS] Adds overview and API ref for cluster voting configurations #36954, Remove duplicate paragraph #36942, [DOCS] Merges list of discovery and cluster formation settings #36909) also docs for full-cluster and rolling upgrades
Post 7.0:
- Smoother master failovers by not exposing those to the ClusterApplierService, i.e., delay putting up a NO_MASTER_BLOCK.
- Abdicate on leader shutdown (appoint new leader)
- Add "has_voting_exclusions" flag to cluster health output (Add has_voting_exclusions flag to cluster health output #38568)
- Enqueueing cluster state updates to behave as well as possible in an overloaded cluster.
- Verify that a master which cannot write its cluster state stands down (or maybe actively abdicates)
- Deal appropriately with duplicate nodes (see e.g. NotMasterException with duplicate node ids and minimum_master_nodes not met #32904)
- High-level rest client integration for new APIs
- Avoid bootstrapping if any discovered peer has a nonzero term
- Work with support to enhance cluster diagnostics analysis tool.