fix: recover raft bootstrap from coordinator nodedown races #54

Merged: cpluss merged 5 commits into main from fix/coordinator-nodedown-race on Feb 26, 2026

cpluss (Collaborator) commented Feb 25, 2026

Summary

  • make Raft bootstrap convergence-driven instead of one-shot by adding retry scheduling when startup sync is incomplete
  • trigger sync retries on write-node nodeup and keep startup completion idempotent so repeated sync passes do not reset readiness state
  • harden coordinator group bootstrap by requiring visible replicas, surfacing bootstrap errors without crashing, and reusing a public RaftHealth.consensus_probe/0
  • add regression tests for retry scheduling, retry cancellation on convergence, and startup-complete idempotency
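The retry behavior described in the summary can be illustrated with a small, language-neutral sketch. It is written in Python rather than the project's Elixir, and it is not the code from this PR: the class `BootstrapSync`, the `sync_pass` callback, and the `retry_delay_ms` parameter are all hypothetical names. The sketch assumes a sync pass returns true once startup sync is complete, that at most one retry is outstanding at a time, and that marking startup complete twice must not announce readiness twice.

```python
from typing import Callable, Optional


class BootstrapSync:
    """Hypothetical convergence-driven bootstrap loop (illustration only)."""

    def __init__(self, sync_pass: Callable[[], bool], retry_delay_ms: int = 1000) -> None:
        self.sync_pass = sync_pass                 # returns True when startup sync is complete
        self.retry_delay_ms = retry_delay_ms
        self.pending_retry: Optional[int] = None   # delay of the single outstanding retry, if any
        self.startup_complete = False
        self.ready_transitions = 0                 # how many times readiness was announced

    def run_sync(self) -> None:
        """Run one sync pass; schedule a retry if it is incomplete."""
        self.pending_retry = None                  # this pass supersedes any pending retry
        if self.sync_pass():
            self._mark_startup_complete()          # converged: no retry stays scheduled
        else:
            self.pending_retry = self.retry_delay_ms

    def on_nodeup(self) -> None:
        """A write node became visible, so try again immediately."""
        self.run_sync()

    def on_retry_timer(self) -> None:
        """Timer callback; a no-op if the retry was cancelled in the meantime."""
        if self.pending_retry is not None:
            self.run_sync()

    def _mark_startup_complete(self) -> None:
        """Idempotent: repeated successful passes do not reset or re-announce readiness."""
        if self.startup_complete:
            return
        self.startup_complete = True
        self.ready_transitions += 1


if __name__ == "__main__":
    outcomes = iter([False, True, True])           # first pass incomplete, later passes succeed
    sync = BootstrapSync(lambda: next(outcomes))
    sync.run_sync()                                # incomplete: a retry is scheduled
    sync.on_nodeup()                               # converges and cancels the retry
    sync.on_nodeup()                               # repeated pass leaves readiness untouched
    print(sync.startup_complete, sync.pending_retry, sync.ready_transitions)
```

In an Elixir implementation the pending retry would more likely be a timer reference held in process state; the integer here only stands in for that.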

Verification

  • mix precommit

cpluss (Collaborator, Author) commented Feb 26, 2026

Validation update on latest commit f16e103:

  • mix precommit passes locally (191 tests, 0 failures, 3 skipped).
  • CI checks are green:
    • build-and-push
    • test
    • test
  • Local docker integration validation completed after explicit image build:
    • docker compose -f docker-compose.integration.yml -p raft-local build
    • docker compose -f docker-compose.integration.yml -p raft-local up -d
  • Fresh bring-up converged:
    • /health/ready on node1 returned 200 {"status":"ok","mode":"write_node"}.
    • Ops.status() on node1/node2/node3 all reported local_ready: true, with all three nodes present in ready_nodes and raft_ready_nodes.
  • Restart drill passed:
    • Restarted node1 and reconverged to all three nodes ready.
  • RYW check passed:
    • Write on node1 (create session + append seq=1).
    • Immediate reads on node2 and node3 saw the appended event (seq=1, payload text "ryw-check").
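The convergence and read-your-writes (RYW) checks above were run by hand. As a rough illustration of how they could be scripted, here is a language-neutral Python sketch. The function names `cluster_converged` and `ryw_check` are hypothetical, the status dictionaries only assume the three fields named above (`local_ready`, `ready_nodes`, `raft_ready_nodes`) with list-valued node sets, and the `append` and `read` callables stand in for whatever client the real drill used.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping

Event = Dict[str, Any]


def cluster_converged(statuses: Mapping[str, Mapping[str, Any]],
                      expected: Iterable[str]) -> bool:
    """True if every expected node reports ready and sees all expected nodes as ready."""
    want = set(expected)
    if not want <= set(statuses):                  # a node did not report at all
        return False
    for status in statuses.values():
        if status.get("local_ready") is not True:
            return False
        if not want <= set(status.get("ready_nodes", [])):
            return False
        if not want <= set(status.get("raft_ready_nodes", [])):
            return False
    return True


def ryw_check(append: Callable[[str, Event], None],
              readers: Mapping[str, Callable[[str], List[Event]]],
              session_id: str,
              event: Event) -> List[str]:
    """Append on the writer, read immediately on each reader, return nodes missing the event."""
    append(session_id, event)
    return [name for name, read in readers.items() if event not in read(session_id)]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    statuses = {n: {"local_ready": True, "ready_nodes": nodes, "raft_ready_nodes": nodes}
                for n in nodes}
    print(cluster_converged(statuses, nodes))

    log: Dict[str, List[Event]] = {}               # in-memory stand-in for replicated storage
    missing = ryw_check(
        lambda sid, ev: log.setdefault(sid, []).append(ev),
        {"node2": lambda sid: list(log.get(sid, [])),
         "node3": lambda sid: list(log.get(sid, []))},
        "session-1",
        {"seq": 1, "text": "ryw-check"},
    )
    print(missing)
```

An empty list from `ryw_check` corresponds to the passing result reported above; any node name in the list would indicate a stale read.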

Branch is clean and fully pushed.

cpluss merged commit a8c68e4 into main on Feb 26, 2026 (3 checks passed).
cpluss deleted the fix/coordinator-nodedown-race branch on February 26, 2026 at 15:48.