ensure system tablets are alive when an entire dc is in a crash-loop #14323

Closed
@vporyadke

Description

System tablets (specifically, coordinators and mediators) are kept in one datacenter when possible. Suppose we deploy a problematic new version and roll it out to one datacenter first, causing every node there to get stuck in a crash-loop. The system tablets then remain in that DC and do not work, so the entire database is effectively down even though only one location has problems. This issue tracks improving the behavior in this case.

  • Run simulations to see whether the "stick system tablets together" functionality is the cause of the problems, or if it would be the same with it turned off.
  • Implement a fix based on the simulation results.
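
As a starting point for the simulation item above, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the `simulate` function, the 3x3 node layout, the staggered `up`/`down` crash-loop timing, and above all the "sticky" rule (tablets are moved back to the home DC as soon as any node there looks alive). It does not reproduce the actual placement logic, and because the sticky rule is built into the model, it only shows what such a comparison could look like rather than answering the question.

```python
def simulate(sticky, ticks=100, up=2, down=3):
    """Toy model: fraction of ticks in which every system tablet is on a live node.

    Hypothetical setup: 3 DCs x 3 nodes, and dc1 is in a crash-loop where each
    node is up for `up` ticks and then down for `down` ticks, staggered by node id.
    """
    node_dc = {n: f"dc{n // 3 + 1}" for n in range(9)}
    crashing_dc = home_dc = "dc1"  # the home DC of the tablets is the broken one

    def alive(node, t):
        if node_dc[node] != crashing_dc:
            return True
        return (t + node) % (up + down) < up

    # Both system tablets start in the home DC.
    tablets = {"coordinator": 0, "mediator": 1}
    ok_ticks = 0
    for t in range(ticks):
        # Availability is measured first, using the placement chosen earlier.
        if all(alive(node, t) for node in tablets.values()):
            ok_ticks += 1
        # Placement decisions made now take effect on the next tick.
        live = [n for n in node_dc if alive(n, t)]
        home_live = [n for n in live if node_dc[n] == home_dc]
        for name in list(tablets):
            node = tablets[name]
            if sticky and home_live:
                # Assumed sticky rule: pull the tablet back to the home DC
                # whenever a node there currently looks alive.
                if node not in home_live:
                    tablets[name] = home_live[0]
            elif not alive(node, t):
                # Otherwise only move a tablet whose node is down.
                tablets[name] = live[0]
    return ok_ticks / ticks


if __name__ == "__main__":
    for mode in (True, False):
        print(f"sticky={mode}: availability={simulate(sticky=mode):.2f}")
```

In this toy model the sticky variant keeps returning tablets to crash-looping nodes and ends up with lower availability than the non-sticky one, which leaves the broken DC after the first failures. A real simulation would need to replace the assumed rules with the scheduler's actual behavior before any conclusion is drawn.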
