Closed
Description
Summary
Today, migrations are not always atomic. If a migration fails, we don't have an easy way to roll back the changes.
Problem Definition
In-place migrations are a great addition to the SDK. However, if not tested carefully, they can cause extensive problems for a node admin.
When running a migration, some changes are written to disk immediately:
- if we add a new store, it's committed immediately
- if a migration doesn't use a cache-wrapped store with open and commit phases (e.g. because that would consume too much memory), its writes are not buffered until a commit.
If a migration fails after one of the operations listed above, the node is left in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node or to restore a backup.
Proposal
We need to find a friendlier mechanism for handling backups.
ADR-40 will make this easy because we will use DB-level checkpoints. But that won't be implemented in 0.44.
A few proposals:
- a flag for cosmovisor to copy the DB before migrating (that will basically double the disk requirements).
- add more options to migration
- filesystem-level rollback based on a journal
- implement an ADR-40-based checkpoint mechanism in the store; that would limit the number of supported databases and would require a DB migration for many nodes.
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned