Atomic migrations

## Summary

Today, migrations are not always atomic. If a migration fails, we don't have an easy way to roll back the changes.

## Problem Definition

In-place migrations are a great addition to the SDK. However, if not tested carefully, they can cause extensive problems for a node admin.
When running a migration, changes are written to disk:
+ if we add a new store, it is committed immediately
+ if someone does not use a cache-wrapped store with open and commit phases (e.g. because it would consume too much memory), changes are persisted as the migration runs.

If a migration fails after one of the operations listed above, the node is left in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node or to restore a backup.
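To make the "open and commit phase" idea concrete, here is a minimal sketch of a cache-wrapped store that buffers writes in memory and touches the underlying store only on success. All names (`KVStore`, `CacheStore`, `RunAtomically`) are hypothetical and are not the SDK's store API; the map-backed store stands in for a real database.

```go
package main

import "errors"

// KVStore is a hypothetical stand-in for a persistent key-value store.
type KVStore map[string]string

// CacheStore buffers writes in memory until Commit is called.
type CacheStore struct {
	parent  KVStore
	pending map[string]string
}

// NewCacheStore wraps parent with an empty write buffer (the "open" phase).
func NewCacheStore(parent KVStore) *CacheStore {
	return &CacheStore{parent: parent, pending: make(map[string]string)}
}

// Set records a write in the buffer without touching the parent store.
func (c *CacheStore) Set(key, value string) {
	c.pending[key] = value
}

// Get reads buffered writes first and falls back to the parent store.
func (c *CacheStore) Get(key string) (string, bool) {
	if v, ok := c.pending[key]; ok {
		return v, true
	}
	v, ok := c.parent[key]
	return v, ok
}

// Commit flushes all buffered writes to the parent store (the "commit" phase).
func (c *CacheStore) Commit() {
	for k, v := range c.pending {
		c.parent[k] = v
	}
	c.pending = make(map[string]string)
}

// RunAtomically runs migrate against a cache-wrapped view of store and
// commits only if migrate succeeds. On error the buffer is discarded.
func RunAtomically(store KVStore, migrate func(*CacheStore) error) error {
	cache := NewCacheStore(store)
	if err := migrate(cache); err != nil {
		return err
	}
	cache.Commit()
	return nil
}

// errMigrationFailed is a sample error for illustrating the failure path.
var errMigrationFailed = errors.New("migration failed")
```

The trade-off is the one noted above: every pending write lives in memory until commit, which may be too expensive for a large migration.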

## Proposal

We need to find a friendlier mechanism for backing up and restoring state around a migration.

ADR-40 will make this easy because it uses DB-level checkpoints. But that won't be implemented in 0.44.

A few proposals:
1. add a flag to cosmovisor to copy the DB (that will basically double the disk requirements).
1. add more options to migration
1. filesystem-level rollback based on a journal
1. implement an ADR-40 based checkpoint mechanism in the store; that would limit the number of supported databases and would require a DB migration for many nodes.

____

#### For Admin Use

- [x] Not duplicate issue
- [x] Appropriate labels applied
- [x] Appropriate contributors tagged
- [ ] Contributor assigned/self-assigned

Atomic migrations #9616

robert-zaremba
opened on Jun 30, 2021