Atomic migrations #9616

Closed

Description

Summary

Today, migrations are not always atomic. If a migration fails, we have no easy way to roll back the changes.

Problem Definition

In-place migrations are a great addition to the SDK. However, if not tested carefully, they can cause extensive problems for a node admin.
When running a migration, changes are written to disk in at least two cases:

  • if we add a new store, it is committed immediately
  • if a migration does not use a cache-wrapped store with open and commit phases (e.g. because buffering all writes would consume too much memory), its writes go straight to disk.

If a migration fails during one of the operations listed above, the node is left in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node or restore a backup.

Proposal

We need to find a friendlier mechanism for handling backups.

ADR-40 will make this easy because it uses DB-level checkpoints. However, that won't be implemented in 0.44.

A few proposals:

  1. add a flag to cosmovisor to copy the DB before upgrading (this will basically double the disk requirements).
  2. add more options to migrations
  3. filesystem-level rollback based on a journal
  4. implement an ADR-40-based checkpoint mechanism in the store. This would limit the number of supported databases and would require a DB migration for many nodes.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned