Allow long lasting migrations to restart from previous state in case of a crash of the previous one #1000

Ant1-Provot · 2025-11-19T16:19:27Z

Problem

When running very long migrations (e.g., large ALTER TABLE operations on tables with several billion rows to bump INTEGER to BIGINT), the process executing pgroll may crash for various reasons (OOM, connection timeouts, infrastructure issues, etc.).

Here's my migration file:

operations:
  - alter_column:
      table: mytable
      column: id
      type: bigint
      up: CAST(id AS bigint)
      down: CAST(id AS integer)

Currently, if a migration is interrupted mid-execution, the only available option is to:

Abort the failed migration
Restart the entire migration from scratch

For large tables, this can mean hours or even days of wasted work, as the migration must be completely re-run even if it was 95% complete when it crashed.

Proposed Solution

This PR introduces a proof-of-concept for allowing migrations to resume from where they left off after a crash. The implementation currently focuses on ALTER TABLE operations as a starting point.
The approach detects when a migration was interrupted and provides a mechanism to continue the operation rather than starting over, significantly reducing the time required to recover from failures during long-running migrations.

Testing

There are no proper tests at this point yet, but you can check the behaviour manually:

Start a migration that takes long enough for you to kill it with a signal or Ctrl+C.
Start it again.

Meanwhile, you can query your database with SELECT COUNT(*) FROM "public"."instances" WHERE _pgroll_needs_backfill = false and watch the count increase each time by the number of rows processed in each batch.
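If you run that count twice, a few seconds apart, the two values give a rough progress figure. The helper below is only a sketch of that arithmetic: the function name and its parameters are hypothetical, and it assumes you already know the total row count of the table.

```python
def backfill_progress(done_before, done_now, total_rows, interval_seconds):
    """Estimate progress from two samples of the 'already backfilled' count.

    Returns (fraction_done, rows_per_second, eta_seconds); eta_seconds is
    None when no rows were processed between the two samples.
    """
    if total_rows <= 0 or interval_seconds <= 0:
        raise ValueError("total_rows and interval_seconds must be positive")
    rate = (done_now - done_before) / interval_seconds
    fraction_done = done_now / total_rows
    remaining = total_rows - done_now
    eta_seconds = remaining / rate if rate > 0 else None
    return fraction_done, rate, eta_seconds


# Two samples taken 10 seconds apart on a table of 10,000 rows.
fraction, rate, eta = backfill_progress(1000, 3000, 10000, 10)
```

After a restart, the first sample should already be close to the last value seen before the crash, which is a quick way to confirm the migration resumed instead of starting over.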

Questions for Maintainers

Does this approach align with pgroll's architecture and design goals?
Are there specific edge cases or failure modes I should consider?
Would you prefer this functionality to be opt-in, or enabled by default?
Should this be expanded to cover other operation types beyond ALTER TABLE?

I'm happy to iterate on this implementation based on your feedback and requirements. I'm keeping this PR in draft in the meantime.

Cheers, and thanks for your tool!

ant1

Ant1-Provot added 2 commits November 19, 2025 17:07

Allow long lasting migrations to restart from previous state in case …

13135af

…of long lasting ones that would crash

Removed useless file

241b33d

github-actions bot temporarily deployed to Docs Preview November 19, 2025 16:19 Inactive

Ant1-Provot changed the title ~~Allow long lasting migrations to restart from previous state in case of long lasting ones that would crash~~ Allow long lasting migrations to restart from previous state in case of a crash of the previous one Nov 19, 2025

Ant1-Provot marked this pull request as draft November 20, 2025 10:11

Ant1-Provot marked this pull request as ready for review November 27, 2025 13:05
