@Ant1-Provot Ant1-Provot commented Nov 19, 2025

Problem

When running very long migrations (e.g., large ALTER TABLE operations that widen an INTEGER column to BIGINT on tables with several billion rows), the process executing pgroll may crash for various reasons (OOM, connection timeouts, infrastructure issues, etc.).

Here's my migration file:

operations:
  - alter_column:
      table: mytable
      column: id
      type: bigint
      up: CAST(id AS bigint)
      down: CAST(id AS integer)

Currently, if a migration is interrupted mid-execution, the only available option is to:

  • Abort the failed migration
  • Restart the entire migration from scratch

For large tables, this can mean hours or even days of wasted work, as the migration must be completely re-run even if it was 95% complete when it crashed.

Proposed Solution

This PR introduces a proof-of-concept for allowing migrations to resume from where they left off after a crash. The implementation currently focuses on ALTER TABLE operations as a starting point.
The approach detects when a migration was interrupted and provides a mechanism to continue the operation rather than starting over, significantly reducing the time required to recover from failures during long-running migrations.

Testing

There are no automated tests yet, but you can verify the behavior manually:

  • Start a migration that takes long enough for you to kill it with a signal or Ctrl+C.
  • Start it again.

Meanwhile, you can query your database with SELECT COUNT(*) FROM "public"."instances" WHERE _pgroll_needs_backfill = false and watch the count grow by the number of rows processed in each batch.
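For repeated checks, the same count can be wrapped in a small helper. The sketch below is an assumption-laden illustration: it runs against SQLite rather than Postgres, so the flag is stored as 0/1 instead of a boolean, and `backfill_progress` is a hypothetical name, not part of pgroll.

```python
import sqlite3


def backfill_progress(conn, table):
    """Return (done, total) row counts for a table carrying the flag column."""
    # Quote the table name as an identifier, since it cannot be bound
    # as a query parameter.
    quoted = '"' + table.replace('"', '""') + '"'
    done, total = conn.execute(
        "SELECT"
        " COALESCE(SUM(CASE WHEN _pgroll_needs_backfill = 0 THEN 1 ELSE 0 END), 0),"
        " COUNT(*)"
        f" FROM {quoted}"
    ).fetchone()
    return done, total
```

Comparing `done` between two calls shows how many rows were processed in the interval, which should be a multiple of the batch size while the migration is running.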

Questions for Maintainers

  • Does this approach align with pgroll's architecture and design goals?
  • Are there specific edge cases or failure modes I should consider?
  • Would you prefer this functionality to be opt-in, or enabled by default?
  • Should this be expanded to cover other operation types beyond ALTER TABLE?

I'm happy to iterate on this implementation based on your feedback and requirements. I'm keeping this PR in draft in the meantime.

Cheers, and thanks for the tool!

ant1

@github-actions github-actions bot temporarily deployed to Docs Preview November 19, 2025 16:19 Inactive
@Ant1-Provot Ant1-Provot changed the title from "Allow long lasting migrations to restart from previous state in case of long lasting ones that would crash" to "Allow long lasting migrations to restart from previous state in case of a crash of the previous one" Nov 19, 2025
@Ant1-Provot Ant1-Provot marked this pull request as draft November 20, 2025 10:11
@Ant1-Provot Ant1-Provot marked this pull request as ready for review November 27, 2025 13:05