Allow long lasting migrations to restart from previous state in case of a crash of the previous one #1000
Problem
When running very long migrations (e.g., large ALTER TABLE operations on tables with several billion rows to change an INTEGER column to BIGINT), the process executing pgroll may crash for various reasons (OOM, connection timeouts, infrastructure issues, etc.).
Currently, if a migration is interrupted mid-execution, the only available option is to:
For large tables, this can mean hours or even days of wasted work, as the migration must be completely re-run even if it was 95% complete when it crashed.
Proposed Solution
This PR introduces a proof-of-concept for allowing migrations to resume from where they left off after a crash. The implementation currently focuses on ALTER TABLE operations as a starting point.
The approach detects when a migration was interrupted and provides a mechanism to continue the operation rather than starting over, significantly reducing the time required to recover from failures during long-running migrations.
Testing
No proper tests yet, but the behavior can be checked manually:
start a migration that takes long enough for you to have time to kill it with a signal or Ctrl+C.
start again.
Meanwhile, you can query your database:
SELECT COUNT(*) FROM "public"."instances" WHERE _pgroll_needs_backfill = false
and see the count increase each time by the number of rows processed in each batch.
Questions for Maintainers
Does this approach align with pgroll's architecture and design goals?
I'm happy to iterate on this implementation based on your feedback and requirements. I'm keeping this PR in draft in the meantime.
Cheers, and thanks for your tool!
ant1