[Postgres] Resumable initial replication #150

Merged
merged 18 commits into from
Dec 5, 2024

Contributor

@rkistner rkistner commented Dec 4, 2024

This removes the reliance on a Postgres snapshot for initial replication, allowing replication to be resumed after being interrupted.

Initial replication could be interrupted because of any of the following reasons, among others:

  1. Connection failure.
  2. Replication process crashes, for example due to an out-of-memory error.
  3. Replication process restarts, for example due to a deploy or a migration to a different node.
  4. Connection timeouts, which we've observed when a query cursor is older than 5 minutes.

This allows keeping existing progress in those cases and gracefully resuming replication.

Currently, resuming can efficiently skip tables that were already replicated completely. For partially replicated tables, it skips persisting rows that were already replicated, but there is still a lot of overhead from reading those rows. On my machine with a local setup, it takes around 80ms to persist a new batch and 20ms to skip a previously replicated batch (batch size = 2000).
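The resume behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: every name here (`TableProgress`, `resumeInitialReplication`, `readBatches`, `persist`) is hypothetical, and it assumes rows are read in a stable order by a numeric `id` and that progress is stored as the last persisted `id` per table. The actual implementation may track progress differently.

```typescript
// Hypothetical types for illustration; none of these names come from the PR.
interface Row {
  id: number;
}

interface TableProgress {
  done: boolean; // true once the whole table has been replicated
  lastKey: number | null; // assumed: id of the last row persisted so far
}

function resumeInitialReplication(
  tables: string[],
  progress: Map<string, TableProgress>,
  readBatches: (table: string) => Iterable<Row[]>,
  persist: (table: string, rows: Row[]) => void
): { persistedBatches: number; skippedBatches: number } {
  let persistedBatches = 0;
  let skippedBatches = 0;

  for (const table of tables) {
    const state = progress.get(table) ?? { done: false, lastKey: null };
    progress.set(table, state);

    // Completely replicated tables are skipped without reading any rows.
    if (state.done) continue;

    for (const batch of readBatches(table)) {
      // Previously replicated rows are still read (that is the overhead),
      // but only rows past the saved position are persisted.
      const lastKey = state.lastKey;
      const fresh = lastKey === null ? batch : batch.filter((row) => row.id > lastKey);
      if (fresh.length === 0) {
        skippedBatches++;
        continue;
      }
      persist(table, fresh);
      state.lastKey = fresh[fresh.length - 1].id;
      persistedBatches++;
    }
    state.done = true;
  }
  return { persistedBatches, skippedBatches };
}
```

In this sketch, skipped batches still cost a read, which matches the observation that skipping is cheaper than persisting but not free.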

In theory, this limits us to an absolute max of 30 million rows per table if cursors time out after 5 minutes. In practice there are other overheads, so this will probably support closer to 10 million rows with that timeout. Without the 5-minute timeout, there is no hard limit.
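The 30 million figure follows from the local timings above, on the assumption that a resumed attempt has to skip every previously replicated row of a table before the cursor times out. A small worked calculation (the constant and function names are illustrative only):

```typescript
// Figures quoted in this PR description (local setup).
const batchSize = 2000; // rows per batch
const skipMsPerBatch = 20; // time to skip a previously replicated batch
const cursorTimeoutMs = 5 * 60 * 1000; // observed 5-minute cursor timeout

// Upper bound on rows that can be skipped before the cursor times out.
function maxRowsSkippable(timeoutMs: number, msPerBatch: number, rowsPerBatch: number): number {
  return Math.floor(timeoutMs / msPerBatch) * rowsPerBatch;
}

const ceiling = maxRowsSkippable(cursorTimeoutMs, skipMsPerBatch, batchSize);
console.log(ceiling); // 15,000 batches * 2,000 rows = 30,000,000 rows
```

Once a table has more replicated rows than this ceiling, a resumed attempt would time out before reaching any new rows, which is why the practical limit sits lower once other overheads are included.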

changeset-bot bot commented Dec 4, 2024

🦋 Changeset detected

Latest commit: 0f0f3ca

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 6 packages
| Name | Type |
| --- | --- |
| @powersync/service-module-postgres | Minor |
| @powersync/service-core | Minor |
| @powersync/service-image | Minor |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mysql | Patch |
| test-client | Patch |


@rkistner rkistner force-pushed the resume-initial-replication branch from 5b8bb9b to 50096e0 Compare December 4, 2024 14:27
@rkistner rkistner marked this pull request as ready for review December 5, 2024 07:44
@rkistner rkistner merged commit 62e97f3 into main Dec 5, 2024
15 checks passed
@rkistner rkistner deleted the resume-initial-replication branch December 5, 2024 16:50
Contributor Author

rkistner commented Dec 6, 2024

Results from testing a table with 4M records from Supabase:

Each attempt replicates for 5 minutes and then takes 6 minutes to time out. The attempts finished at the following numbers of rows replicated:

  1. 1163384
  2. 1265540 (may not accurately reflect the full total)
  3. 1997126
  4. 2679071
  5. Done

There are generally 11 minutes between the starts of consecutive attempts. Total time to replicate was 56 minutes.

Future work:

  1. Reduce the connection timeout to around 30 seconds.
  2. Use a Postgres cursor for the query, which may reduce how often we run into the 5-minute timeout.
