
[MongoDB] Fix replication batching #271

Merged
merged 4 commits into main from fix-replication-batching on May 28, 2025

Conversation

@rkistner (Contributor) commented May 27, 2025

Background

When replicating from a MongoDB database, we use the _powersync_checkpoints collection for multiple purposes:

  1. To detect the end of a transaction.
  2. To create write checkpoints.
  3. To dynamically "batch" updates from multiple transactions efficiently.

This change specifically concerns the last point - batching. The process works as follows:

  1. Whenever we receive a change and haven't started a "batch" yet, we create a new "checkpoint" document.
  2. We wait for that document to be present in the change stream.
  3. Once we get that document, we flush/commit the changes.

On a mostly-idle database, we get that checkpoint document back almost immediately, triggering a flush as soon as possible. On a very busy database, where replication lag builds up, we only get the document back once replication has caught up to the point where it was created. That increases the number of documents in the batch, increasing throughput and allowing us to catch up faster.
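As a rough illustration, a minimal sketch of this flush-trigger loop is shown below. The stream_id field, the flush() callback, and the database name are assumptions for the sketch, not the actual PowerSync schema; the real implementation lives in ChangeStream.ts and MongoRelation.ts.

```ts
import { MongoClient, ObjectId } from 'mongodb';

// Minimal sketch of the checkpoint-based batching loop described above.
// Names other than _powersync_checkpoints are illustrative.
async function replicateWithBatching(client: MongoClient, flush: () => Promise<void>) {
  const db = client.db('powersync_demo');
  const checkpoints = db.collection('_powersync_checkpoints');
  const streamId = new ObjectId(); // identifies this replication stream
  let pendingCheckpointId: ObjectId | null = null;

  for await (const change of db.watch()) {
    // Sketch only: handle inserts; real code also handles updates, deletes, etc.
    if (change.operationType !== 'insert') continue;

    if (change.ns.coll === '_powersync_checkpoints') {
      // Steps 2 and 3: when the checkpoint document we created comes back on
      // the change stream, the current batch is complete: flush it.
      if (pendingCheckpointId != null && pendingCheckpointId.equals(change.documentKey._id)) {
        await flush();
        pendingCheckpointId = null;
      }
      continue;
    }

    // ...buffer this change into the current batch (omitted)...

    // Step 1: if no checkpoint document is outstanding for this batch, create one.
    if (pendingCheckpointId == null) {
      pendingCheckpointId = new ObjectId();
      await checkpoints.insertOne({ _id: pendingCheckpointId, stream_id: streamId });
    }
  }
}
```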

The issue

The issue arises when we connect multiple PowerSync instances to the same source database: they share the same _powersync_checkpoints collection. This is not a problem under low load, but we've observed the following during initial replication:

  1. Instance A is busy with initial replication, which adds a delay to each flush. Normally, as explained above, that would not be an issue - we'd flush less often, but still maintain high throughput.
  2. Instance B performs normal replication at a high rate. Since it has no significant load, it flushes the changes often, creating a new checkpoint document each time.

Now the issue is that instance A receives all the checkpoint documents from instance B, causing it to attempt a flush at the same rate. It can't keep up at that rate, so it falls behind and builds up a replication lag.

The fix

The fix here is to identify where the checkpoint documents originate, and ignore checkpoint documents from other instances.

For now, we still process all checkpoint documents created for write checkpoints. These are a little more difficult to filter out, due to the checkpoint originating from a different process. If these do become an issue, we can investigate filtering and/or throttling these later.
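In spirit, the filter looks something like the sketch below. This is a hedged illustration: the owner_id field, the STANDALONE_CHECKPOINT_ID value, and the helper function are assumptions, not the merged code, which lives in ChangeStream.ts and MongoRelation.ts.

```ts
import { ObjectId } from 'mongodb';

// Illustrative value; the real constant is defined in MongoRelation.ts.
const STANDALONE_CHECKPOINT_ID = 'standalone';

// Hypothetical shape of a checkpoint document, tagged with the id of the
// replication stream that created it (or STANDALONE_CHECKPOINT_ID for write
// checkpoints created by other processes).
interface CheckpointDoc {
  _id: ObjectId;
  owner_id: ObjectId | string;
}

// Sketch of the fix: process our own batch checkpoints and all standalone
// (write) checkpoints, but ignore batch checkpoints from other instances.
function shouldProcessCheckpoint(doc: CheckpointDoc, ownStreamId: ObjectId): boolean {
  if (doc.owner_id === STANDALONE_CHECKPOINT_ID) {
    return true; // write checkpoints are still processed for now
  }
  return doc.owner_id instanceof ObjectId && ownStreamId.equals(doc.owner_id);
}
```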

Replication Lag Metrics

To help diagnose issues like these, a new replication lag metric is added in #272.


Testing

To test the issue locally, I ran two PowerSync instances against the same source database.

  1. First instance runs normally.
  2. In the second instance, I introduced an artificial delay of 2s in MongoBucketBatch.flushInner() (see the sketch below).
  3. I then created a series of 10 small updates in the source database.
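The artificial delay was of roughly this form (a sketch of the kind of local test hack used, not part of the actual change):

```ts
// Local test hack: delay each flush by 2 seconds to simulate a slow/busy
// instance. In practice this was an await added at the start of
// MongoBucketBatch.flushInner(); expressed here as a standalone wrapper.
async function delayedFlush(flushInner: () => Promise<void>): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 2000)); // artificial 2s delay
  await flushInner();
}
```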

Before this change:

  1. The first instance processes each batch of changes quickly.
  2. The second instance also processes each batch of changes separately, but with the artificial delay. Replication lag keeps growing (measured using the powersync_replication_lag_seconds metric from #272).

With this change:

  1. The first instance processes each batch of changes quickly.
  2. The second instance automatically uses larger batch sizes as the replication lag grows, keeping the replication lag around 2-4s.

changeset-bot (bot) commented May 27, 2025

🦋 Changeset detected

Latest commit: 5b689f8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 11 packages
Name Type
@powersync/service-module-mongodb Patch
@powersync/service-core Patch
@powersync/service-image Patch
@powersync/service-schema Patch
@powersync/service-core-tests Patch
@powersync/service-module-core Patch
@powersync/service-module-mongodb-storage Patch
@powersync/service-module-mysql Patch
@powersync/service-module-postgres-storage Patch
@powersync/service-module-postgres Patch
test-client Patch

@rkistner changed the title from "[MongoDB] Fix replication batching / Replication lag metrics" to "[MongoDB] Fix replication batching" on May 28, 2025
@rkistner force-pushed the fix-replication-batching branch from e535382 to 9370f38 on May 28, 2025 08:25
@rkistner marked this pull request as ready for review on May 28, 2025 08:46
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes replication batching issues by adjusting how checkpoint documents are created and processed so that each PowerSync instance only processes its own checkpoints. Key changes include:

  • Removal of the redundant getReplicationHead method in the RouteAPI interface.
  • Updates to the createCheckpoint API across modules to include a new STANDALONE_CHECKPOINT_ID.
  • Improvements in the ChangeStream logic to filter out checkpoint events not originating from the current process.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

File Description
packages/service-core/src/api/RouteAPI.ts Removed the getReplicationHead method to simplify the API.
modules/module-mongodb/test/src/change_stream_utils.ts Updated tests to use the new createCheckpoint signature with STANDALONE_CHECKPOINT_ID.
modules/module-mongodb/src/replication/MongoRelation.ts Modified createCheckpoint to accept an id parameter and added a new constant for standalone checkpoints.
modules/module-mongodb/src/replication/ChangeStream.ts Updated checkpoint handling in the ChangeStream to filter native and batch checkpoints properly.
modules/module-mongodb/src/api/MongoRouteAPIAdapter.ts Adjusted checkpoint creation to use STANDALONE_CHECKPOINT_ID.
.changeset/beige-clouds-cry.md Documented the patch releases for affected modules.
Comments suppressed due to low confidence (2)

modules/module-mongodb/src/replication/MongoRelation.ts:154

  • Typo in documentation: 'immeidately' should be corrected to 'immediately'.
 * Use this for write checkpoints, or any other case where we want to process the checkpoint immeidately, and not wait for batching.

modules/module-mongodb/src/replication/ChangeStream.ts:764

  • The expression 'this.checkpointStreamId.equals(this.checkpointStreamId)' will always return true. It is likely intended to compare 'this.checkpointStreamId' with 'checkpointId'.
if (!(checkpointId == STANDALONE_CHECKPOINT_ID || this.checkpointStreamId.equals(this.checkpointStreamId))) {

@rkistner marked this pull request as draft on May 28, 2025 09:24
@rkistner marked this pull request as ready for review on May 28, 2025 09:59
@stevensJourney (Collaborator) left a comment

Looks good to me :)

@rkistner merged commit b57f938 into main on May 28, 2025
20 checks passed
@rkistner deleted the fix-replication-batching branch on May 28, 2025 14:05