
Add powersync_replication_lag_seconds metric #272

Merged on Jun 3, 2025 (21 commits)

@rkistner (Contributor) commented May 28, 2025

Record replication lag in the logs and as a new metric, to help diagnose and alert on replication delays.

New metric:

# HELP powersync_replication_lag_seconds Replication lag between the source database and PowerSync instance
# TYPE powersync_replication_lag_seconds gauge
powersync_replication_lag_seconds 15

New logs on commit:

info: powersync_15 Flushed 2 + 0 + 2 updates, 1kb in 5ms. Last op_id: 99. Replication lag: 2s {"flushed":{"bucket_data_count":2,"current_data_count":2,"duration":5,"parameter_data_count":0,"replication_lag_seconds":2,"size":1002}}
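
For log-based alerting, the structured payload at the end of the line can be parsed directly. The following is a minimal sketch, assuming the JSON payload starts at the first `{` in the line (true for the sample above, but not guaranteed for every message); `FlushedPayload` and `parseFlushedPayload` are hypothetical names, not part of this PR.

```typescript
// Hypothetical shape, mirroring the fields visible in the sample log line.
interface FlushedPayload {
  bucket_data_count: number;
  current_data_count: number;
  duration: number;
  parameter_data_count: number;
  replication_lag_seconds: number;
  size: number;
}

// Extract the "flushed" object from a commit log line.
// Assumption: the JSON payload begins at the first "{" in the line.
function parseFlushedPayload(line: string): FlushedPayload | undefined {
  const start = line.indexOf("{");
  if (start < 0) {
    return undefined;
  }
  try {
    const parsed = JSON.parse(line.slice(start));
    return parsed?.flushed as FlushedPayload | undefined;
  } catch (_e) {
    // Not valid JSON: treat as a line without a payload.
    return undefined;
  }
}

const sampleLine =
  'info: powersync_15 Flushed 2 + 0 + 2 updates, 1kb in 5ms. Last op_id: 99. Replication lag: 2s ' +
  '{"flushed":{"bucket_data_count":2,"current_data_count":2,"duration":5,' +
  '"parameter_data_count":0,"replication_lag_seconds":2,"size":1002}}';

const flushed = parseFlushedPayload(sampleLine);
console.log(flushed?.replication_lag_seconds);
```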

Note that the logs and the metric measure slightly different things:

  1. The logs contain the delay for the transaction(s) being committed: the time between the commit in the source database and the creation of a new checkpoint in the PowerSync instance.
  2. The metric is a sample of the current delay: the time elapsed since the oldest pending transaction was committed to the source database. The metric is always 0 when there are no pending changes.
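
The two definitions above can be sketched as follows. This is an illustration only, with hypothetical function names and timestamps in milliseconds; it is not the code from this PR.

```typescript
// Lag as logged on commit: time from the source commit to checkpoint creation.
function committedLagSeconds(sourceCommitMs: number, checkpointCreatedMs: number): number {
  return (checkpointCreatedMs - sourceCommitMs) / 1000;
}

// Lag as sampled by the metric: age of the oldest pending transaction,
// or 0 when nothing is pending.
function sampledLagSeconds(oldestPendingCommitMs: number | undefined, nowMs: number): number {
  if (oldestPendingCommitMs === undefined) {
    return 0;
  }
  return (nowMs - oldestPendingCommitMs) / 1000;
}

const t0 = 1_000_000;
console.log(committedLagSeconds(t0, t0 + 2_000)); // 2
console.log(sampledLagSeconds(undefined, t0 + 60_000)); // 0
console.log(sampledLagSeconds(t0, t0 + 15_000)); // 15
```

The logged value is fixed once the commit happens, while the sampled value keeps growing for as long as a transaction stays pending.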

We generally calculate the lag as the difference between the source database timestamps and the PowerSync instance time. If either clock is off, the replication lag calculation will have a constant offset. This also means the replication lag could be reported as negative.
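
A small worked example of the clock-offset effect, under assumed numbers: a true delay of 2 seconds and an instance clock that runs 5 seconds behind the source database. The names are hypothetical.

```typescript
// Reported lag: instance clock reading minus the source commit timestamp.
function reportedLagSeconds(sourceCommitMs: number, instanceNowMs: number): number {
  return (instanceNowMs - sourceCommitMs) / 1000;
}

const sourceCommitMs = 10_000; // timestamp taken from the source database clock
const trueElapsedMs = 2_000; // actual time since the commit
const instanceClockOffsetMs = -5_000; // assumption: instance clock is 5s behind

// What the instance clock reads when the lag is measured.
const instanceNowMs = sourceCommitMs + trueElapsedMs + instanceClockOffsetMs;

console.log(reportedLagSeconds(sourceCommitMs, instanceNowMs)); // -3
```

The reported value is the true lag plus the clock offset (2 - 5 = -3), so the error is constant rather than growing over time.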

If the active sync rules are in an error state, we report the time since they could last persist a change.
If there are no active sync rules (e.g. when sync rules were deployed for the first time), the metric is not reported.
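
The reporting rules above can be summarized in a sketch. The `SyncRulesState` type and `lagToReportSeconds` function are hypothetical and only model the behavior described here; the actual implementation may be structured differently.

```typescript
// Hypothetical model of the three situations described above.
type SyncRulesState =
  | { kind: "none" } // no active sync rules
  | { kind: "active"; oldestPendingCommitMs?: number }
  | { kind: "error"; lastPersistedMs: number };

// Returns the lag to report in seconds, or undefined when no value is reported.
function lagToReportSeconds(state: SyncRulesState, nowMs: number): number | undefined {
  switch (state.kind) {
    case "none":
      // No active sync rules: the metric is not reported at all.
      return undefined;
    case "error":
      // Error state: time since a change could last be persisted.
      return (nowMs - state.lastPersistedMs) / 1000;
    case "active":
      // Normal case: age of the oldest pending transaction, 0 if none.
      return state.oldestPendingCommitMs === undefined
        ? 0
        : (nowMs - state.oldestPendingCommitMs) / 1000;
  }
}

console.log(lagToReportSeconds({ kind: "none" }, 60_000)); // undefined
console.log(lagToReportSeconds({ kind: "error", lastPersistedMs: 0 }, 60_000)); // 60
console.log(lagToReportSeconds({ kind: "active" }, 60_000)); // 0
```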


The initial implementation here used a normal gauge instead of an observable one (i.e. the value was set at specific points, rather than polled). I extended the metrics implementation to add a plain Gauge interface. Since then I've found that the observable gauge approach is just better, and switched to that, removing the Gauge interface again. Replication lag is better calculated when you measure it, rather than at specific events, and I think the same applies to most gauges.
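
To illustrate why polling suits this metric, here is a minimal sketch with two hypothetical gauge classes (`SetGauge` and `ObservableGauge` are stand-ins, not the service's metrics interfaces). It assumes replication stalls after one event while a transaction remains pending.

```typescript
// Plain gauge: holds whatever value was last set at an event.
class SetGauge {
  private value = 0;
  set(v: number): void {
    this.value = v;
  }
  read(): number {
    return this.value;
  }
}

// Observable gauge: computes the value whenever it is read (e.g. on scrape).
class ObservableGauge {
  constructor(private readonly observe: () => number) {}
  read(): number {
    return this.observe();
  }
}

let nowMs = 0; // fake clock for the example
const oldestPendingCommitMs = 0; // a transaction committed at t=0 and still pending
const currentLagSeconds = () => (nowMs - oldestPendingCommitMs) / 1000;

const setGauge = new SetGauge();
const observableGauge = new ObservableGauge(currentLagSeconds);

// At t=1s an event fires and updates the plain gauge.
nowMs = 1_000;
setGauge.set(currentLagSeconds());

// Replication then stalls; no further events. The scrape happens at t=30s.
nowMs = 30_000;
console.log(setGauge.read()); // 1 (stale)
console.log(observableGauge.read()); // 30 (current)
```

In this scenario the plain gauge under-reports the lag exactly when replication is stuck, which is the case where alerting matters most.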

@rkistner rkistner requested a review from Copilot May 28, 2025 08:42

changeset-bot bot commented May 28, 2025

🦋 Changeset detected

Latest commit: 7ef0fbd

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 13 packages

| Name | Type |
| --- | --- |
| @powersync/service-module-postgres-storage | Minor |
| @powersync/service-module-mongodb-storage | Minor |
| @powersync/service-core-tests | Minor |
| @powersync/service-module-postgres | Minor |
| @powersync/service-module-mongodb | Minor |
| @powersync/service-core | Minor |
| @powersync/service-module-mysql | Minor |
| @powersync/service-types | Minor |
| @powersync/service-schema | Minor |
| @powersync/service-image | Minor |
| @powersync/service-module-core | Patch |
| test-client | Patch |
| @powersync/lib-service-postgres | Patch |


@Copilot Copilot AI left a comment

Pull Request Overview

This pull request introduces a new replication lag metric for PowerSync, tracking the delay in seconds between the source database and PowerSync instance. It adds logging enhancements to include replication lag, updates method names for clarity, and revises the sync rules storage structure by adding an “active” flag.

  • Renamed API methods (e.g. getReplicationLag to getReplicationLagBytes) to clarify what the method returns.
  • Updated replication lag computation in multiple modules (Postgres, MySQL, MongoDB) with new internal state handling.
  • Enhanced logging messages to include replication lag data and improved sync rules tracking.

Reviewed Changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/service-core/src/api/RouteAPI.ts | Renamed lag API method to reflect units (bytes). |
| packages/service-core-tests/src/test-utils/general-utils.ts | Added active flag in test sync rule content. |
| modules/module-postgres/src/replication/WalStreamReplicator.ts | Added getReplicationLagMillis method with fallback using last commit/keepalive timestamps. |
| modules/module-postgres/src/replication/WalStream.ts | Integrated tracking of oldest uncommitted change for lag computation. |
| modules/module-postgres/src/replication/WalStreamReplicationJob.ts | Propagated last stream instance for lag reporting. |
| modules/module-mysql/src/replication/BinLogStream.ts & BinLogReplicator.ts | Introduced lag tracking methods and state updates in replication flows. |
| modules/module-mongodb/src/replication/*.ts | Enhanced ChangeStream logic and lag tracking for MongoDB replication. |
| modules/module-/src/storage/ | Extended flush/commit logic to log replication lag and added “active” sync rule states. |

Comments suppressed due to low confidence (3)

packages/service-core/src/api/RouteAPI.ts:50

  • Ensure the naming of this method clearly reflects the unit it returns. Since the new metric is reported in seconds, verify that 'Bytes' is the intended descriptor, or consider renaming to avoid any ambiguity.
getReplicationLagBytes(options: ReplicationLagOptions): Promise<number | undefined>;

modules/module-postgres-storage/src/storage/batch/PostgresBucketBatch.ts:316

  • Review the change in the return value when no persisted operations exist; ensure that returning true aligns with downstream logic expecting a successful commit.
return true;

modules/module-mongodb/src/replication/ChangeStream.ts:765

  • [nitpick] Verify that resetting 'oldestUncommittedChange' and 'isStartingReplication' after a successful commit covers all edge cases, ensuring that lag calculations remain accurate.
const didCommit = await batch.commit(lsn, { oldestUncommittedChange: this.oldestUncommittedChange });

@rkistner rkistner force-pushed the record-replication-lag branch from 509678a to e3d3671 Compare May 28, 2025 08:48
@rkistner rkistner force-pushed the record-replication-lag branch from 3c32bbe to bfeb4af Compare May 28, 2025 13:46
@Rentacookie Rentacookie left a comment

Looks good to me :)

@rkistner rkistner merged commit 0ccd470 into main Jun 3, 2025
21 checks passed
@rkistner rkistner deleted the record-replication-lag branch June 3, 2025 10:38