
[Postgres] Use Incremental Snapshot Framework for Postgres CDC Connector #1823

Closed · wants to merge 7 commits

Conversation

@xiaom (Contributor) commented Dec 12, 2022

Co-Authored-By: Yaroslav Tkachenko 260702+sap1ens@users.noreply.github.com

Hey @leonardBang,

This is our first PR resolving #1163 😄

The core DataStream functionality is implemented under the package com.ververica.cdc.connectors.postgres.source, with a layout similar to the MySQL/Oracle incremental snapshot implementations.

source
├── PostgresChunkSplitter.java
├── PostgresConnectionPoolFactory.java
├── PostgresDialect.java
├── PostgresSourceBuilder.java
├── config
│   ├── PostgresSourceConfig.java
│   ├── PostgresSourceConfigFactory.java
│   └── PostgresSourceOptions.java
├── fetch
│   ├── PostgresScanFetchTask.java
│   ├── PostgresSourceFetchTaskContext.java
│   └── PostgresStreamFetchTask.java
├── offset
│   ├── PostgresOffset.java
│   └── PostgresOffsetFactory.java
└── utils
    ├── PgQueryUtils.java
    ├── PgSchema.java
    ├── PgTypeUtils.java
    └── TableDiscoveryUtils.java

The corresponding Table API support can be enabled by setting scan.incremental.snapshot.enabled=true.
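For illustration, enabling it in a Flink SQL DDL might look like the sketch below. Only scan.incremental.snapshot.enabled comes from this PR; the table schema and the remaining connection options are placeholder values following the existing postgres-cdc connector's option names:

```sql
CREATE TABLE shipments (
    shipment_id INT,
    order_id INT,
    is_arrived BOOLEAN,
    PRIMARY KEY (shipment_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'localhost',
    'port' = '5432',
    'username' = 'postgres',
    'password' = 'postgres',
    'database-name' = 'postgres',
    'schema-name' = 'public',
    'table-name' = 'shipments',
    'slot.name' = 'flink',
    -- opt in to the incremental snapshot framework added by this PR
    'scan.incremental.snapshot.enabled' = 'true'
);
```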

A few notes:

package io.debezium.connector.postgresql

This package exists mostly to work around access restrictions on some Debezium classes we need to use directly.

  • Utils.java: a utility class to access some package-private methods of Debezium
  • PostgresObjectFactory.java: a factory that creates various Debezium objects whose constructors need package-private access
  • PostgresConnection.java: copied from Debezium 1.6.4-final and modified to support injecting a connection factory backed by a Hikari connection pool

Major changes to CDC-base

  • JdbcSourceConfig: add a new field List<String> schemaList to make it compatible with PostgreSQL and Debezium's terminology (see Debezium's TableId)
  • DataSourceDialect: extend the CheckpointListener interface so that we can add a customized hook to commit offsets
  • SourceSplitSerializer: fixed a deserialization bug and added a check of the useCatalogBeforeSchema flag (true by default)
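As a rough illustration of the CheckpointListener extension, the sketch below uses stand-in interfaces rather than Flink's or the connector's real classes (all names here are hypothetical): the dialect gets a default no-op notifyCheckpointComplete, and the Postgres dialect overrides it to commit the replication offset once a checkpoint completes.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for Flink's CheckpointListener; sketch only. */
interface CheckpointListenerSketch {
    void notifyCheckpointComplete(long checkpointId);
}

/** Stand-in for cdc-base's DataSourceDialect, extended with a default no-op hook. */
interface DataSourceDialectSketch extends CheckpointListenerSketch {
    @Override
    default void notifyCheckpointComplete(long checkpointId) {
        // Most dialects need no action when a checkpoint completes.
    }
}

/** The Postgres dialect overrides the hook to commit the replication offset. */
class PostgresDialectSketch implements DataSourceDialectSketch {
    final List<String> committedOffsets = new ArrayList<>();

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // In the real connector this would confirm the flushed LSN on the
        // replication slot so Postgres can reclaim WAL; here we just record it.
        committedOffsets.add("commit@" + checkpointId);
    }
}
```

The default method keeps existing dialects source-compatible; only dialects that need to acknowledge progress to the database override the hook.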

Other changes not related to PostgreSQL

  • Fix a bug where the high watermark is not set properly in various *ScanFetchTask

Notes on Approximate Count query for Postgres

We use the following query in the chunk splitter to estimate the approximate row count:

SELECT reltuples::bigint FROM pg_class WHERE oid = to_regclass('your_table_id')

The query requires a prior run of VACUUM or ANALYZE to produce a good estimate. On any PostgreSQL instance with autovacuum enabled, you won't need to worry about this.
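Concretely (the table name here is just an example), you can refresh the statistics manually and then check the estimate:

```sql
-- Refresh planner statistics; autovacuum normally does this for you
ANALYZE public.shipments;

-- Cheap row-count estimate used by the chunk splitter
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = to_regclass('public.shipments');
```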

We are also actively working on supporting the scan.newly-added-table.enabled feature as @sap1ens mentioned here.

Appreciate any feedback!

@leonardBang leonardBang self-requested a review December 12, 2022 06:20
@xiaom (Contributor, Author) commented Dec 16, 2022

Update:

  • fix some issues in JdbcSourceFetchTaskContext and set proper default implementations for several methods (see e040533)

@leonardBang (Contributor)

@xiaom We planned to hold a contributor sync meeting to discuss the 2.4 roadmap. Are you interested in joining? Please contact me if you'd like to.

@xiaom (Contributor, Author) commented Feb 8, 2023

Hey @leonardBang, thanks for the invitation! Yes, I am interested; I will DM you on Twitter.

@1032851561

Does it support snapshotting newly added tables? I need this feature. Does it work well?

@sap1ens (Contributor) commented Apr 10, 2023

Does it support snapshotting newly added tables? I need this feature. Does it work well?

There is a separate PR for that: #1838

@1032851561

Does it support snapshotting newly added tables? I need this feature. Does it work well?

There is a separate PR for that: #1838

Nice, I hope it gets merged as soon as possible.

@ruanhang1993 ruanhang1993 self-requested a review April 20, 2023 06:27
@ruanhang1993 (Contributor) left a comment


@xiaom , thanks for your work. I left some comments.

@ruanhang1993 (Contributor) commented May 23, 2023

Hi, @xiaom.
Do you have time to rebase onto the master branch and make some updates?
Please @ me when you update this PR and need me to review. Thanks ~

@xiaom (Contributor, Author) commented May 24, 2023

Hi @ruanhang1993, thanks for the review! I will find some time to update the PR either later this week or next week.

@ruanhang1993 (Contributor)

Hi, @xiaom .

Is there any update on this PR?
We plan to release version 2.4.0 on June 14th, and this feature is part of that release.
If any update is pushed, I will review again as soon as possible.

Thanks a lot~

@xiaom (Contributor, Author) commented Jun 5, 2023

Apologies for the delay in updating the PR; some unexpected personal commitments came up.
I'll do my best to get this PR updated.

@xiaom (Contributor, Author) commented Jun 7, 2023

Hey @ruanhang1993, I've addressed some comments in this commit for you to review. Let me know what you think. I have not rebased the branch yet. I will do it in the next update.


Also, I'd like to point out a caveat of this feature for any potential users: its scalability with large tables is not ideal.

In the snapshotting phase, backfill tasks are created to capture new data changes. However, for larger tables, since snapshotting takes longer, WAL also grows larger and backfilling tasks will take significantly more time.

Unlike MySQL, where the process can be parallelized through additional binlog readers, this isn't straightforward for Postgres. Achieving similar parallelism would require more replication slots, a resource whose limited availability makes it inadvisable to overuse.

In light of this, we implement a snapshot-only reader (with option snapshot.mode=initial_only) coupled with a stream reader (snapshot.mode=never) and some dedupe processes in our production environment. This approach allows us to parallelize snapshotting without increasing the number of replication slots. Just want to mention this in case anyone wants to use similar strategies.
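To make this workaround concrete, here is a sketch of the two-source setup in Flink SQL. The snapshot.mode values are the ones mentioned above; passing them through as 'debezium.snapshot.mode' table options, along with the table schema and connection details, are assumptions for illustration only. Downstream, the two streams are unioned and deduplicated on the primary key.

```sql
-- Source 1: snapshot-only reader; can be parallelized freely because it
-- does not need to hold a replication slot open after the snapshot finishes
CREATE TABLE orders_snapshot (
    order_id INT,
    status STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'localhost',
    'port' = '5432',
    'username' = 'postgres',
    'password' = 'postgres',
    'database-name' = 'postgres',
    'schema-name' = 'public',
    'table-name' = 'orders',
    'debezium.snapshot.mode' = 'initial_only'
);

-- Source 2: stream-only reader; the single replication slot lives here
CREATE TABLE orders_stream (
    order_id INT,
    status STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'localhost',
    'port' = '5432',
    'username' = 'postgres',
    'password' = 'postgres',
    'database-name' = 'postgres',
    'schema-name' = 'public',
    'table-name' = 'orders',
    'slot.name' = 'flink',
    'debezium.snapshot.mode' = 'never'
);

-- A dedupe step over the union of both tables (e.g. keeping the latest
-- change per order_id) reconciles rows seen by both readers.
```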

@ruanhang1993 (Contributor)

Hi, @xiaom.
Thanks for the quick reply. I have replied to the unclear comments.

About the problem you mentioned: the snapshot phase for big tables is indeed a common pain point.
Snapshot-only plus starting from a specific binlog position is a good idea, but this approach also has some limits.

  • If the snapshot phase runs with a single parallelism, we have to keep the binlogs alive long enough.
  • If the snapshot phase runs with multiple parallelisms, we can only provide at-least-once semantics, so the sink must support idempotent operations, and we still need to make sure the binlogs stay alive.

Issue #1687 for MySQL targets this usage.

@xiaom (Contributor, Author) commented Jun 9, 2023

Hey @ruanhang1993,

I've rebased the PR.

Also, thanks for mentioning various solutions for parallelized snapshotting. Good to know that this is a common pain point.

@ruanhang1993 (Contributor) left a comment

Hi, @xiaom. I have reviewed the cdc-base part and will review the pg cdc part later.
Would you mind taking a look at the failed CI? Thanks ~

Review thread on docs/content/connectors/postgres-cdc.md (outdated, resolved)
@@ -358,6 +359,8 @@ private void writeTableIds(Collection<TableId> tableIds, DataOutputSerializer out)
    final int size = tableIds.size();
    out.writeInt(size);
    for (TableId tableId : tableIds) {
        boolean useCatalogBeforeSchema = SerializerUtils.shouldUseCatalogBeforeSchema(tableId);
        out.writeBoolean(useCatalogBeforeSchema);
Comment from a reviewer (Contributor):

This will make the state from 2.3.0 unreadable in 2.4.0.
We should bump the state serializer version and use version-specific logic.
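A version-gated deserializer along the lines suggested might look like this sketch (class names, version numbers, and the serialized layout are all hypothetical): state written by the new version carries the useCatalogBeforeSchema flag, while state read under the old version falls back to the previous default of true.

```java
import java.io.*;

/** Sketch only: names, versions, and format are hypothetical. */
class SplitSerializerSketch {
    static final int VERSION_OLD = 3; // e.g. 2.3.0: table id only
    static final int VERSION_NEW = 4; // e.g. 2.4.0: flag, then table id

    /** Old-style format: table id only, no flag. */
    static byte[] serializeLegacy(String tableId) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(tableId);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** New format: write the flag before the table id. */
    static byte[] serialize(String tableId, boolean useCatalogBeforeSchema) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             DataOutputStream out = new DataOutputStream(bos)) {
            out.writeBoolean(useCatalogBeforeSchema);
            out.writeUTF(tableId);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Gate the read on the serializer version so old state stays readable. */
    static String deserialize(byte[] state, int version) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(state))) {
            boolean useCatalogBeforeSchema =
                    version >= VERSION_NEW ? in.readBoolean() : true; // old default
            // Tag the result so the demo surfaces which branch was taken
            return (useCatalogBeforeSchema ? "catalog-first|" : "schema-first|")
                    + in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The key point is that the version recorded with the state, not the bytes themselves, decides whether the flag is expected, so upgraded jobs can restore snapshots taken by the previous release.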

@ruanhang1993 (Contributor)

Hi, @xiaom.
I have finished reviewing the PR. Please take a look at it and make the CI pass; then we can merge this PR.
Thanks ~

@xiaom (Contributor, Author) commented Jun 13, 2023

I've addressed some of the review feedback (marked with a 👍 emoji) as part 1 in 0861f46. I will continue with the remaining comments and fix CI later.

5 participants