DB Pipes: Docs on sync control, resync, initial load and more #4037


Open
wants to merge 10 commits into main
Conversation

Amogh-Bharadwaj (Contributor)

Summary

This PR adds more documentation around DB ClickPipes.

Checklist

@Amogh-Bharadwaj Amogh-Bharadwaj requested review from a team as code owners July 7, 2025 20:34
@Amogh-Bharadwaj Amogh-Bharadwaj requested a review from mshustov July 7, 2025 20:34
vercel bot commented Jul 7, 2025 (deployment status as of Jul 9, 2025 1:56pm UTC): clickhouse-docs ❌ Failed; clickhouse-docs-jp, clickhouse-docs-ru, and clickhouse-docs-zh ⬜️ Ignored.

### Pull batch size {#pull-batch-size}
The pull batch size is the number of records the ClickPipe pulls from the source database in one batch. Records here means the inserts, updates, and deletes performed on the tables that are part of the pipe.

The default is **100,000** records.
Review comment (Contributor):

Call out a safe maximum, ~10 million for now

The MySQL ClickPipe uses a column on your source table, called the **partition key column**, to logically partition the source table. The resulting partitions can then be processed in parallel by the ClickPipe.

:::warning
The partition key column must be indexed in the source table for partitioned reads to deliver a performance boost.
:::
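
To check the index manually, a query along these lines can help (a sketch, assuming placeholder schema `mydb`, table `orders`, and partition key column `id`; it lists MySQL index entries covering that column):

```sql
-- Sketch: list index entries that cover the partition key column.
-- `mydb`, `orders`, and `id` are placeholder names for illustration.
SELECT index_name, seq_in_index, non_unique
FROM information_schema.statistics
WHERE table_schema = 'mydb'
  AND table_name   = 'orders'
  AND column_name  = 'id';
```

If this returns no rows, the column is not covered by any index.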
Review comment (Contributor):

Is it possible to validate that there is an index on the partition column?

<img src={snapshot_params} alt="Snapshot parameters" />

#### Snapshot number of rows per partition {#snapshot-number-of-rows-per-partition}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and each chunk will be processed in parallel. The default value is 100,000 rows per partition.
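
To get a feel for the arithmetic, you can estimate the partition count up front (a sketch, assuming a source table named `orders` and the default of 100,000 rows per partition):

```sql
-- Sketch: estimate how many partitions the snapshot will produce
-- at the default of 100,000 rows per partition.
SELECT ceil(count(*) / 100000.0) AS estimated_partitions
FROM orders;
```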
Review comment (Contributor):

nit: remove "each" as it can give the impression of all chunks processing at once
"chunks will be processed in parallel"

### Monitoring parallel snapshot in Postgres {#monitoring-parallel-snapshot-in-postgres}
You can analyze **pg_stat_activity** to see the parallel snapshot in action. The ClickPipe creates multiple connections to the source database, each reading a different partition of the source table. If you see **FETCH** queries with different CTID ranges, the ClickPipe is reading the source tables. You can also see the COUNT(*) query and the partitioning query here.
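
A query along these lines surfaces those connections (a sketch; the exact query text issued by the ClickPipe may differ):

```sql
-- Sketch: spot parallel snapshot activity from the ClickPipe.
-- Each FETCH over a distinct ctid range corresponds to one partition read.
SELECT pid, state, left(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE query ILIKE '%FETCH%'
   OR query ILIKE '%ctid%';
```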

### Limitations {#limitations}
Review comment (Contributor):

Call out compressed hypertables here mayhaps?
