Replication lag #864

Open
tellienne opened this issue Oct 10, 2024 · 10 comments

@tellienne

Hello!
Thank you for your work, we really like the lightweight connector.

Recently we noticed that replication starts to lag at some point, even though there are not many events in the database.
After the snapshot, replication works fine for a couple of hours, but then a lag appears and keeps growing.
We have a heartbeat table into which Debezium inserts a value every 15 seconds, and from it the start of the lag is clearly visible.
Our connector is deployed in k8s, v2.3.0-lt.

At that moment there is enough CPU and RAM on the k8s node, and in ClickHouse and Postgres as well.
Do you have any idea why replication becomes so slow?

Also, at that moment inserts into the altinity_sink_connector.replica_source_info table in ClickHouse become slow: several seconds may pass between a row being deleted and a new one appearing.
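
One way to quantify that slowness on the ClickHouse side is to pull the statements touching the offset table out of system.query_log (a sketch only; it assumes query logging is enabled, which is the default, and the one-hour window is arbitrary):

    SELECT
        event_time,
        query_kind,
        query_duration_ms
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 1 HOUR
      AND type = 'QueryFinish'
      AND query_kind != 'Select'
      AND query LIKE '%replica_source_info%'
    ORDER BY event_time DESC
    LIMIT 50;

Long query_duration_ms values here would point at ClickHouse itself; short durations with long gaps between consecutive statements would point at the connector side.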

@tellienne
Author

We see in the metrics:
clickhouse_sink_db_lag - 699622
clickhouse_sink_debezium_lag - 699598

Sometimes the lag grows to 5583808.
How can we tell where the data transfer is slowing down? :(
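
A quick end-to-end check, independent of the connector metrics, is to compare the newest heartbeat row in ClickHouse with the current time (a sketch; debezium_heartbeat is the heartbeat table named in this thread, while the timestamp column ts is an assumption, so use whatever column your heartbeat table actually has):

    SELECT
        max(ts) AS last_heartbeat,
        dateDiff('second', max(ts), now()) AS seconds_behind
    FROM debezium_heartbeat;

With Debezium writing a heartbeat every 15 seconds, seconds_behind staying well above 15 confirms the pipeline really is falling behind rather than the metrics being off.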

@subkanthi
Collaborator

We see in the metrics:
clickhouse_sink_db_lag - 699622
clickhouse_sink_debezium_lag - 699598

Sometimes the lag grows to 5583808.
How can we tell where the data transfer is slowing down? :(

Hi @tellienne, are you able to check the CPU and memory usage of the sink connector?

@tellienne
Author

are you able to check the CPU and memory usage of the sink connector?

Judging by the k8s metrics, CPU and memory usage have remained virtually unchanged since the first start. During the snapshot the connector uses 4 GB of memory; after the snapshot, usage does not decrease.
During the snapshot 2 cores are used; after the snapshot, about 150-200m CPU.

When replication starts to slow down, there is no change in CPU or memory usage.

The debezium_postgres_connector_metrics_TotalNumberOfEventsSeen metric shows that 92,000 events were seen in the last 7 hours.

What other information can we provide?

@subkanthi
Collaborator

Can you check the query_log on the ClickHouse side to see how long the queries take to execute?
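
A sketch of such a check, summarising insert durations per database from system.query_log (the one-hour window is arbitrary; widen it to cover the lag period):

    SELECT
        current_database,
        count()                           AS inserts,
        avg(query_duration_ms)            AS avg_ms,
        quantile(0.95)(query_duration_ms) AS p95_ms,
        max(query_duration_ms)            AS max_ms
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 1 HOUR
      AND type = 'QueryFinish'
      AND query_kind = 'Insert'
    GROUP BY current_database
    ORDER BY max_ms DESC;

If these numbers stay small while the lag keeps growing, the time is being spent before the inserts ever reach ClickHouse.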

@tellienne
Author

tellienne commented Oct 14, 2024

We checked during the last lag episode.
In 2 hours, the Debezium metric recorded 70,529 events.
The increase was sharp: 38,825 events were recorded in the first half hour.
During these two hours the WAL files grew by 1 GB.
Query execution time on the ClickHouse side ranges from 17 to 120 ms.
The clickhouse_sink_debezium_lag metric for the debezium_heartbeat table showed a lag of 2,353,267 ms.
Data is inserted into that table every 15,000 ms.
At the same time, the debezium_postgres_connector_metrics_MilliSecondsBehindSource metric shows no more than 500 ms during the lag.
In total, the query_log shows 25,113 queries during this time against the database where the lag was observed.
We used this query to calculate it:

    SELECT type, current_database, query_kind, count(written_rows) AS total
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 3 HOUR
      AND type = 'QueryFinish'
      AND query_kind NOT IN ('Select')
      AND current_database NOT IN ('altinity_sink_connector', 'default')
      AND written_rows != 0
    GROUP BY type, current_database, query_kind;

debezium_postgres_connector_metrics_QueueTotalCapacity = 81290
The debezium_postgres_connector_metrics_QueueRemainingCapacity metric never drops below 81,000 during the lag.

@subkanthi
Collaborator

Data is inserted into that table every 15,000 ms.

Could this be causing the lag? Are you running with single.threaded: "true" or multi-threaded?

If it's multi-threaded, you might want to increase the buffer flush time:
buffer.flush.time.ms and buffer.max.records

@tellienne
Author

tellienne commented Oct 15, 2024

Are you running with single.threaded: "true" or multi-threaded?

We use the default settings. As I understand it, it works multi-threaded by default.
Should we also increase thread.pool.size? Currently the config has thread.pool.size=20.

@tellienne
Author

tellienne commented Oct 15, 2024

Added to the config:

thread.pool.size: 20
slot.stream.params: publication_names=catalog_publication
buffer.flush.time.ms: "15000"
buffer.max.records: "100000"

But we see a lag again. In half an hour there are about 35,000 events and the lag is already 30 minutes, judging by the heartbeat table. There seems to be no effect.
The issue looks similar to the one described in #598.

@tellienne
Author

tellienne commented Oct 16, 2024

Thanks for the ideas and your help :)
We tried switching the connector to single-threaded mode and are monitoring performance.
But we noticed that after the switch, entries no longer appear in the altinity_sink_connector.replica_source_info table. It's as if the connector has stopped committing the offset. Is this expected behavior?

The status endpoint only shows

[{"Seconds_Behind_Source":0},{"Replica_Running":true},{"Database":"system"}]

without an LSN.

UPD: we updated the connector to version 2.3.1 and the offset problem in single-threaded mode was fixed.
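
For reference, the switch itself is a single key in the connector config, in the same style as the config snippet above (a sketch; single.threaded is the key mentioned earlier in this thread):

    single.threaded: "true"

Presumably this makes the sink apply batches on one thread instead of the thread.pool.size worker pool.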

@tellienne
Author

It looks like switching to single.threaded solved the problem. Over the past few hours, the clickhouse_sink_debezium_lag metric has not shown more than 120,000 ms.
It's strange that multi-threaded mode works slower than single-threaded mode.
We switched all connectors to single-threaded mode and saw a performance improvement in all of them compared to multi-threaded mode.
