The syncing task is running, but binlog events are in fact not synced to the downstream #11391

Open
giant-panda666 opened this issue Jul 5, 2024 · 8 comments
Labels
area/dm (Issues or PRs related to DM), severity/moderate, type/bug (The issue is confirmed as a bug)

Comments

@giant-panda666

giant-panda666 commented Jul 5, 2024

What did you do?

  1. Ran the syncing task; the worker log is as below:
    [screenshot of worker log]
  2. After some time, the task is still running, but the syncerBinlog position is not updated:
    [screenshot]
  3. If I run the resume command, it returns an error:
    [screenshot]
    and syncerBinlog in subTaskStatus is still not updated:
    [screenshot]
  4. If I run the pause command and then the resume command, the worker returns an error and then restarts the syncer. After that, the syncerBinlog position is updated again.
    [screenshot]
    [screenshot]

The DM version is v7.5.1, deployed by TiDB Operator.

What did you expect to see?

No response

What did you see instead?

No response

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
giant-panda666 added the area/dm and type/bug labels on Jul 5, 2024
@lance6716
Contributor

Please provide the full log so we know what happened. Maybe you can also check the binlog event at that position to see its content, and capture the goroutine stack of the DM-worker.
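
(For reference, a minimal sketch of how such a goroutine dump can be captured, assuming the DM-worker's status address exposes Go's standard net/http/pprof handlers; the address and default port 8262 below are assumptions for illustration, not taken from this issue.)

```go
// Hypothetical helper: fetch a full goroutine dump from a DM-worker's status
// address and save it to goroutine.txt. The address is an assumption; adjust
// it to your deployment.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks the pprof handler for a full, human-readable stack dump.
	resp, err := http.Get("http://127.0.0.1:8262/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("request goroutine profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine.txt")
	if err != nil {
		log.Fatalf("create output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("write goroutine dump: %v", err)
	}
	log.Println("saved goroutine dump to goroutine.txt")
}
```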

@giant-panda666
Author

All the logs are shown above. This is the goroutine stack:
goroutine.txt

@lance6716
Contributor

goroutine.txt

At the moment the goroutine dump was taken, dm-worker was waiting to read binlog from the relay component, and relay was waiting to read binlog from upstream MySQL. Did you see the lag at that moment?

@giant-panda666
Author

giant-panda666 commented Jul 10, 2024

There was no lag at that moment, and query-status showed that masterBinlog and relayBinlog were almost the same, but syncerBinlog was stuck at the last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0, evaluated every 1m for 10m, so by the time I found this problem at least 10 minutes had passed.
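
(For reference, a minimal sketch of evaluating that same expression on demand against Prometheus's HTTP query API; the Prometheus address is an assumption for illustration, while the query string is the rule quoted above.)

```go
// Evaluate the binlog-file lag expression via Prometheus's /api/v1/query
// endpoint. A non-empty result vector means at least one task's syncer is
// behind the master by one or more binlog files.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	promAddr := "http://127.0.0.1:9090" // assumed Prometheus address
	expr := `dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0`

	q := url.Values{}
	q.Set("query", expr)

	resp, err := http.Get(promAddr + "/api/v1/query?" + q.Encode())
	if err != nil {
		log.Fatalf("query prometheus: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("read response: %v", err)
	}
	fmt.Println(string(body))
}
```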

@lance6716
Contributor

> There was no lag at that moment, and query-status showed that masterBinlog and relayBinlog were almost the same, but syncerBinlog was stuck at the last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0, evaluated every 1m for 10m, so by the time I found this problem at least 10 minutes had passed.

Just to double-check: was the goroutine dump taken at a time when there was lag between relayBinlog and syncerBinlog? If so, there is a relay.meta file under the relay log folder of the dm-worker's relay directory. Can you show its content and the directory structure of the relay folder?
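
(A minimal sketch of gathering both pieces of information in one pass; the relay directory path below is an assumption for illustration, substitute the worker's configured relay-dir.)

```go
// Print the directory structure of the relay folder and the content of any
// relay.meta file found in it.
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	relayDir := "/var/lib/dm-worker/relay_log" // assumed path; adjust to your deployment

	err := filepath.WalkDir(relayDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		fmt.Println(path) // directory structure, one entry per line
		if !d.IsDir() && d.Name() == "relay.meta" {
			content, err := os.ReadFile(path)
			if err != nil {
				return err
			}
			fmt.Printf("---- %s ----\n%s\n", path, content)
		}
		return nil
	})
	if err != nil {
		log.Fatalf("walk relay dir: %v", err)
	}
}
```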

@giant-panda666
Author

Yes, when I received the alert, I took the goroutine dump and then recovered the task with pause-task and resume-task. I may need to wait until it reproduces before I can record relay.meta and the directory structure, because it has already been recovered.

@giant-panda666
Author

giant-panda666 commented Jul 10, 2024

goroutine.txt

One question: why are there 26 connections (counted by grepping goroutine.txt for "created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher in goroutine 294") created by handleSourceBound? The syncer worker-count is 16 and the validator worker-count is 4, so where do the other 6 connections come from?
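
(For reference, a minimal Go sketch equivalent to the grep described above, counting startWatcher stacks in the attached goroutine.txt; the file name is the attachment from this thread.)

```go
// Count goroutines in a dump file that were created by the go-sql-driver
// connection watcher, i.e. roughly one per open MySQL connection.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("goroutine.txt")
	if err != nil {
		log.Fatalf("open dump: %v", err)
	}
	defer f.Close()

	const marker = "created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher"
	count := 0
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // stack lines can be long
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), marker) {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("scan dump: %v", err)
	}
	fmt.Printf("connections with a startWatcher goroutine: %d\n", count)
}
```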

@giant-panda666
Author

giant-panda666 commented Jul 25, 2024

@lance6716 It has reproduced.
dm-worker log:
[screenshot: dm-worker log]
relay.meta:
[screenshot: relay.meta content]
query-status result:
[screenshot: query-status output]

data.012136 was created just after I captured the dm-worker log, so the relay log works fine. Maybe it's related to a reboot of the dm-worker's host: dm-master thinks the sync task is running, but in fact it isn't. The problem did not reproduce until the host was rebooted.
