stability (cdc) improve the stability of TiCDC #10343

Open · 15 of 20 tasks
zhangjinpeng87 opened this issue Dec 21, 2023 · 1 comment
zhangjinpeng87 commented Dec 21, 2023

How to define stability of TiCDC?

As a distributed system, TiCDC should continuously provide service with predictable replication lag under any reasonable situation, such as a single TiCDC node failure, a single upstream TiKV node failure, a single upstream PD node failure, a planned rolling upgrade/restart of TiCDC or of the upstream TiDB cluster, or a temporary network partition between one TiCDC node and the other TiCDC nodes. TiCDC should recover the replication lag quickly by itself and tolerate these different resilience cases.
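
For reference, the replication lag discussed here can be derived from a changefeed's checkpoint TSO, since a TiDB TSO encodes a physical timestamp in milliseconds in its upper bits (shifted left by 18). Below is a minimal sketch in Go, not TiCDC's own implementation; the checkpoint TSO value is a hypothetical example (in practice it could come from `cdc cli changefeed query` or TiCDC metrics):

```go
package main

import (
	"fmt"
	"time"
)

// tsoToTime converts a TiDB TSO (physical milliseconds shifted left by 18
// bits, plus a logical counter in the low bits) into wall-clock time.
func tsoToTime(tso uint64) time.Time {
	physicalMillis := int64(tso >> 18)
	return time.UnixMilli(physicalMillis)
}

// replicationLag returns how far the changefeed checkpoint lags behind now.
func replicationLag(checkpointTSO uint64, now time.Time) time.Duration {
	return now.Sub(tsoToTime(checkpointTSO))
}

func main() {
	// Hypothetical checkpoint TSO, e.g. copied from `cdc cli changefeed query`.
	var checkpointTSO uint64 = 446975166749933570
	fmt.Println("replication lag:", replicationLag(checkpointTSO, time.Now()))
}
```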

Expected replication lag SLO under different cases

Category Case Description Expected Behavior
Planned Operations Rolling upgrade/restart TiCDC replication lag < 5s
Scale-in/scale-out TiCDC replication lag < 5s
Rolling upgrade/restart upstream PD replication lag < 5s
Rolling upgrade/restart upstream TiKV replication lag < 10s
Scale-in/scale-out upstream TiDB replication lag < 5s
Rolling upgrade/restart downstream Kafka brokers begin to sink ASAP kafka resumed
Rolling upgrade/restart downstream MySQL/TiDB begin to sink ASAP kafka resumed
Unplanned Failures Single TiCDC node (random one) permanent failure replication lag < 1min
Single TiCDC node temporarily failure for 5 minutes replication lag < 1min
PD leader permanent failure or temporarily failure for 5 minutes replication lag < 5s
Network partition between one TiCDC node and PD leader for 5 minutes replication lag < 5s
Network partition between one TiCDC node and other TiCDC nodes replication lag < 5s
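
To make the SLO table above verifiable, a stability test could sample the changefeed lag while the corresponding operation (rolling restart, node kill, network partition, ...) is injected, and compare the worst observed lag against the expected value. Below is a minimal sketch, assuming a `getLag` callback that is hypothetical here; in practice it would be backed by the checkpoint-based calculation above or by TiCDC's Prometheus metrics:

```go
package main

import (
	"fmt"
	"time"
)

// watchLag polls the current replication lag via getLag for the given period
// and returns the maximum lag observed, so it can be compared against the
// SLO of the scenario being exercised.
func watchLag(getLag func() (time.Duration, error), period, interval time.Duration) (time.Duration, error) {
	var maxLag time.Duration
	deadline := time.Now().Add(period)
	for time.Now().Before(deadline) {
		lag, err := getLag()
		if err != nil {
			return maxLag, err
		}
		if lag > maxLag {
			maxLag = lag
		}
		time.Sleep(interval)
	}
	return maxLag, nil
}

func main() {
	// Placeholder lag source; a real check would read the changefeed
	// checkpoint (or metrics) and convert it to a lag.
	getLag := func() (time.Duration, error) { return 2 * time.Second, nil }

	slo := 5 * time.Second // e.g. rolling upgrade/restart of TiCDC
	maxLag, err := watchLag(getLag, 30*time.Second, time.Second)
	if err != nil {
		fmt.Println("lag check failed:", err)
		return
	}
	fmt.Printf("max lag %v, SLO %v, pass=%v\n", maxLag, slo, maxLag < slo)
}
```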

Principles for prioritizing TiCDC stability issues

We handle TiCDC stability issues with the following priorities:

  • Issues related to data correctness or data completeness are top priority (P0). We must fix them ASAP and cherry-pick the fixes to the other LTS versions.
  • Issues that cause the replication/changefeed to get stuck or fail in a way TiCDC cannot recover from by itself are P1, because when such an issue happens, we or the users have to intervene manually.
  • Issues that cause the replication lag to increase unexpectedly by a large amount are not as critical as P1 issues, but they breach TiCDC's SLO; we treat them as P2.
  • Enhancements such as reducing resource usage are P3.

Tasks Tracking

flowbehappy (Collaborator) commented:

#10157 (@zhangjinpeng1987): resolved ts gets stuck issue
