
changefeed lag reached more than 10min when injecting a network partition between the PD leader and PD followers #9229

Closed
Lily2025 opened this issue Jun 14, 2023 · 12 comments
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. type/enhancement The issue or PR belongs to an enhancement.

Comments

@Lily2025

Lily2025 commented Jun 14, 2023

What did you do?

1. Run TPC-C with 10 threads and 1000 warehouses.
2. After 10 minutes, inject a network partition isolating the PD leader from all PD followers.
   Fault start time: 2023-06-13 09:01:47
3. After 10 minutes, recover from the fault.
   Fault recovery time: 2023-06-13 09:11:48
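For reference, the partition in step 2 amounts to dropping traffic between the PD leader and each follower in both directions. The sketch below only builds the firewall commands rather than running them; the use of raw iptables (and any IPs passed in) is an assumption — chaos frameworks such as Chaos Mesh achieve the same effect declaratively with a NetworkChaos `partition` action.

```python
# Sketch: build iptables invocations that, when run on the PD leader host,
# drop all traffic to/from each PD follower (both directions).
# Follower IPs are hypothetical placeholders supplied by the caller.

def partition_cmds(follower_ips):
    """Commands to isolate this host (the PD leader) from every follower."""
    cmds = []
    for ip in follower_ips:
        cmds.append(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])
        cmds.append(["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"])
    return cmds

def recover_cmds(follower_ips):
    """Inverse commands (-D deletes the rule) for step 3, after the fault window."""
    cmds = []
    for ip in follower_ips:
        cmds.append(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
        cmds.append(["iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP"])
    return cmds
```

Each command list can be handed to `subprocess.run` on the leader host; keeping injection and recovery symmetric makes the 10-minute fault window easy to script.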

What did you expect to see?

The changefeed lag stays below 30s.

What did you see instead?

TiCDC lag exceeded 10 minutes after the fault was injected.
(screenshot: changefeed lag graph)
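The lag numbers above come from comparing the changefeed checkpoint TSO against wall-clock time. A TiDB TSO packs a millisecond Unix timestamp in its high bits and an 18-bit logical counter in its low bits, so the lag can be computed as below (a sketch; the sample values in the test are made up):

```python
LOGICAL_BITS = 18  # the low 18 bits of a TSO are the logical counter

def tso_physical_ms(tso: int) -> int:
    """Extract the physical part of a TSO: Unix epoch time in milliseconds."""
    return tso >> LOGICAL_BITS

def checkpoint_lag_seconds(checkpoint_tso: int, now_ms: int) -> float:
    """Changefeed lag = wall clock minus the checkpoint's physical time."""
    return (now_ms - tso_physical_ms(checkpoint_tso)) / 1000.0
```

With this, "lag reached more than 10min" simply means `checkpoint_lag_seconds` went above 600 while the partition (and the recovery tail) lasted.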

The PD leader changed over normally.
(screenshot: PD leader change)

Versions of the cluster

git hash : 1e2f277

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

@Lily2025 Lily2025 added area/dm Issues or PRs related to DM. type/bug The issue is confirmed as a bug. labels Jun 14, 2023
@Lily2025
Author

/remove-area dm
/area ticdc

@ti-chi-bot ti-chi-bot bot added area/ticdc Issues or PRs related to TiCDC. and removed area/dm Issues or PRs related to DM. labels Jun 14, 2023
@fubinzh

fubinzh commented Jun 14, 2023

(screenshot)

@fubinzh

fubinzh commented Jun 15, 2023

/severity major

@nongfushanquan
Contributor

/assign @asddongmen

@ti-chi-bot
Contributor

ti-chi-bot bot commented Sep 21, 2023

@nongfushanquan: GitHub didn't allow me to assign the following users: asddongmen.

Note that only pingcap members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @asddongmen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Lily2025 Lily2025 changed the title ticdc lag reached more than 10min when run ha_pdleader_to_pdfollower(all)_network_partition ticdc lag reached more than 10min when inject network partition betweent pdleader and pdfollowers Nov 8, 2023
@Lily2025
Author

Injected a network partition between the PD leader and PD followers.
(screenshot)

Two TiCDC instances crashed.
(screenshots)

@Lily2025 Lily2025 changed the title ticdc lag reached more than 10min when inject network partition betweent pdleader and pdfollowers ticdc restart and changefeed lag reached more than 10min when inject network partition betweent pdleader and pdfollowers Feb 29, 2024
@Lily2025
Author

Injected a network partition between the TiCDC owner and all other pods; TiCDC restarted.
chaos start ~ chaos end: 2024/02/28 19:05:36 ~ 2024/02/28 19:08:36
(screenshot)

TiCDC logs:
[2024/02/28 19:08:37.410 +08:00] [ERROR] [tso_dispatcher.go:562] ["[tso] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.200.49.116:2379: i/o timeout\""]
[2024/02/28 19:08:37.410 +08:00] [ERROR] [pd.go:228] ["updateTS error"] [txnScope=global] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:118\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:147\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:152\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [client.go:319] ["[pd] http client closed"] [source=tikv-driver]
[2024/02/28 19:08:37.413 +08:00] [WARN] [upstream.go:299] ["etcd session close failed"] [error="etcdserver: requested lease not found"]
[2024/02/28 19:08:37.413 +08:00] [INFO] [upstream.go:305] ["upstream closed"] [upstreamID=7340490029962833542]
[2024/02/28 19:08:38.370 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:38.379 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:47.414 +08:00] [WARN] [check.go:88] ["check TiKV version failed"] [error="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/version.CheckStoreVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:209\ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:83\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:179\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1016] ["[pd] update member urls"] [old-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd:2379]"] [new-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1043] ["[pd] switch leader"] [new-leader=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [old-leader=]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:525] ["[pd] init cluster id"] [cluster-id=7340490029962833542]
[2024/02/28 19:08:47.427 +08:00] [INFO] [client.go:606] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.427 +08:00] [INFO] [tso_client.go:231] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/]
[2024/02/28 19:08:47.428 +08:00] [INFO] [tso_dispatcher.go:323] ["[tso] tso dispatcher created"] [dc-location=global]
[2024/02/28 19:08:47.428 +08:00] [INFO] [client.go:654] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tikv_driver.go:200] ["using API V1."]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tso_dispatcher.go:441] ["[tso] tso stream is not ready"] [dc=global]
[2024/02/28 19:08:48.371 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:48.380 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:202] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:498] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-addr=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [error="[PD:client:ErrClientGetTSO]get TSO failed, %v: rpc error: code = Canceled desc = context canceled"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [capture.go:335] ["reset capture failed"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:328] ["the capture routine has exited"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [ERROR] [server.go:298] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:298\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:139] ["cdc server exits with error"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:707] ["message router closed"] [captureID=a277c9b2-c0b6-4ef0-aa9d-3d51b50cd83f]
[2024/02/28 19:08:57.432 +08:00] [INFO] [server.go:424] ["sort engine manager closed"] [duration=2.032547ms]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:577] ["[pd] exit member loop due to context canceled"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [resource_manager_client.go:295] ["[resource manager] exit resource token dispatcher"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:240] ["exit tso dispatcher loop"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:410] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:186] ["exit tso requests cancel loop"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:134] ["closing tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:59.752 +08:00] [INFO] [helper.go:54] ["init log"] [file=/var/lib/ticdc/log/ticdc.log] [level=info]
[2024/02/28 19:08:59.752 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai]
[2024/02/28 19:08:59.752 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v8.0.0-alpha] [git-hash=25ce29c2a1802bbb4cd26008f322728959a91f7a] [git-branch=heads/refs/tags/v8.0.0-alpha] [utc-build-time="2024-02-27 11:37:29"] [go-version="go version go1.21.6 linux/amd64"] [failpoint-build=false]
[2024/02/28 19:08:59.752 +08:00] [INFO] [server.go:125] ["CDC server created"] [pd="[http://tc-pd:2379/]"] [config="{"addr":"0.0.0.0:8301","advertise-addr":"tc-ticdc-1.tc-ticdc-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:8301","log-file":"/var/lib/ticdc/log/ticdc.log","log-level":"info","log":{"file":{"max-size":301,"max-days":0,"max-backups":0},"error-output":"stderr"},"data-dir":"","gc-ttl":86400,"tz":"System","capture-session-ttl":10,"owner-flush-interval":50000000,"processor-flush-interval":50000000,"sorter":{"sort-dir":"/tmp/sorter","cache-size-in-mb":128},"security":{"ca-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"mtls":false,"client-user-required":false,"client-allowed-user":null},"kv-client":{"enable-multiplexing":true,\
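When sifting through bursts of logs like the above, a quick tally of entries per level helps pinpoint when the PD client started failing relative to the chaos window. A minimal sketch (the line layout is taken from the samples in this comment):

```python
import re
from collections import Counter

# Matches the "[timestamp] [LEVEL] [source] ..." prefix used by the TiCDC logs above.
LOG_PREFIX = re.compile(r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\]')

def count_by_level(lines):
    """Count log lines per level (ERROR/WARN/INFO); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        m = LOG_PREFIX.match(line)
        if m:
            counts[m.group("level")] += 1
    return dict(counts)
```

Grouping by `ts` instead of `level` (the same regex captures both) shows the error spike aligning with the 19:05:36–19:08:36 chaos window.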

@asddongmen asddongmen self-assigned this Feb 29, 2024
@flowbehappy
Collaborator

@asddongmen will check whether it can be addressed by etcd-io/etcd#17465 (comment). If not, I suggest we address it in the long term.

@asddongmen
Contributor

After the merge of #10881, the checkpointTs lag during pd-leader-io-hang cases was reduced to less than 120s, meeting the requirement.
(screenshot: checkpointTs lag graph)

Contributor

ti-chi-bot bot commented May 24, 2024

@Lily2025: Reopened this issue.

In response to this:

/reopen


@Lily2025
Author

/remove-type bug
/type enhancement

@ti-chi-bot ti-chi-bot bot added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. labels May 24, 2024
@Lily2025 Lily2025 changed the title ticdc restart and changefeed lag reached more than 10min when inject network partition betweent pdleader and pdfollowers changefeed lag reached more than 10min when inject network partition betweent pdleader and pdfollowers May 24, 2024
@Lily2025
Author

closed
