Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDC panic: The CRTs must be greater than the resolvedTs #1576

Closed
Tammyxia opened this issue Mar 30, 2021 · 6 comments
Closed

CDC panic: The CRTs must be greater than the resolvedTs #1576

Tammyxia opened this issue Mar 30, 2021 · 6 comments
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. bug-from-internal-test Bugs found by internal testing. severity/major subject/replication-interruption Denotes an issue or pull request is related to replication interruption. type/bug The issue is confirmed as a bug.

Comments

@Tammyxia
Copy link

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • Create tidb cluster + 3x cdc server binary version + 1x changfeed, sysben insert data to upstream
  • Restart all pd: tiup cluster restart tidb-test -R pd
  • Aftert 5 minutes Find cdc server can recovery and changfeed can continue to sync.
  • Repeat step 2,3 per 10 minutes, find cdc oom killed as a known issue.
  • Start CDC server aggain: $ cdc server --pd=http://34.217.25.239:2379 --advertise-addr=52.24.165.124:8300 --addr=0.0.0.0:8300 --log-file=cdc.log --sort-dir=/tmp/cdc-lx
  1. What did you expect to see?

  2. What did you see instead?
    cdc server exit with error message:

  • {"level":"warn","ts":"2021-03-30T03:30:33.586Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-cf42803b-6595-4cba-aa32-069309f0e633/34.217.25.239:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    panic: The CRTs must be greater than the resolvedTs

goroutine 48868 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xd1f5ae1c80, 0xcf90f21d00, 0x4, 0x4)
go.uber.org/zap@v1.16.0/zapcore/entry.go:234 +0x567
go.uber.org/zap.(*Logger).Panic(0xc3fc2da8a0, 0x2c78d20, 0x2c, 0xcf90f21d00, 0x4, 0x4)
go.uber.org/zap@v1.16.0/logger.go:226 +0x7f
github.com/pingcap/log.Panic(0x2c78d20, 0x2c, 0xcf90f21d00, 0x4, 0x4)
github.com/pingcap/log@v0.0.0-20201112100606-8f1e84a3abc8/global.go:50 +0xff
github.com/pingcap/ticdc/cdc/puller.(*pullerImpl).Run.func5.1(0xd23faa1180, 0xd23faa1180, 0x1b)
github.com/pingcap/ticdc@/cdc/puller/puller.go:194 +0x39a
github.com/pingcap/ticdc/cdc/puller.(*pullerImpl).Run.func5(0xc3d6ad2768, 0x0)
github.com/pingcap/ticdc@/cdc/puller/puller.go:253 +0x33f
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc3d54a04e0, 0xc3d6aa40c0)
golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:54 +0x66

  1. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      Release Version: v5.0.0
      Edition: Community
      Git Commit Hash: b18eae98087148ad967d7fcbf6586ed7b9607c4b
      Git Branch: heads/refs/tags/v5.0.0
      UTC Build Time: 2021-03-24 15:31:59
      GoVersion: go1.13
      Race Enabled: false
      TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
      Check Table Before Drop: false |
      
      
    • TiCDC version (execute cdc version):

      Release Version: v5.0.0-nightly-15-g8968aae
      Git Commit Hash: 8968aae98bc9832afb0c484a9e01bb8e0c665ba7
      Git Branch: HEAD
      UTC Build Time: 2021-03-29 08:31:59
      
@Tammyxia Tammyxia added the type/bug The issue is confirmed as a bug. label Mar 30, 2021
@amyangfei
Copy link
Contributor

amyangfei commented Mar 30, 2021

  • [2021/03/30 03:28:35.151 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]

    -> [2021/03/30 03:30:25.007 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [checkpointTs=423906485528952842]

[2021/03/29 11:44:22.648 +00:00] [WARN] [client.go:1403] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=5825941] [span="[748000000000000fff245f728000000000ff019fae0000000000fa, 748000000000000fff245f728000000000ff01e1390000000000fa)"] [duration=37.35s] [resolvedTs=423891623960903690]
[2021/03/30 03:28:13.770 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423891533389365346]
[2021/03/30 03:28:35.150 +00:00] [INFO] [client.go:912] ["EventFeed disconnected"] [regionID=5825941] [requestID=14902] [span="[748000000000000fff245f728000000000ff019fae0000000000fa, 748000000000000fff245f728000000000ff01e1390000000000fa)"] [checkpoint=423906485846933529] [error="[CDC:ErrEventFeedEventError]epoch_not_match:<> "]
[2021/03/30 03:28:35.150 +00:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]
[2021/03/30 03:28:35.151 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]
[2021/03/30 03:28:35.151 +00:00] [INFO] [client.go:817] ["start new request"] [request="{\"header\":{\"cluster_id\":6942784005347452364,\"ticdc_version\":\"5.0.0-nightly\"},\"region_id\":5825941,\"region_epoch\":{\"conf_ver\":5,\"version\":1008},\"checkpoint_ts\":423906485846933529,\"start_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AZ+uAAAAAAD6\",\"end_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AeE5AAAAAAD6\",\"request_id\":117402,\"extra_op\":1,\"Request\":null}"] [addr=44.234.9.155:20160]
[2021/03/30 03:28:36.520 +00:00] [INFO] [region_range_lock.go:281] ["lock range blocked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [blockedBy="[\"regionID: 5825941, ver: 1008, start: 748000000000000fff245f728000000000ff019fae0000000000fa, end: 748000000000000fff245f728000000000ff01e1390000000000fa\"]"]
[2021/03/30 03:30:25.002 +00:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]
[2021/03/30 03:30:25.005 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]
[2021/03/30 03:30:25.006 +00:00] [INFO] [region_range_lock.go:281] ["lock range blocked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [blockedBy="[\"regionID: 5825941, ver: 1008, start: 748000000000000fff245f728000000000ff019fae0000000000fa, end: 748000000000000fff245f728000000000ff01e1390000000000fa\"]"]
[2021/03/30 03:30:25.007 +00:00] [INFO] [client.go:716] ["cannot get rpcCtx, retry span"] [regionID=5825941] [span="[748000000000000fff245f728000000000ff019fae0000000000fa, 748000000000000fff245f728000000000ff01e1390000000000fa)"]
[2021/03/30 03:30:25.007 +00:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1390000000000fa] [checkpointTs=423906485846933529]
[2021/03/30 03:30:25.007 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [checkpointTs=423906485528952842]
[2021/03/30 03:30:25.008 +00:00] [INFO] [region_cache.go:426] ["invalidate current region, because others failed on same store"] [region=5825941] [store=44.234.9.155:20160]
[2021/03/30 03:30:25.008 +00:00] [INFO] [client.go:716] ["cannot get rpcCtx, retry span"] [regionID=5825941] [span="[748000000000000fff245f728000000000ff019fae0000000000fa, 748000000000000fff245f728000000000ff0222c00000000000fa)"]
[2021/03/30 03:30:25.008 +00:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [checkpointTs=423906485528952842]
[2021/03/30 03:30:33.758 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff01e1340000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [checkpointTs=423906485528952842]
[2021/03/30 03:30:33.807 +00:00] [INFO] [client.go:817] ["start new request"] [request="{\"header\":{\"cluster_id\":6942784005347452364,\"ticdc_version\":\"5.0.0-nightly\"},\"region_id\":5825941,\"region_epoch\":{\"conf_ver\":5,\"version\":1012},\"checkpoint_ts\":423906485528952842,\"start_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AeE0AAAAAAD6\",\"end_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AiLAAAAAAAD6\",\"request_id\":141344,\"extra_op\":1,\"Request\":null}"] [addr=44.234.9.155:20160]
[2021/03/30 03:30:33.815 +00:00] [INFO] [region_range_lock.go:246] ["tryLockRange stale"] [lockID=387] [regionID=5825941] [startKey=748000000000000fff245f728000000000ff01e1340000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [allOverlapping="[\"regionID: 5825941, ver: 1012, start: 748000000000000fff245f728000000000ff01e1340000000000fa, end: 748000000000000fff245f728000000000ff0222c00000000000fa\"]"]
[2021/03/30 03:30:33.819 +00:00] [INFO] [client.go:632] ["request expired"] [regionID=5825941] [span="[748000000000000fff245f728000000000ff01e1340000000000fa, 748000000000000fff245f728000000000ff0222c00000000000fa)"] [retrySpans="[]"]
  • regionID=6375829, [748000000000000fff245f728000000000ff01e1390000000000fa, 748000000000000fff245f728000000000ff0222c00000000000fa), checkpointTs=423906485528952842
[2021/03/30 03:28:36.520 +00:00] [INFO] [client.go:912] ["EventFeed disconnected"] [regionID=6375829] [requestID=117400] [span="[748000000000000fff245f728000000000ff01e1390000000000fa, 748000000000000fff245f728000000000ff0222c00000000000fa)"] [checkpoint=423906485528952842] [error="[CDC:ErrEventFeedEventError]region_not_found:<region_id:6375829 > "]
[2021/03/30 03:28:33.253 +00:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=387] [regionID=6375829] [startKey=748000000000000fff245f728000000000ff01e1390000000000fa] [endKey=748000000000000fff245f728000000000ff0222c00000000000fa] [checkpointTs=423906485528952842]
  • [2021/03/30 03:30:33.757 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=6879765] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1340000000000fa] [checkpointTs=423906485528952842]
[2021/03/30 03:30:33.757 +00:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=387] [regionID=6879765] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1340000000000fa] [checkpointTs=423906485528952842]
[2021/03/30 03:30:33.758 +00:00] [INFO] [client.go:766] ["creating new stream to store to send request"] [regionID=6879765] [requestID=141190] [storeID=5] [addr=44.234.9.155:20160]
[2021/03/30 03:30:33.805 +00:00] [INFO] [client.go:817] ["start new request"] [request="{\"header\":{\"cluster_id\":6942784005347452364,\"ticdc_version\":\"5.0.0-nightly\"},\"region_id\":6879765,\"region_epoch\":{\"conf_ver\":5,\"version\":1012},\"checkpoint_ts\":423906485528952842,\"start_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AZ+uAAAAAAD6\",\"end_key\":\"dIAAAAAAAA//JF9ygAAAAAD/AeE0AAAAAAD6\",\"request_id\":141190,\"extra_op\":1,\"Request\":null}"] [addr=44.234.9.155:20160]
[2021/03/30 03:30:33.806 +00:00] [INFO] [region_range_lock.go:246] ["tryLockRange stale"] [lockID=387] [regionID=6879765] [startKey=748000000000000fff245f728000000000ff019fae0000000000fa] [endKey=748000000000000fff245f728000000000ff01e1340000000000fa] [allOverlapping="[\"regionID: 6879765, ver: 1012, start: 748000000000000fff245f728000000000ff019fae0000000000fa, end: 748000000000000fff245f728000000000ff01e1340000000000fa\"]"]
[2021/03/30 03:30:33.808 +00:00] [INFO] [client.go:632] ["request expired"] [regionID=6879765] [span="[748000000000000fff245f728000000000ff019fae0000000000fa, 748000000000000fff245f728000000000ff01e1340000000000fa)"] [retrySpans="[]"]
[2021/03/30 03:31:19.651 +00:00] [PANIC] [puller.go:194] ["The CRTs must be greater than the resolvedTs"] [row="{\"OpType\":3,\"Key\":null,\"Value\":null,\"OldValue\":null,\"StartTs\":0,\"CRTs\":423906485528952842,\"RegionID\":6879765}"] [CRTs=423906485528952842] [resolvedTs=423906485846933529] [tableID=3876] [stack="github.com/pingcap/ticdc/cdc/puller.(*pullerImpl).Run.func5.1\n\tgithub.com/pingcap/ticdc@/cdc/puller/puller.go:194\ngithub.com/pingcap/ticdc/cdc/puller.(*pullerImpl).Run.func5\n\tgithub.com/pingcap/ticdc@/cdc/puller/puller.go:253\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]

@amyangfei amyangfei added subject/replication-interruption Denotes an issue or pull request is related to replication interruption. bug-from-internal-test Bugs found by internal testing. labels Apr 26, 2021
@amyangfei amyangfei self-assigned this May 25, 2021
@amyangfei
Copy link
Contributor

amyangfei commented May 25, 2021

Still under investigation, the bug can't be reproduced

@lonng
Copy link
Contributor

lonng commented May 27, 2021

I change the severity from critical to major because this bug can't be reproduced.

@amyangfei
Copy link
Contributor

another case cdc-2021-06-10T03-33-13.437.log.tar.gz

[2021/06/10 01:47:48.893 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-master-dirty] [git-hash=8bcab34127d7268e83d39cc677fcee9dcab44be3] [git-branch=fix_etcd_txn_too_big] [utc-build-time="2021-06-09 10:13:24"] [go-version="go version go1.16.4 linux/amd64"] [failpoint-build=false]

@amyangfei
Copy link
Contributor

amyangfei commented Jun 10, 2021

another case cdc-2021-06-10T03-33-13.437.log.tar.gz

[2021/06/10 01:47:48.893 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-master-dirty] [git-hash=8bcab34127d7268e83d39cc677fcee9dcab44be3] [git-branch=fix_etcd_txn_too_big] [utc-build-time="2021-06-09 10:13:24"] [go-version="go version go1.16.4 linux/amd64"] [failpoint-build=false]

After investigation, this case tends to be the same cause with bug found in the original issue.

A potential cause is a table merges more region beyond its original region range, which causes resolved fallen back. If this is the root cause, no data loss happens

@amyangfei
Copy link
Contributor

This bug has been fixed by #2038 and #2077

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. bug-from-internal-test Bugs found by internal testing. severity/major subject/replication-interruption Denotes an issue or pull request is related to replication interruption. type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

5 participants