From 65f68351784ed2aa3669e9ae927d5852a447d2c1 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 29 Mar 2021 19:43:25 +0800
Subject: [PATCH] TiCDC: add troubleshooting info about DDL failures (#5016)

---
 ticdc/troubleshoot-ticdc.md | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 7ae94a4ec2e6f..f1cd93fc844d3 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -295,6 +295,10 @@ If the Old Value feature is not enabled, you cannot tell whether a Row Changed E
 
 For more information, refer to [Open protocol Row Changed Event format](/ticdc/ticdc-open-protocol.md#row-changed-event).
 
+## How much PD storage does TiCDC use?
+
+TiCDC uses etcd in PD to store and regularly update the metadata. Because the default compaction interval of etcd MVCC in PD is one hour, the amount of PD storage that TiCDC uses is proportional to the number of metadata versions generated within this hour. However, in v4.0.5, v4.0.6, and v4.0.7, TiCDC writes to etcd frequently. For example, if 1,000 tables are created or scheduled within an hour, the etcd storage might be used up, and the `etcdserver: mvcc: database space exceeded` error is returned. After this error occurs, you need to clean up the etcd storage. See [etcd maintenance space-quota](https://etcd.io/docs/v3.4.0/op-guide/maintenance/#space-quota) for details. It is recommended to upgrade your cluster to v4.0.9 or a later version.
+
 ## Does TiCDC support replicating large transactions? Is there any risk?
 
 TiCDC provides partial support for large transactions (more than 5 GB in size). Depending on different scenarios, the following risks might exist:
@@ -347,6 +351,21 @@ Since v4.0.8, if the `canal` or `maxwell` protocol is used for output in a chang
 cdc cli changefeed resume -c test-cf --pd=http://10.0.10.25:2379
 ```
 
-## How much PD storage does TiCDC use?
+## How can I find out whether a DDL statement fails to execute in the downstream during TiCDC replication? How can I resume the replication?
 
-TiCDC uses etcd in PD to store and regularly update the metadata. Because the time interval between the MVCC of etcd and PD's default compaction is one hour, the amount of PD storage that TiCDC uses is proportional to the amount of metadata versions generated within this hour. However, in v4.0.5, v4.0.6, and v4.0.7, TiCDC has a problem of frequent writing, so if there are 1000 tables created or scheduled in an hour, it then takes up all the etcd storage and returns the `etcdserver: mvcc: database space exceeded` error. You need to clean up the etcd storage after getting this error. See [etcd maintaince space-quota](https://etcd.io/docs/v3.4.0/op-guide/maintenance/#space-quota) for details. It is recommended to upgrade your cluster to v4.0.9 or later versions.
+If a DDL statement fails to execute, the replication task (changefeed) automatically stops. The checkpoint-ts is the DDL statement's finish-ts minus one. If you want TiCDC to retry executing this statement in the downstream, use `cdc cli changefeed resume` to resume the replication task. For example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed resume -c test-cf --pd=http://10.0.10.25:2379
+```
+
+If you want to skip this failed DDL statement, set the start-ts of the changefeed to the checkpoint-ts (the timestamp at which the DDL statement fails) plus one. For example, if the checkpoint-ts at which the DDL statement fails is `415241823337054209`, execute the following commands to skip this DDL statement:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed update -c test-cf --pd=http://10.0.10.25:2379 --start-ts 415241823337054210
+cdc cli changefeed resume -c test-cf --pd=http://10.0.10.25:2379
+```
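As a side note on the skip procedure in this patch, the start-ts arithmetic can be sketched in shell. The checkpoint-ts value is the hypothetical one from the example above, and the comment about the TSO layout assumes TiDB's convention of packing the physical time (Unix milliseconds) into the bits above the 18-bit logical counter:

```shell
# Hypothetical checkpoint-ts from the example above.
checkpoint_ts=415241823337054209

# To skip the failed DDL statement, the new start-ts is checkpoint-ts + 1.
start_ts=$((checkpoint_ts + 1))

# A TiDB TSO stores the physical time (Unix milliseconds) in the bits above
# the 18-bit logical counter; extracting it is a quick sanity check that the
# timestamp belongs to the expected time window.
physical_ms=$((checkpoint_ts >> 18))

echo "use --start-ts=${start_ts}"   # prints: use --start-ts=415241823337054210
```

The printed value is what you would pass to `cdc cli changefeed update --start-ts` before resuming the changefeed.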