ticdc: add a monitor metric "Changefeed catch-up ETA" (#10462)

pingcap · Sep 16, 2022 · b83ede8
1 parent 5757e28
commit b83ede8
Showing 2 changed files with 14 additions and 8 deletions.
diff --git a/media/ticdc/ticdc-dashboard-changefeed-4.png b/media/ticdc/ticdc-dashboard-changefeed-4.png
diff --git a/ticdc/monitor-ticdc.md b/ticdc/monitor-ticdc.md
@@ -45,27 +45,33 @@ The description of each metric in the **Server** panel is as follows:
 The following is an example of the **Changefeed** panel:
 
 ![TiCDC Dashboard - Changefeed metrics 1](/media/ticdc/ticdc-dashboard-changefeed-1.png)
-![TiCDC Dashboard - Changefeed metrics 2](/media/ticdc/ticdc-dashboard-changefeed-2.png)
-![TiCDC Dashboard - Changefeed metrics 3](/media/ticdc/ticdc-dashboard-changefeed-3.png)
-
-The description of each metric in the **Changefeed** panel is as follows:
 
 - Changefeed table count: The number of tables that each TiCDC node needs to replicate in the replication task
 - Processor resolved ts: The timestamps that have been resolved in the TiCDC cluster
 - Table resolved ts: The replication progress of each table in the replication task
 - Changefeed checkpoint: The progress of replicating data to the downstream. Normally, the green bars are connected to the yellow line
 - PD etcd requests/s: The number of requests that a TiCDC node sends to PD per second
-- Exit error count: The number of errors that interrupt the replication task per minute
+- Exit error count/m: The number of errors that interrupt the replication task per minute
 - Changefeed checkpoint lag: The progress lag of data replication (the unit is second) between the upstream and the downstream
-- Changefeed resolved ts lag: The progress lag of data replication (the unit is second) between the upstream and TiCDC nodes
-- Flush sink duration: The histogram of the time spent by TiCDC asynchronously flushing data to the downstream
-- Flush sink duration percentile: The time (P95, P99, and P999) spent by TiCDC asynchronously flushing data to the downstream within one second
+- Processor resolved ts lag: The progress lag of data replication (the unit is second) between the upstream and TiCDC nodes
+
+![TiCDC Dashboard - Changefeed metrics 2](/media/ticdc/ticdc-dashboard-changefeed-2.png)
+
 - Sink write duration: The histogram of the time spent by TiCDC writing a transaction change to the downstream
 - Sink write duration percentile: The time (P95, P99, and P999) spent by TiCDC writing a transaction change to the downstream within one second
+- Flush sink duration: The histogram of the time spent by TiCDC asynchronously flushing data to the downstream
+- Flush sink duration percentile: The time (P95, P99, and P999) spent by TiCDC asynchronously flushing data to the downstream within one second
+
+![TiCDC Dashboard - Changefeed metrics 3](/media/ticdc/ticdc-dashboard-changefeed-3.png)
+
 - MySQL sink conflict detect duration: The histogram of the time spent on detecting MySQL sink conflicts
 - MySQL sink conflict detect duration percentile: The time (P95, P99, and P999) spent on detecting MySQL sink conflicts within one second
 - MySQL sink worker load: The workload of MySQL sink workers of TiCDC nodes
 
+![TiCDC Dashboard - Changefeed metrics 4](/media/ticdc/ticdc-dashboard-changefeed-4.png)
+
+- Changefeed catch-up ETA: The estimated time needed for the replication task to catch up with the upstream cluster data. When the upstream write speed is faster than the TiCDC replication speed, the metric might be extremely large. Because TiCDC replication speed is subject to many factors, this metric is for reference only and might not be the actual replication time.
+
 ## Events
 
 The following is an example of the **Events** panel: