dm: checkpoint mechanism and common QA #8749

Open
wants to merge 18 commits into
base: master

buchuitoudegou
Contributor

First-time contributors' checklist

What is changed, added or deleted? (Required)

  • Add an introduction to the DM checkpoint mechanism and a Q&A covering common questions

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v6.0 (TiDB 6.0 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)
  • v5.2 (TiDB 5.2 versions)
  • v5.1 (TiDB 5.1 versions)
  • v5.0 (TiDB 5.0 versions)
  • v4.0 (TiDB 4.0 versions)
  • v3.1 (TiDB 3.1 versions)
  • v3.0 (TiDB 3.0 versions)
  • v2.1 (TiDB 2.1 versions)

What is the related PR or file link(s)?

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@CLAassistant

CLAassistant commented Mar 24, 2022

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot added missing-translation-status This PR does not have translation status info. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 24, 2022
@buchuitoudegou buchuitoudegou changed the title checkpoint mechanism and QA dm: checkpoint mechanism and common QA Mar 24, 2022
@TomShawn TomShawn added type/enhancement The issue or PR belongs to an enhancement. translation/doing This PR’s assignee is translating this PR. and removed missing-translation-status This PR does not have translation status info. labels Mar 24, 2022
@sunzhaoyang
Contributor

/cc @lance6716

buchuitoudegou and others added 2 commits March 28, 2022 14:14
Co-authored-by: sunzy <sunzy2@gmail.com>
Co-authored-by: sunzy <sunzy2@gmail.com>
Contributor

@lance6716 lance6716 left a comment

rest lgtm

dm/dm-checkpoint.md

## Q&A

### Why does a task started from the configuration file not use the position defined in that file?
Contributor

Users who have this question about the position might never find their way to the checkpoint page. Consider putting it in the general FAQ instead.

Contributor

Let's keep it here, and I will mention it elsewhere as well. That way, users can learn about this issue as soon as they start learning.

buchuitoudegou and others added 7 commits March 28, 2022 14:32
Co-authored-by: lance6716 <lance6716@gmail.com>
Co-authored-by: lance6716 <lance6716@gmail.com>
Co-authored-by: lance6716 <lance6716@gmail.com>
Contributor

@sunzhaoyang sunzhaoyang left a comment

LGTM

@shichun-0415 shichun-0415 self-requested a review March 29, 2022 07:44
@ran-huang ran-huang self-assigned this Apr 12, 2022
@ran-huang ran-huang requested review from ran-huang and removed request for shichun-0415 April 12, 2022 06:32
dm/dm-checkpoint.md
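- table checkpoints: the binlog position replicated for each table, together with the table schema at that point (represented as TiDB TableInfo JSON).
- safe mode exit point: if it is not empty, binlog events before this checkpoint might still be in the replication queue, but events after it are definitely not. When a task restarts after recovering from an error, this point clearly marks the binlog events that might already have been replicated. Replication jobs in the range (table checkpoint, safe mode exit point] must therefore run with safe mode enabled, **to avoid downstream errors caused by repeated processing**. After the restart, once the binlog position passes the safe mode exit point, safe mode is turned off.

From their semantics, we can tell that the order of these three pieces of information is: global checkpoint ≤ table checkpoints ≤ safe mode exit point.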
Contributor

Could you specify what this order refers to? For example, the order in which DM reads the checkpoints, or a priority order?

Contributor Author

It refers to the order of the binlog positions.

Contributor

I think Ranran probably still doesn't follow. Do you mean that, among these three pieces of information, the binlog position saved in the global checkpoint is always less than or equal to the positions in the table checkpoints?

Contributor Author

Yes. Saying that "position xxx is less than yyy" may sound a bit odd. In more detail:

global checkpoint: binlog-0001
table checkpoints: [binlog-0003, binlog-0004], # one checkpoint per table
safe mode exit point: binlog-0007 
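
The ordering in the example above can be sketched in a few lines of Python. This is only an illustration, not DM's implementation: `Position`, `check_order`, and `needs_safe_mode` are hypothetical names, and a binlog position is simplified to a file sequence number plus an offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Position:
    """Simplified binlog position (hypothetical): file sequence number, then byte offset."""
    file_seq: int
    offset: int = 0


def check_order(global_cp, table_cps, exit_point):
    """Return True if global checkpoint <= every table checkpoint <= safe mode exit point."""
    if any(cp < global_cp for cp in table_cps):
        return False
    if exit_point is not None and any(cp > exit_point for cp in table_cps):
        return False
    return True


def needs_safe_mode(pos, table_cp, exit_point):
    """Safe mode applies to positions in the range (table checkpoint, safe mode exit point]."""
    return exit_point is not None and table_cp < pos <= exit_point


# Values taken from the example: binlog-0001, [binlog-0003, binlog-0004], binlog-0007.
global_cp = Position(1)
table_cps = [Position(3), Position(4)]
exit_point = Position(7)
```

With these values, `check_order(global_cp, table_cps, exit_point)` holds, and an event at `Position(5)` for the table whose checkpoint is `Position(3)` falls inside the safe mode window.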

dm/dm-checkpoint.md
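1. After an XID event or a rows event is processed, the checkShouldFlush function is called. If no flush has happened within a certain interval, DM tries to write the checkpoint to the downstream.
2. After a DDL event is processed, DM tries to write the in-memory checkpoint to the downstream.
3. When trying to write the checkpoint, DM first checks whether the values of the global checkpoint and table checkpoints have changed, and skips the write if they have not.
4. safe mode exit point: flushed only when the syncer exits, after waiting for the replication queue to be written to the downstream.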
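
Rules 1 to 3 above can be sketched roughly as follows. This is a minimal illustration under assumptions, not DM's code: the class `CheckpointFlusher`, its fields, and the event type strings are hypothetical, the interval is an arbitrary number of seconds, and the safe mode exit point (rule 4) is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointFlusher:
    """Hypothetical model of when an in-memory checkpoint is flushed downstream."""
    interval: float                                  # assumed flush interval, in seconds
    last_flush: float = 0.0                          # time of the last successful flush
    memory: dict = field(default_factory=dict)       # in-memory checkpoints
    flushed: dict = field(default_factory=dict)      # last persisted snapshot

    def should_flush(self, event_type, now):
        # Rule 2: a DDL event always triggers a flush attempt.
        if event_type == "ddl":
            return True
        # Rule 1: XID and rows events trigger an attempt only after the interval has elapsed.
        if event_type in ("xid", "rows"):
            return now - self.last_flush >= self.interval
        return False

    def try_flush(self, event_type, now):
        """Return True if the checkpoint was actually written."""
        if not self.should_flush(event_type, now):
            return False
        # Rule 3: skip the write when nothing has changed since the last flush.
        if self.memory == self.flushed:
            return False
        self.flushed = dict(self.memory)
        self.last_flush = now
        return True
```

For example, with `interval=30.0`, a rows event at time 10 does not flush, while a DDL event at the same time does, provided the in-memory checkpoint differs from the persisted one.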
Contributor

Could this sentence be more explicit about what is being flushed?

Contributor

It means flushing the in-memory checkpoint information to persistent storage.

Contributor

"flush" is a fairly common term, so it probably doesn't need a special explanation.

dm/dm-checkpoint.md

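Checkpoints are usually stored in the downstream database. If the configuration file does not define a database name for meta information, they are stored in the dm_meta database by default. The replication progress of each task is stored in a table named after the task. You can clear checkpoints by clearing the contents of these tables (not recommended), or by adding the `--remove-meta` flag the next time you start the task to clear the existing checkpoints.

### How does the async checkpoint work?
Contributor

The body text doesn't mention synchronous and asynchronous checkpoints. Does that need to be added?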

Contributor

Line 67 is where the synchronous checkpoint is explained first.

Contributor Author

Synchronous and asynchronous checkpoints share the same mechanism. They differ only in how they are persisted to the database.

ran-huang and others added 4 commits April 20, 2022 11:19
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2022
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2022
dm/dm-checkpoint.md

## In-memory checkpoint

### Task start/restart
Contributor

What does "### Task start/restart" mean? The heading and the content don't seem to correspond directly.

Contributor Author

It means the initialization or recovery work that DM performs on the in-memory checkpoint when a task starts or restarts. What would be a better way to word it?

Comment on lines +24 to +29
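Because every read and write of the flushed checkpoint is expensive, the in-memory checkpoint serves, in a sense, as a cache of the flushed checkpoint, which reduces the cost of reading and writing the flushed checkpoint.

When DM starts a task, it reads the task's flushed checkpoint from the database and restores it into the in-memory checkpoint:

* If the flushed checkpoint exists, the task is a restarted task. The binlog replication position is set to the global checkpoint, which serves as the starting position of the next replication.
* If the flushed checkpoint does not exist, the task is starting for the first time or restarting replication from scratch, and DM creates a new checkpoint in the downstream.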
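
The two startup cases above can be sketched as follows. This is an illustration under assumptions, not DM's implementation: `restore_checkpoint` is a hypothetical function, and a plain dict keyed by task name stands in for the downstream checkpoint table.

```python
import copy


def restore_checkpoint(store, task):
    """Return (in_memory_checkpoint, is_restart) for a task.

    `store` is a dict standing in for the downstream checkpoint storage (an assumption).
    """
    flushed = store.get(task)
    if flushed is not None:
        # Restarted task: the in-memory checkpoint is restored from the flushed one,
        # and replication resumes from its global checkpoint.
        return copy.deepcopy(flushed), True
    # First start (or restart from scratch): create a new, empty checkpoint downstream.
    fresh = {"global": None, "tables": {}}
    store[task] = fresh
    return copy.deepcopy(fresh), False
```

The deep copy keeps the in-memory checkpoint independent of the persisted one, so later in-memory updates do not change the stored state until an explicit flush.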
Contributor

One confusing point here: the heading is "In-memory checkpoint", but most of the content is about the flushed checkpoint. The document as a whole doesn't clearly explain the differences and the relationship between these two checkpoints.

Contributor Author

The beginning of the document explains what the two kinds of checkpoint mean. In short:

  • In-memory: the binlog position that has been received and processed
  • flushed: the binlog position that has been successfully replicated to the downstream

In fact, this section is explaining the relationship between the two checkpoints: the in-memory checkpoint is restored from the flushed checkpoint.


### When the in-memory checkpoint is updated

- global checkpoint: when an XID event or a DDL event is encountered, the current position is written to the global checkpoint.
Contributor

"XID event" has never appeared in the docs and seems to be a MySQL term. To avoid confusing readers, I suggest adding it to the DM glossary and linking to it.

buchuitoudegou and others added 2 commits May 5, 2022 18:26
Signed-off-by: Ran <huangran@pingcap.com>
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2022
@ti-chi-bot
Member

@buchuitoudegou: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ran-huang ran-huang added the needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. label Dec 29, 2022
@ran-huang ran-huang added needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.2 labels Jun 28, 2023
@qiancai
Collaborator

qiancai commented Jul 7, 2023

Removed the needs-cherry-pick-release-6.6 label because the v6.6 docs have been archived at https://docs-archive.pingcap.com/zh/tidb/v6.6 and will no longer receive new updates.

@qiancai
Collaborator

qiancai commented Aug 25, 2023

Removed the needs-cherry-pick-release-7.0 label because the v7.0 docs have been archived at https://docs-archive.pingcap.com/zh/tidb/v7.0 and will no longer receive new updates.

@qiancai
Collaborator

qiancai commented Oct 20, 2023

Removed the needs-cherry-pick-release-7.2 label because the v7.2 docs have been archived at https://docs-archive.pingcap.com/zh/tidb/v7.2 and will no longer receive new updates.

Labels
area/dm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. translation/doing This PR’s assignee is translating this PR. type/enhancement The issue or PR belongs to an enhancement.

8 participants