add doc for slow store detection #6831

Merged
merged 3 commits into from
Aug 12, 2021
2 changes: 2 additions & 0 deletions best-practices/pd-scheduling-best-practices.md
@@ -280,3 +280,5 @@ Slow Region Merge is also very likely caused by the limit configuration (`merge-sc
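Without manual intervention, PD's default behavior when a TiKV node fails is to wait half an hour (adjustable through the `max-store-down-time` configuration), then set the node to the Down state and start replenishing replicas for the affected Regions.

In practice, if you are sure the node failure is unrecoverable, you can take the node offline immediately so that PD replenishes the replicas as soon as possible, which reduces the risk of data loss. Conversely, if you are sure the node can be recovered but probably not within half an hour, you can temporarily set `max-store-down-time` to a larger value. This avoids unnecessary replica replenishment after the timeout and the resource waste it causes.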
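
The timeout rule described above can be sketched in a few lines. This is only an illustration of the decision, not PD's implementation: the function name `is_store_down`, the use of a last-heartbeat timestamp, and the strict `>` comparison are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Default wait taken from the text above (half an hour).
DEFAULT_MAX_STORE_DOWN_TIME = timedelta(minutes=30)


def is_store_down(last_heartbeat, now,
                  max_store_down_time=DEFAULT_MAX_STORE_DOWN_TIME):
    """Hypothetical helper: True once the store has been silent
    for longer than max_store_down_time."""
    return now - last_heartbeat > max_store_down_time


if __name__ == "__main__":
    t0 = datetime(2021, 8, 12, 12, 0)
    # Still within the default window: no replica replenishment yet.
    print(is_store_down(t0, t0 + timedelta(minutes=29)))
    # Past the default window: the store would be treated as Down.
    print(is_store_down(t0, t0 + timedelta(minutes=31)))
    # With a temporarily raised limit, the same outage is tolerated.
    print(is_store_down(t0, t0 + timedelta(minutes=31), timedelta(hours=2)))
```

Raising the limit, as in the last call, corresponds to the case where the node is expected to recover but not within half an hour.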

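TiKV 5.2.0 introduces a slow-node detection mechanism: it samples requests in TiKV and computes a score in the range 1 to 100. When a node's score is greater than 80, the node is marked as Slow. You can add `evict-slow-store-scheduler` to detect and schedule slow nodes accordingly. Currently, the only supported behavior is that if and only if exactly one slow node exists, all leaders on that node are evicted (similar in effect to `evict-leader-scheduler`).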
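
As a rough illustration of the rule in this paragraph, the sketch below marks a store as slow when its score is greater than 80 and picks an eviction target only when exactly one store is slow. The names `is_slow` and `store_to_evict`, and the dictionary of scores, are hypothetical and do not reflect how PD or TiKV actually implement the scheduler.

```python
# Threshold taken from the paragraph above.
SLOW_SCORE_THRESHOLD = 80


def is_slow(score):
    """A store counts as slow when its score is greater than 80."""
    return score > SLOW_SCORE_THRESHOLD


def store_to_evict(scores):
    """Given a mapping of store ID to slow score, return the ID of the
    store whose leaders should be evicted, or None.

    Eviction happens if and only if exactly one store is slow.
    """
    slow = [store_id for store_id, score in scores.items() if is_slow(score)]
    return slow[0] if len(slow) == 1 else None


if __name__ == "__main__":
    print(store_to_evict({1: 10, 2: 95, 3: 20}))  # one slow store
    print(store_to_evict({1: 90, 2: 95, 3: 20}))  # two slow stores
    print(store_to_evict({1: 80, 2: 40}))         # 80 is not greater than 80
```

With two or more slow stores the sketch returns `None`, matching the "exactly one slow node" restriction stated above.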
7 changes: 7 additions & 0 deletions tikv-configuration-file.md
@@ -638,6 +638,13 @@ Configuration items related to raftstore.
+ Default value: 1
+ Minimum value: greater than 0

### `inspect-interval`

+ TiKV periodically inspects the latency of Raftstore. An inspection whose latency exceeds this value is recorded as a timeout.
+ Whether a TiKV node is a slow node is judged from the proportion of inspections that timed out.
+ Default value: 500ms
+ Minimum value: 1ms
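
The timeout proportion mentioned above can be sketched as follows. The function name `timeout_ratio` and the list of sampled latencies are assumptions for illustration. How TiKV turns this proportion into a slow score is not stated here, so the sketch stops at the ratio.

```python
def timeout_ratio(latencies_ms, inspect_interval_ms=500):
    """Return the fraction of inspections whose latency exceeded
    inspect_interval_ms (default taken from the item above)."""
    if not latencies_ms:
        return 0.0
    timeouts = sum(1 for latency in latencies_ms
                   if latency > inspect_interval_ms)
    return timeouts / len(latencies_ms)


if __name__ == "__main__":
    # Two of the four sampled inspections exceed 500 ms.
    print(timeout_ratio([100, 600, 700, 200]))
    # A lower inspect-interval makes the same samples count as timeouts.
    print(timeout_ratio([100, 600, 700, 200], inspect_interval_ms=50))
```

A smaller `inspect-interval` makes the check stricter, since more inspections count as timeouts for the same measured latencies.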

## coprocessor

Configuration items related to coprocessor.