Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added descriptions about slow node detection #6174

Merged
merged 9 commits into from
Aug 23, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions best-practices/pd-scheduling-best-practices.md
Original file line number Diff line number Diff line change
Expand Up @@ -280,3 +280,5 @@ For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of regions are inaccur
If a TiKV node fails, PD defaults to setting the corresponding node to the **down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and rebalancing replicas for regions involved.

Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas soon in another node and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout.

In TiDB v5.2.0, TiKV introduces the mechanism of slow TiKV node detection. By sampling the requests in TiKV, it calculates a score ranging from 1 to 100. A TiKV node with a score greater than or equal to 80 is marked as slow. You can add [`evict-slow-store-scheduler`](/pd-control.md#scheduler-show--add--remove--pause--resume--config) to detect and schedule slow nodes. When one and only one slow node appears, and the slow score reaches the upper limit (100 by default), all leaders in the node will be evicted.
3 changes: 2 additions & 1 deletion pd-control.md
Original file line number Diff line number Diff line change
Expand Up @@ -700,7 +700,8 @@ Usage:
>> scheduler add evict-leader-scheduler 1 // Move all the Region leaders on store 1 out
>> scheduler config evict-leader-scheduler // Display the stores in which the scheduler is located since v4.0.0
>> scheduler add shuffle-leader-scheduler // Randomly exchange the leader on different stores
>> scheduler add shuffle-region-scheduler // Randomly scheduling the regions on different stores
>> scheduler add shuffle-region-scheduler // Randomly scheduling the Regions on different stores
>> scheduler add evict-slow-store-scheduler // When there is one and only one slow store, evict all Region leaders of that store
>> scheduler remove grant-leader-scheduler-1 // Remove the corresponding scheduler, and `-1` corresponds to the store ID
>> scheduler pause balance-region-scheduler 10 // Pause the balance-region scheduler for 10 seconds
>> scheduler pause all 10 // Pause all schedulers for 10 seconds
Expand Down
7 changes: 7 additions & 0 deletions tikv-configuration-file.md
Original file line number Diff line number Diff line change
Expand Up @@ -676,6 +676,13 @@ Configuration items related to Raftstore
+ Controls whether to enable batch processing of the requests. When it is enabled, the write performance is significantly improved.
+ Default value: `true`

### `inspect-interval`

+ At a certain interval, TiKV inspects the latency of the Raftstore component. This parameter specifies the interval of the inspection. If the latency exceeds this value, this inspection is marked as timeout.
+ Judges whether the TiKV node is slow based on the ratio of timeout inspection.
+ Default value: `"500ms"`
+ Minimum value: `"1ms"`

## Coprocessor

Configuration items related to Coprocessor
Expand Down