
TiDB temporary latency increase and interval occurs #30807

Open · VoxT opened this issue Dec 16, 2021 · 2 comments

VoxT commented Dec 16, 2021

Bug Report

1. Minimal reproduce step (Required)

Run TiDB for more than a year.
Deploy 5 TiKVs with max-replicas = 5.
Each TiKV:

  • store size = 1.1TB
  • available size = 2.6TB
  • region count = 38.8k
  • leader count = 7.7k

2. What did you expect to see? (Required)

Latency p99 should be stable.

3. What did you see instead (Required)

1. Let's start with what I observed.
  • a. The p99 latency started to increase temporarily and at intervals in the service's database statement metrics, especially for the commit statement.
    [screenshot]

  • b. Jumping to the TiDB dashboard, the KV Duration metric seems to match the latency pattern above. Other metrics look fine; CPU/MEM/IO usage is low.
    [screenshot]

Zoomed in on the 3 peaks:
[screenshot]

  • c. The root cause may come from TiKV. Let's look at the TiKV metrics.

  • c1. The RocksDB CPU peaked at 100% twice, matching the first two latency peaks above.
    [screenshot]

  • c2. Compactions occurred at the same times as the CPU peaks above.
    [screenshot]

  • c3. The Raft apply wait duration shows high latency.

[screenshot]

The Scheduler CPU is under 20%.
[screenshot]

The Raft store CPU and apply CPU are under 50%.
[screenshot]
[screenshot]
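For reference, these are the thread-pool settings I believe sit behind the Raft store / apply CPU panels; a minimal tikv.toml sketch, assuming the raftstore pool options are available in v3.0 (the values are placeholders, not my production config):

```toml
# Hypothetical tikv.toml fragment; values are placeholders for illustration only.
[raftstore]
# Threads that drive the Raft state machine (the "Raft store CPU" panel).
store-pool-size = 2
# Threads that apply committed Raft logs to RocksDB (the "apply CPU" panel).
apply-pool-size = 2
```

Since both pools stay under 50% CPU, I assume the apply wait comes from RocksDB being busy rather than from these pools being too small.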

The block cache uses the default TiKV config. The block cache hit rate is about 80%, and its pattern matches the peaks above (see the config sketch below).
[screenshot]
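For completeness, this is the setting I mean by "default block cache"; a minimal sketch assuming the per-CF block-cache options (the sizes are placeholders, I have not set them myself, so TiKV derives them from system memory):

```toml
# Hypothetical tikv.toml fragment; sizes are placeholders for illustration only.
[rocksdb.defaultcf]
block-cache-size = "4GB"   # uncompressed block cache for the default CF (row values)

[rocksdb.writecf]
block-cache-size = "2GB"   # block cache for the write CF (commit/MVCC info)
```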

2. About the RocksDB CPU

I have read this issue, which said that the RocksDB CPU reaching 100% is normal. But from the observations above, I can't tell whether it impacts TiDB latency or not, because the only pattern that matches the commit-statement latency peaks is the RocksDB CPU.

In my opinion, compaction pushes the RocksDB CPU to 100% and leads to resource exhaustion: the Raft group's apply-log step has to wait longer, which in turn impacts the commit operation. Resource exhaustion on RocksDB also makes other RocksDB operations slower.

So, please help me if there is some way to resolve or optimize this issue.
For example: how can I reduce the RocksDB CPU used while compaction is running?
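To make the question concrete, this is the kind of tuning I have in mind; a rough sketch assuming the [rocksdb] rate-limiter and background-job options apply to v3.0 (values are placeholders, not recommendations):

```toml
# Hypothetical tikv.toml fragment; values are placeholders, not recommendations.
[rocksdb]
# Cap the total flush + compaction I/O rate so a compaction burst cannot
# monopolize the disk and CPU. (On some versions this may be a plain byte
# count rather than a size string.)
rate-bytes-per-sec = "100MB"
# Fewer background jobs means fewer compaction threads running at once.
max-background-jobs = 4
```

Whether throttling compaction like this is safe for my write load is exactly the kind of advice I am looking for.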

Thank you.

4. What is your TiDB version? (Required)

Release Version: v3.0.12
Git Commit Hash: 8c4696b
Git Branch: heads/refs/tags/v3.0.12
UTC Build Time: 2020-03-16 09:56:22
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

VoxT added the type/bug label on Dec 16, 2021
sticnarf (Contributor) commented:

I think it's okay for the RocksDB CPU to be at 100%. The jitter may have various causes, and it's hard to tell from the given information.

In the past few versions, we have been improving the stability of TiKV. I hope you can try a later version of TiDB + TiKV to see whether you will encounter the issue again :)

sticnarf removed the type/bug and severity/major labels on Dec 21, 2021
sticnarf (Contributor) commented:

I'm removing the bug tag because this is a performance issue in a rather old version.
