
TiDB temporary latency increase and interval occurs #30807

Open · VoxT opened this issue Dec 16, 2021 · 2 comments

VoxT commented Dec 16, 2021

Bug Report

1. Minimal reproduce step (Required)

Run TiDB for more than a year.
Deploy 5 TiKVs with max-replicas = 5.
Each TiKV:

  • store size = 1.1TB
  • available size = 2.6TB
  • region count = 38.8k
  • leader count = 7.7k

2. What did you expect to see? (Required)

Latency p99 should be stable.

3. What did you see instead (Required)

1. Let's start with what I observed.
  • a. The p99 latency started to increase temporarily and at intervals in the service's database statement metrics, especially for the commit statement.
    [screenshot]

  • b. Jumping to the TiDB dashboard, the KV Duration metric seems to match the latency pattern above. Other metrics look fine; CPU/MEM/IO usage is low.
    [screenshot]

Zoomed in on the 3 peaks:
[screenshot]

  • c. The root cause may come from TiKV. Let's look at the TiKV metrics.

  • c1. The RocksDB CPU peaked at 100% twice, matching the first two latency peaks above.
    [screenshot]

  • c2. Compactions occurred at the same times as the CPU peaks above.
    [screenshot]

  • c3. The Raft apply wait duration shows high latency.

[screenshot]

The Scheduler CPU is under 20%.
[screenshot]

The Raft store CPU and apply CPU are under 50%.
[screenshot]
[screenshot]
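For reference, these are the thread-pool settings I believe sit behind the Raft store / apply CPU panels; a minimal tikv.toml sketch, assuming the raftstore pool options are available in v3.0 (the values are placeholders, not my production config):

```toml
# Hypothetical tikv.toml fragment; values are placeholders for illustration only.
[raftstore]
# Threads that drive the Raft state machine (the "Raft store CPU" panel).
store-pool-size = 2
# Threads that apply committed Raft logs to RocksDB (the "apply CPU" panel).
apply-pool-size = 2
```

Since both pools stay under 50% CPU, I assume the apply wait comes from RocksDB being busy rather than from these pools being too small.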

The block cache uses the default TiKV config. The block cache hit rate is about 80%, and its pattern matches the peaks above (see the config sketch below).
[screenshot]
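For completeness, this is the setting I mean by "default block cache"; a minimal sketch assuming the per-CF block-cache options (the sizes are placeholders, I have not set them myself, so TiKV derives them from system memory):

```toml
# Hypothetical tikv.toml fragment; sizes are placeholders for illustration only.
[rocksdb.defaultcf]
block-cache-size = "4GB"   # uncompressed block cache for the default CF (row values)

[rocksdb.writecf]
block-cache-size = "2GB"   # block cache for the write CF (commit/MVCC info)
```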

2. About the RocksDB CPU

I have read this issue, which said that the RocksDB CPU reaching 100% is normal. But from the observations above, I can't tell whether it impacts TiDB latency or not, because the only pattern that matches the commit-statement latency peaks is the RocksDB CPU.

In my opinion, compaction pushes the RocksDB CPU to 100% and leads to resource exhaustion: the Raft group's apply-log step has to wait longer, which in turn impacts the commit operation. Resource exhaustion on RocksDB also makes other RocksDB operations slower.

So, please help me if there is some way to resolve or optimize this issue.
For example: how can I reduce the RocksDB CPU used while compaction is running?
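To make the question concrete, this is the kind of tuning I have in mind; a rough sketch assuming the [rocksdb] rate-limiter and background-job options apply to v3.0 (values are placeholders, not recommendations):

```toml
# Hypothetical tikv.toml fragment; values are placeholders, not recommendations.
[rocksdb]
# Cap the total flush + compaction I/O rate so a compaction burst cannot
# monopolize the disk and CPU. (On some versions this may be a plain byte
# count rather than a size string.)
rate-bytes-per-sec = "100MB"
# Fewer background jobs means fewer compaction threads running at once.
max-background-jobs = 4
```

Whether throttling compaction like this is safe for my write load is exactly the kind of advice I am looking for.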

Thank you.

4. What is your TiDB version? (Required)

Release Version: v3.0.12
Git Commit Hash: 8c4696b
Git Branch: heads/refs/tags/v3.0.12
UTC Build Time: 2020-03-16 09:56:22
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

VoxT added the type/bug label on Dec 16, 2021
sticnarf (Contributor) commented:

I think it's okay for the RocksDB CPU to be at 100%. The jitter may have various causes, and it's hard to tell from the given information.

In the past few versions, we have been improving the stability of TiKV. I hope you can try a later version of TiDB + TiKV to see whether you will encounter the issue again :)

sticnarf removed the type/bug and severity/major labels on Dec 21, 2021
sticnarf (Contributor) commented:

I'm removing the bug tag because this is a performance issue in a rather old version.
