
*: add Grafana monitoring best practices #1772

Merged · 23 commits · Jan 19, 2020

Commits:
- `7475d23` tools: add sql-mode config for sync-diff-inspector (junlan-zhang, Nov 13, 2019)
- `90664ba` Revert "tools: add sql-mode config for sync-diff-inspector" (junlan-zhang, Nov 13, 2019)
- `1812c6d` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Nov 19, 2019)
- `57fbcae` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Nov 25, 2019)
- `1347dbf` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Nov 29, 2019)
- `a5d2047` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Dec 9, 2019)
- `d816428` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Dec 10, 2019)
- `8ed6601` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Dec 11, 2019)
- `3accd66` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Dec 13, 2019)
- `557960a` Merge remote-tracking branch 'upstream/master' (junlan-zhang, Dec 19, 2019)
- `c66f39b` Merge remote-tracking branch 'upstream/master' into add-grafana-monit… (junlan-zhang, Dec 31, 2019)
- `5c6c41b` reference/best-practices: add grafana best practices (junlan-zhang, Jan 15, 2020)
- `dc08388` Merge remote-tracking branch 'upstream/master' into add-grafana-monit… (junlan-zhang, Jan 15, 2020)
- `a0294ff` reference/best-practices: add grafana best practices for v3.1 (junlan-zhang, Jan 15, 2020)
- `4a4bda9` reference: fix urls (junlan-zhang, Jan 15, 2020)
- `5d2cfea` reference: update invalid links (junlan-zhang, Jan 15, 2020)
- `9bd2e5a` reference: wording edits (junlan-zhang, Jan 15, 2020)
- `ac54d1c` reference: improve wording (junlan-zhang, Jan 15, 2020)
- `b75de91` reference: update wording (junlan-zhang, Jan 15, 2020)
- `772f27b` Merge remote-tracking branch 'upstream/master' into add-grafana-monit… (junlan-zhang, Jan 15, 2020)
- `b34de18` Merge remote-tracking branch 'upstream/master' into add-grafana-monit… (junlan-zhang, Jan 17, 2020)
- `0228c8a` reference/best-practices: update wording (junlan-zhang, Jan 17, 2020)
- `34b51cd` Merge branch 'master' into add-grafana-monitor-doc (junlan-zhang, Jan 19, 2020)
1 change: 1 addition & 0 deletions dev/TOC.md
@@ -291,6 +291,7 @@
- [Highly Concurrent Write Best Practices](/dev/reference/best-practices/high-concurrency.md)
- [HAProxy Best Practices](/dev/reference/best-practices/haproxy.md)
- [PD Scheduling Best Practices](/dev/reference/best-practices/pd-scheduling.md)
- [Grafana Best Practices](/dev/reference/best-practices/grafana-monitor.md)
- [TiSpark](/dev/reference/tispark.md)
+ TiDB Binlog
- [Overview](/dev/reference/tidb-binlog/overview.md)
206 changes: 206 additions & 0 deletions dev/reference/best-practices/grafana-monitor.md
@@ -0,0 +1,206 @@
---
title: Best Practices for Monitoring TiDB Using Grafana
summary: Learn seven tips for efficiently using Grafana to monitor TiDB.
category: reference
---

# Best Practices for Monitoring TiDB Using Grafana

When you [deploy a TiDB cluster using TiDB Ansible](/dev/how-to/deploy/orchestrated/ansible.md), a [Grafana + Prometheus monitoring platform](/dev/how-to/monitor/overview.md) is deployed at the same time to collect and display metrics for the various components and machines in the TiDB cluster. This document describes best practices for monitoring TiDB using Grafana. It aims to help you use metrics to analyze the status of the TiDB cluster and diagnose problems.

## Monitoring architecture

[Prometheus](https://prometheus.io/) is a time series database with a multi-dimensional data model and a flexible query language. [Grafana](https://grafana.com/) is an open source monitoring system for analyzing and visualizing metrics.

![The monitoring architecture in the TiDB cluster](/media/prometheus-in-tidb.png)

Since TiDB 2.1.3, TiDB monitoring uses the pull method instead of the push method used in earlier versions. This adjustment brings the following benefits (a quick way to verify the pull model is sketched after this list):

- There is no need to restart the entire TiDB cluster if you need to migrate Prometheus. Previously, migrating Prometheus required restarting the entire cluster because the target address had to be updated.
- You can deploy two separate Grafana + Prometheus monitoring platforms (not highly available) to avoid a single point of monitoring. To do this, run the TiDB Ansible deployment command twice with different IP addresses.
- The Pushgateway, which might become a single point of failure, is removed.
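
Because Prometheus now pulls metrics, it keeps the full list of scrape targets itself. As a minimal sketch (assuming the default Prometheus port `9090`; `__prometheus_ip__` is a placeholder), you can list the targets and their health through the Prometheus API:

{{< copyable "shell-regular" >}}

```bash
# List all scrape targets that Prometheus pulls from, with their health.
# __prometheus_ip__ is a placeholder; 9090 is the default Prometheus port.
curl -s http://__prometheus_ip__:9090/api/v1/targets | python -m json.tool
```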

## Source and display of monitoring data

The three core components of TiDB (TiDB server, TiKV server, and PD server) expose their metrics through an HTTP interface. These metrics are collected by instrumentation in the program code and are exposed on the following default ports:

| Component | Port |
| :---------- |:----- |
| TiDB server | 10080 |
| TiKV server | 20181 |
| PD server | 2379 |

Taking the TiDB server as an example, execute the following command to check the QPS of SQL statements through the HTTP interface:

{{< copyable "shell-regular" >}}

```bash
curl http://__tidb_ip__:10080/metrics | grep tidb_executor_statement_total
```

```
# Check the real-time QPS of different types of SQL statements. The numbers below are cumulative counter values (some shown in scientific notation).
tidb_executor_statement_total{type="Delete"} 520197
tidb_executor_statement_total{type="Explain"} 1
tidb_executor_statement_total{type="Insert"} 7.20799402e+08
tidb_executor_statement_total{type="Select"} 2.64983586e+08
tidb_executor_statement_total{type="Set"} 2.399075e+06
tidb_executor_statement_total{type="Show"} 500531
tidb_executor_statement_total{type="Use"} 466016
```
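
Because these counters are cumulative, the QPS is their rate of change rather than their value. As a rough sketch (reusing the `__tidb_ip__` placeholder above), you can approximate the `Select` QPS by sampling the counter twice:

{{< copyable "shell-regular" >}}

```bash
# Sample the cumulative Select counter twice, 10 seconds apart,
# and divide the delta by the interval to approximate the QPS.
v1=$(curl -s http://__tidb_ip__:10080/metrics | awk '$1 == "tidb_executor_statement_total{type=\"Select\"}" {print $2}')
sleep 10
v2=$(curl -s http://__tidb_ip__:10080/metrics | awk '$1 == "tidb_executor_statement_total{type=\"Select\"}" {print $2}')
# awk parses scientific notation such as 2.64983586e+08 natively.
awk -v a="$v1" -v b="$v2" 'BEGIN {printf "Select QPS ~= %.1f\n", (b - a) / 10}'
```

This is essentially the calculation that the `rate()` function described below performs inside Prometheus.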

The data above is stored in Prometheus and displayed on Grafana. Right-click the panel and then click the **Edit** button (or directly press the <kbd>E</kbd> key), as shown in the following figure:

![The Edit entry for the Metrics tab](/media/best-practices/metric-board-edit-entry.png)

After clicking the **Edit** button, you can see the query expression with the `tidb_executor_statement_total` metric name on the **Metrics** tab. The meanings of some items on the panel are as follows (the full expression they combine into is sketched after this list):

- `rate[1m]`: the growth rate over one minute. It can only be used with counter-type data.
- `sum`: the sum of the values.
- `by type`: the summed data is grouped by the `type` label of the original metric.
- `Legend format`: the format of the metric name shown in the legend.
- `Resolution`: the step width, which defaults to 15 seconds. Resolution determines whether one data point is generated for multiple pixels.
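
Combined, these items form the panel's query expression. As a sketch (assuming the `__prometheus_ip__` placeholder and the default port `9090`), you can evaluate the same expression directly against the Prometheus API:

{{< copyable "shell-regular" >}}

```bash
# Evaluate the panel's query expression directly against Prometheus.
# curl -G with --data-urlencode takes care of the URL encoding.
curl -s -G 'http://__prometheus_ip__:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(tidb_executor_statement_total[1m])) by (type)' \
  | python -m json.tool
```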

The query expression on the **Metrics** tab is as follows:

![The query expression on the Metrics tab](/media/best-practices/metric-board-expression.jpeg)

Prometheus supports many query expressions and functions. For more details, refer to the [official Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying).

## Grafana tips

This section introduces seven tips for efficiently using Grafana to monitor and analyze the metrics of TiDB.

### Tip 1: Check all dimensions and edit the query expression

In the example shown in the [source and display of monitoring data](#source-and-display-of-monitoring-data) section, the data is grouped by type. If you want to know whether you can group the data by other dimensions and quickly check which dimensions are available, you can use the following method: **keep only the metric name in the query expression, with no calculation, and leave the `Legend format` field blank**. In this way, the original metrics are displayed. For example, the following figure shows that there are three dimensions (`instance`, `job`, and `type`):

![Edit query expression and check all dimensions](/media/best-practices/edit-expression-check-dimensions.jpg)

Then you can modify the query expression by adding the `instance` dimension after `type`, and adding `{{instance}}` to the `Legend format` field. In this way, you can check the QPS of different types of SQL statements executed on each TiDB server (a text version of the modified query is sketched after the following figure):

![Add an instance dimension to the query expression](/media/best-practices/add-instance-dimension.jpeg)
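
A sketch of the modified query (the exact panel settings may vary by dashboard version; `__prometheus_ip__` is a placeholder as before):

{{< copyable "shell-regular" >}}

```bash
# The same query grouped by both type and instance. In the panel,
# the Legend format field would then be set to: {{instance}} {{type}}
curl -s -G 'http://__prometheus_ip__:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(tidb_executor_statement_total[1m])) by (type, instance)' \
  | python -m json.tool
```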

### Tip 2: Switch the scale of the Y-axis

Take Query Duration as an example: the Y-axis defaults to a binary logarithmic scale (log<sub>2</sub>n), which narrows the gap between values in the display. To amplify changes, you can switch it to a linear scale. Comparing the following two figures, you can easily notice the difference in display and locate the time when an SQL statement runs slowly.

Of course, a linear scale does not suit all situations. For example, if you observe a performance trend over a month, a linear scale might show too much noise, making the trend hard to observe.

The Y-axis uses a binary logarithmic scale by default:

![The Y-axis uses a binary logarithmic scale](/media/best-practices/default-axes-scale.jpg)

Switch the Y-axis to a linear scale:

![Switch to a linear scale](/media/best-practices/axes-scale-linear.jpg)

> **Tip:**
>
> Combining tip 2 with tip 1, you can find a `sql_type` dimension that helps you immediately determine whether `SELECT` or `UPDATE` statements are slow; you can even locate the instance with slow SQL statements. A sketch of such a query follows.
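
For example, the query duration histogram carries a `sql_type` label. The following is a sketch of a per-statement-type latency query (assuming the metric name `tidb_server_handle_query_duration_seconds_bucket`, which may vary between TiDB versions, and the `__prometheus_ip__` placeholder):

{{< copyable "shell-regular" >}}

```bash
# 99th-percentile query duration grouped by sql_type, which shows at a
# glance whether SELECT or UPDATE statements are the slow ones.
curl -s -G 'http://__prometheus_ip__:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le, sql_type))' \
  | python -m json.tool
```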

### Tip 3: Modify the baseline of the Y-axis to amplify changes

You might still be unable to see the trend after switching to the linear scale. For example, in the following figure, you want to observe the real-time change of `Store size` after scaling the cluster, but due to the large baseline, small changes are not visible. In this situation, you can change the baseline of the Y-axis from `0` to `auto` to zoom in on the upper part. Comparing the two figures below, you can see that data migration has begun.

The baseline defaults to `0`:

![Baseline defaults to 0](/media/best-practices/default-y-min.jpeg)

Change the baseline to `auto`:

![Change the baseline to auto](/media/best-practices/y-min-auto.jpg)

### Tip 4: Use Shared crosshair or Tooltip

In the **Settings** panel, there is a **Graph Tooltip** option, which defaults to **Default**.

![Graphic presentation tools](/media/best-practices/graph-tooltip.jpeg)

You can set it to **Shared crosshair** and **Shared Tooltip** respectively and test the effect, as shown in the following figures. The crosshairs and tooltips are then displayed across panels in linkage, which makes it convenient to confirm the correlation between two metrics when diagnosing problems.

Set the graphic presentation tool to **Shared crosshair**:

![Set the graphical presentation tool to Shared crosshair](/media/best-practices/graph-tooltip-shared-crosshair.jpeg)

Set the graphical presentation tool to **Shared Tooltip**:

![Set the graphic presentation tool to Shared Tooltip](/media/best-practices/graph-tooltip-shared-tooltip.jpg)

### Tip 5: Enter `IP address:port number` to check historical metrics

The PD dashboard only shows the metrics of the current leader. If you want to check the status of a historical PD leader that no longer appears in the drop-down list of the `instance` field, you can manually enter `IP address:2379` to check the leader's data.

![Check the metrics in history](/media/best-practices/manually-input-check-metric.jpeg)

### Tip 6: Use the `Avg` function

Generally, only the `Max` and `Current` functions are enabled in the legend by default. When the metrics fluctuate greatly, you can add other summary functions, such as `Avg`, to the legend to check the overall trend over a period of time.

Add summary functions such as the `Avg` function:

![Add summary functions such as Avg](/media/best-practices/add-avg-function.jpeg)

Then check the overall trend:

![Add Avg function to check the overall trend](/media/best-practices/add-avg-function-check-trend.jpg)

### Tip 7: Use the API of Prometheus to obtain the result of query expressions

Grafana obtains data through the Prometheus API, and you can use this API to obtain information as well. It has the following additional usages:

- Automatically obtaining information such as the cluster size and status.
- Making minor changes to an expression to provide information for a report, such as counting the total QPS per day, the daily peak QPS, and the daily response time (a sketch follows the example output below).
- Performing regular health inspections of important metrics.

The API of Prometheus is shown as follows:

![The API of Prometheus](/media/best-practices/prometheus-api-interface.jpg)

{{< copyable "shell-regular" >}}

```bash
curl -u user:pass 'http://__grafana_ip__:3000/api/datasources/proxy/1/api/v1/query_range?query=sum(tikv_engine_size_bytes%7Binstance%3D%22xxxxxxxxx%3A20181%22%7D)%20by%20(instance)&start=1565879269&end=1565882869&step=30' | python -m json.tool
```

```
{
    "data": {
        "result": [
            {
                "metric": {
                    "instance": "xxxxxxxxxx:20181"
                },
                "values": [
                    [
                        1565879269,
                        "1006046235280"
                    ],
                    [
                        1565879299,
                        "1006057877794"
                    ],
                    [
                        1565879329,
                        "1006021550039"
                    ],
                    [
                        1565879359,
                        "1006021550039"
                    ],
                    [
                        1565882869,
                        "1006132630123"
                    ]
                ]
            }
        ],
        "resultType": "matrix"
    },
    "status": "success"
}
```
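
For the reporting use mentioned above, the same API can drive simple scripts. For example, here is a sketch of counting the daily peak QPS (assuming Prometheus 2.7 or later for the subquery syntax; `user:pass`, `__grafana_ip__`, and the datasource ID `1` are placeholders as in the example above):

{{< copyable "shell-regular" >}}

```bash
# Peak total QPS over the past day: the PromQL subquery re-evaluates
# the 1-minute rate every minute across 1 day, and max_over_time
# takes the maximum of those samples.
curl -s -u user:pass -G \
  'http://__grafana_ip__:3000/api/datasources/proxy/1/api/v1/query' \
  --data-urlencode 'query=max_over_time(sum(rate(tidb_executor_statement_total[1m]))[1d:1m])' \
  | python -m json.tool
```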

## Summary

The Grafana + Prometheus monitoring platform is a very powerful tool. Using it well can improve efficiency and save you a lot of time when analyzing the status of a TiDB cluster. More importantly, it helps you diagnose problems. This tool is especially useful in the operation and maintenance of TiDB clusters with a large amount of data.
Binary file added media/best-practices/add-avg-function.jpeg
Binary file added media/best-practices/add-instance-dimension.jpeg
Binary file added media/best-practices/axes-scale-linear.jpg
Binary file added media/best-practices/default-axes-scale.jpg
Binary file added media/best-practices/default-y-min.jpeg
Binary file added media/best-practices/graph-tooltip.jpeg
Binary file added media/best-practices/metric-board-edit-entry.png
Binary file added media/best-practices/prometheus-api-interface.jpg
Binary file added media/best-practices/y-min-auto.jpg
1 change: 1 addition & 0 deletions v2.1/TOC.md
@@ -261,6 +261,7 @@
+ Best Practices
- [HAProxy Best Practices](/v2.1/reference/best-practices/haproxy.md)
- [PD Scheduling Best Practices](/v2.1/reference/best-practices/pd-scheduling.md)
- [Grafana Best Practices](/v2.1/reference/best-practices/grafana-monitor.md)
- [TiSpark](/v2.1/reference/tispark.md)
+ TiDB Binlog
- [Overview](/v2.1/reference/tidb-binlog/overview.md)