Create choose-index.md #3079

Merged
merged 25 commits into from
Jul 16, 2020
Changes from 5 commits
77 changes: 77 additions & 0 deletions choose-index.md
@@ -0,0 +1,77 @@
---
title: Index Selection
category: performance
---

# Index Selection

Reading data from storage engines is one of the most time-consuming steps in SQL execution. Currently, TiDB can read data from different storage engines and through different indexes. Query performance depends largely on whether a suitable index is selected.

This section introduces how TiDB selects an index to access a table, and some related ways to control index selection.

## Access Tables

Before introducing index selection, it is important to understand the ways TiDB accesses tables, what triggers each way, what differences each way makes, and what the pros and cons are.

### Operators for Accessing Tables

| Operator | Trigger Conditions | Applicable Scenarios | Explanations |
| :------- | :------- | :------- | :---- |
| PointGet / BatchPointGet | The access range on the table consists of one or more single-point ranges. | Any scenario | If triggered, it is usually considered the fastest operator, because it calls the kvget interface directly to perform the calculation instead of calling the coprocessor interface. |
| TableReader | None | Any scenario | It is generally considered the least efficient operator, because it scans table data directly from TiKV. It is selected only if there is a range query on the `_tidb_rowid` column, or if there is no other operator for accessing the table to choose from. |
| TableReader | The table has a replica on a TiFlash node. | There are few columns to read, but many rows to evaluate. | TiFlash is a columnar storage engine. If a query computes over a small number of columns and a large number of rows, this operator is recommended. |
| IndexReader | The table has one or more indexes, and the columns needed for the calculation are all included in an index. | There is a small range query on the index, or there is an order requirement on the indexed columns. | When multiple indexes exist, a reasonable index is selected based on the cost estimation. |
| IndexLookupReader | The table has one or more indexes, and the columns needed for the calculation are not completely included in the index. | Same as IndexReader. | Because the index does not completely cover the calculated columns, TiDB needs to retrieve rows from the table after reading the index. This adds an extra cost compared to the IndexReader operator. |

> Note:
>
> The TableReader operator is based on the `_tidb_rowid` column index, and TiFlash uses a column storage index, so selecting an index amounts to selecting an operator for accessing the table.

## Index Selection

TiDB provides a heuristic rule named Skyline-Pruning, which is applied on top of the cost estimation of each operator for accessing tables. It can reduce the probability of selecting a wrong index because of a wrong estimation.

### Skyline-Pruning

Skyline-Pruning is a heuristic filtering rule for indexes. An index is judged on the following three dimensions:

- Whether TiDB needs to retrieve rows from the table when it uses the index to access the table (that is, whether the plan generated by the index uses the IndexReader operator or the IndexLookupReader operator). Indexes that do not need to retrieve rows from the table are better on this dimension than indexes that do.

- Whether the index satisfies a certain order. Because reading an index can guarantee the order of certain column sets, indexes that satisfy the order required by the query are better on this dimension than indexes that do not.

- How many access conditions are covered by the indexed columns. An “access condition” is a `WHERE` condition that can be converted to a column range. The more access conditions an indexed column set covers, the better the index is on this dimension.

For these three dimensions, if an index named idx_a is not worse than an index named idx_b on all three dimensions and is better than idx_b on at least one of them, then idx_a is preferred.
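The dominance rule above can be illustrated with a minimal Python sketch. It is not TiDB's implementation: the `Candidate` class, its field names, and the sample indexes are hypothetical, and the third dimension is simplified to a plain count of covered access conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one index on the three dimensions."""
    name: str
    no_table_lookup: bool   # True if rows need not be retrieved from the table
    satisfies_order: bool   # True if the index provides the required order
    covered_conditions: int  # number of access conditions the index covers


def dominates(a: Candidate, b: Candidate) -> bool:
    """a is not worse than b on every dimension and better on at least one."""
    dims_a = (a.no_table_lookup, a.satisfies_order, a.covered_conditions)
    dims_b = (b.no_table_lookup, b.satisfies_order, b.covered_conditions)
    not_worse = all(x >= y for x, y in zip(dims_a, dims_b))
    better = any(x > y for x, y in zip(dims_a, dims_b))
    return not_worse and better


def skyline_prune(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    Candidate("idx_a", no_table_lookup=True, satisfies_order=True, covered_conditions=2),
    Candidate("idx_b", no_table_lookup=False, satisfies_order=True, covered_conditions=1),
    Candidate("idx_c", no_table_lookup=False, satisfies_order=False, covered_conditions=3),
]
survivors = [c.name for c in skyline_prune(candidates)]
```

In this sample, idx_b is pruned because idx_a is at least as good on every dimension and better on two of them. idx_c survives even though it needs a table lookup, because it covers more access conditions than idx_a, so neither of the two dominates the other.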

### Selection Based on the Cost Estimation

After the Skyline-Pruning rule filters out inappropriate indexes, the selection among the remaining indexes is based entirely on the cost estimation. The cost estimation of accessing a table takes the following factors into account:

- The average length of each row of the indexed data in the storage engine.
- The number of rows in the query range generated by the index.
- The cost for retrieving rows from a table.
- The number of ranges generated by the index during query execution.

According to these factors and the cost model, the optimizer selects the index with the lowest cost to access the table.
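As a rough illustration of how these factors could combine, the following Python sketch picks the cheapest access path from a toy cost function. The formula, the `AccessPath` class, the factor values, and the sample numbers are all assumptions made for illustration. They are not TiDB's actual cost model.

```python
from dataclasses import dataclass


@dataclass
class AccessPath:
    """Hypothetical description of one candidate index."""
    name: str
    avg_row_size: float        # average length of each indexed row, in bytes
    rows: float                # estimated number of rows in the query range
    ranges: int                # number of ranges generated by the index
    needs_table_lookup: bool   # whether rows must be retrieved from the table
    table_row_size: float = 0.0  # average length of a full table row, in bytes


def toy_cost(path: AccessPath,
             network_factor: float = 1.0,
             seek_factor: float = 20.0) -> float:
    """Toy cost: data read from the index, plus one seek per range,
    plus the data fetched from the table when a lookup is needed."""
    cost = path.rows * path.avg_row_size * network_factor
    cost += path.ranges * seek_factor
    if path.needs_table_lookup:
        cost += path.rows * path.table_row_size * network_factor
    return cost


paths = [
    AccessPath("idx_covering", avg_row_size=16, rows=1000, ranges=1,
               needs_table_lookup=False),
    AccessPath("idx_lookup", avg_row_size=8, rows=200, ranges=1,
               needs_table_lookup=True, table_row_size=100),
]
# The optimizer keeps the candidate with the lowest estimated cost.
best = min(paths, key=toy_cost)
```

With these made-up numbers, idx_lookup scans fewer index rows, but the cost of retrieving full rows from the table outweighs that saving, so idx_covering has the lower total cost.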

#### Common Tuning Problems with Cost-Based Selection

1. Why is the estimated number of rows inaccurate?

This is usually caused by stale or inaccurate statistics. You can re-run the `ANALYZE TABLE` statement or modify its parameters.

2. The statistics are accurate and reading from TiFlash is faster. Why does the optimizer choose to read from TiKV?

At present, the cost model that distinguishes TiFlash from TiKV is still rough. You can decrease the value of the `tidb_opt_seek_factor` variable so that the optimizer prefers TiFlash.

3. The statistics are accurate. One index needs to retrieve rows from the table, but it actually executes faster than another index that does not. Why does the optimizer select the index that does not retrieve rows from the table?

In this case, the estimated cost of retrieving rows from the table might be too high. You can decrease the value of the `tidb_opt_network_factor` variable to reduce the estimated cost of retrieving rows from the table.

## Control Index Selection

Index selection can be controlled for a single query through [Optimizer Hints](/optimizer-hints.md).

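- `USE_INDEX` / `IGNORE_INDEX` forces the optimizer to use or not to use certain indexes. For example, `SELECT /*+ USE_INDEX(t, idx_a) */ * FROM t WHERE a = 1;`, where `t` and `idx_a` are placeholder names.

- `READ_FROM_STORAGE` forces the optimizer to read certain tables from the TiKV or TiFlash storage engine when executing a query. For example, `SELECT /*+ READ_FROM_STORAGE(TIFLASH[t]) */ * FROM t;`, where `t` is a placeholder name.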