Skip to content

Commit ae06bc8

Browse files
pavelvelikhovanton-bobkovblinkov
authored
English version of CBO documentation (#8783)
Co-authored-by: anton-bobkov <anton-bobkov@yandex-team.ru> Co-authored-by: Ivan Blinkov <ivan@ydb.tech>
1 parent 89871e4 commit ae06bc8

File tree

7 files changed

+116
-1
lines changed

7 files changed

+116
-1
lines changed
89.1 KB
Loading

ydb/docs/en/core/concepts/glossary.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -209,6 +209,10 @@ A **secret** is a sensitive piece of metadata that requires special handling. Fo
209209

210210
Like in filesystems, a **folder** or **directory** is a container for other entities. In the case of {{ ydb-short-name }}, these entities can be [tables](#table) (including [external tables](#external-table)), [topics](#topic), other folders, etc.
211211

212+
### Query optimizer {#optimizer}
213+
214+
[**Query optimizer**](https://en.wikipedia.org/wiki/Query_optimization) is a {{ ydb-short-name }} component that takes a logical plan as input and produces the most efficient physical plan with the lowest estimated resource consumption among the alternatives. The {{ ydb-short-name }} query optimizer is described in the [{#T}](optimizer.md) section.
215+
212216
## Advanced terminology {#advanced-terminology}
213217

214218
This section explains terms that are useful to [{{ ydb-short-name }} contributors](../contributor/index.md) and users who want to get a deeper understanding of what's going on inside the system.
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
# Query optimization in {{ ydb-short-name }}
2+
3+
{{ ydb-short-name }} uses two types of query optimizers: a rule-based optimizer and a cost-based optimizer. The cost-based optimizer is used for complex queries, typically analytical ([OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)), while rule-based optimization works on all queries.
4+
5+
A query plan is a graph of operations, such as reading data from a source, filtering a data stream by a predicate, or performing more complex operations such as [JOIN](../yql/reference/syntax/join.md) and [GROUP BY](../yql/reference/syntax/group_by.md). Optimizers in {{ ydb-short-name }} take an initial query plan as input and transform it into a more efficient plan that is equivalent to the initial one in terms of the returned result.
6+
7+
## Rule-based optimizer
8+
9+
A significant part of the optimizations in {{ ydb-short-name }} applies to almost any query plan, eliminating the need to analyze alternative plans and their costs. The rule-based optimizer consists of a set of heuristic rules that are applied whenever possible. For example, it is beneficial to filter out data as early as possible in the execution plan for any query. Each optimizer rule comprises a condition that triggers the rule and a rewriting logic that is executed when the plan is applied. Rules are applied iteratively as long as any rule conditions match.
10+
11+
## Cost-Based Query Optimizer
12+
13+
The cost-based optimizer is used for more complex optimizations, such as choosing an optimal join order and join algorithms. The cost-based optimizer considers a large number of alternative execution plans for each query and selects the best one based on the cost estimate for each option. Currently, this optimizer only works with plans that contain [JOIN](../yql/reference/syntax/join.md) operations. It chooses the best order for these operations and the most efficient algotithm implementation for each join operation in the plan.
14+
15+
The cost-optimizer consists of three main components:
16+
17+
* Plan enumerator
18+
* Cost estimation function
19+
* Statistics module, which is used to estimate statistics for the cost function
20+
21+
### Plan enumerator
22+
23+
The current Cost-based optimizer in {{ ydb-short-name }} enumerates all useful join trees, for which the join conditions are defined. It first builds a join hypergraph, where the nodes are tables and edges are join conditions. Depending on how the original query is written, the join hypergraph may have quite different topologies, ranging from simple chain-like graphs to complex cliques. The resulting topology of the join graph determines how many possible altenative plans need to be considered by the optimizer.
24+
25+
For example, a star is a common topology in analytical queries, where a main fact table is joined to multiple dimension tables:
26+
27+
```sql
28+
SELECT
29+
P.Brand,
30+
S.Country AS Countries,
31+
SUM(F.Units_Sold)
32+
33+
FROM Fact_Sales F
34+
INNER JOIN Dim_Date D ON (F.Date_Id = D.Id)
35+
INNER JOIN Dim_Store S ON (F.Store_Id = S.Id)
36+
INNER JOIN Dim_Product P ON (F.Product_Id = P.Id)
37+
38+
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
39+
40+
GROUP BY
41+
P.Brand,
42+
S.Country
43+
```
44+
45+
In this query graph, all `Dim...` tables are joined to the `Fact_Sales` fact table:
46+
47+
![Join graph](_assets/Star-Schema.png)
48+
49+
Common topologies also include chains and cliques. A "chain" is a topology where tables are connected to each other sequentially and each table participates in no more than one join. A "clique" is a fully connected graph where each table is connected to every other table.
50+
51+
In practice, OLAP queries often have a topology that is a combination of "star" and "chain" topologies, while complex topologies like "cliques" are very rare.
52+
53+
The topology significantly impacts the number of alternative plans that the optimizer needs to consider. Therefore, the cost-based optimizer limits the number of joins that are compared by exhaustive search, depending on the topology of the original plan. The capabilities of exact optimization in {{ ydb-short-name }} are listed in the following table:
54+
55+
| Topology | Number of supported joins |
56+
| -------- | ------------------------- |
57+
| Chain | 110 |
58+
| Star | 18 |
59+
| Clique | 15 |
60+
61+
{{ ydb-short-name }} uses a modification of the [DPHyp](https://www.researchgate.net/publication/47862092_Dynamic_Programming_Strikes_Back) algorithm to search for the best join order. DPHyp is a modern dynamic programming algorithm for query optimization that avoids enumerating unnecessary alternatives and allows you to optimize plans with `JOIN` operators, complex predicates, and even `GROUP BY` and `ORDER BY` operators.
62+
63+
### Cost estimation function
64+
65+
To compare plans, the optimizer needs to estimate their costs. The cost function estimates the time and resources required to execute an operation in {{ ydb-short-name }}. The primary parameters of the cost function are estimates of the input data size for each operator and the size of its output. These estimates are based on statistics collected from {{ ydb-short-name }} tables, along with an analysis of the plan itself.
66+
67+
### Statistics for the cost-based optimizer {#statistics}
68+
69+
The cost-based optimizer relies on table statistics and individual column statistics. {{ ydb-short-name }} collects and maintains these statistics in the background. You can manually force statistics collection using the [ANALYZE](../yql/reference/syntax/analyze.md) query.
70+
71+
The current set of table statistics includes:
72+
73+
* Number of records
74+
* Table size in bytes
75+
76+
The current set of column statistics includes:
77+
78+
* [Count-min sketch](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch)
79+
80+
### Cost optimization levels
81+
82+
In {{ ydb-short-name }}, you can configure the cost optimization level via the [CostBasedOptimizationLevel](../yql/reference/syntax/pragma.md#costbasedoptimizationlevel) pragma.

ydb/docs/en/core/concepts/toc_i.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,4 +23,6 @@ items:
2323
- name: Disk subsystem of a cluster
2424
href: cluster/distributed_storage.md
2525
- name: Federated query
26-
include: { path: federated_query/toc_p.yaml, mode: link }
26+
include: { path: federated_query/toc_p.yaml, mode: link }
27+
- name: Query optimizer
28+
href: optimizer.md

ydb/docs/en/core/yql/reference/yql-core/syntax/_includes/pragma/ydb.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,16 @@
22

33
## YDB
44

5+
### `ydb.CostBasedOptimizationLevel` {#costbasedoptimizationlevel}
6+
7+
| Level | Description |
8+
| ----- | ------------------------ |
9+
| 0 | Cost-based optimizer is disabled. |
10+
| 1 | Cost-based optimizer is disabled, but estimates are computed and available. |
11+
| 2 | Cost-based optimizer is enabled only for queries that include [column-oriented tables](../../../../../concepts/glossary.md#column-oriented-table). |
12+
| 3 | Cost-based optimizer is enabled for all queries, but `LookupJoin` is preferred for row-oriented tables. |
13+
| 4 | Cost-based optimizer is enabled for all queries. |
14+
515
### `kikimr.IsolationLevel`
616

717
| Value type | Default |
Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
# ANALYZE
2+
3+
`ANALYZE` forces the collection of statistics for [{{ ydb-short-name }} cost-based optimizer](../../../concepts/optimizer.md).
4+
5+
## Syntax
6+
7+
```yql
8+
ANALYZE <path_to_table> [ (<column_name> [, ...]) ]
9+
```
10+
11+
This command forces the synchronous collection of table statistics and column statistics for the specified columns or for all columns if none are specified. `ANALYZE` returns once all the requested statistics have been collected and are up to date.
12+
13+
* `path_to_table` — the path to the table for which statistics should be collected.
14+
* `column_name` — collect column statistics only for the specified columns of the table.
15+
16+
The current set of statistics is described in [{#T}](../../../concepts/optimizer.md#statistics).

ydb/docs/en/core/yql/reference/yql-core/syntax/toc_i.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ items:
88
- { name: ALTER VIEW, href: alter-view.md, when: feature_view }
99
- { name: ALTER TOPIC, href: alter_topic.md, when: feature_topic_control_plane }
1010
- { name: ALTER USER, href: alter-user.md, when: feature_user_and_group }
11+
- { name: ANALYZE, href: analyze.md, when: backend_name == "YDB" }
1112
- { name: CREATE GROUP, href: create-group.md, when: feature_user_and_group }
1213
- { name: CREATE TABLE, include: { mode: link, path: create_table/toc_p.yaml } }
1314
- { name: CREATE VIEW, href: create-view.md, when: feature_view }

0 commit comments

Comments
 (0)