Skip to content

Commit

Permalink
[doc](nereids) Optimize query rewerite by materialzied view doc (apac…
Browse files Browse the repository at this point in the history
…he#31420)

* [doc](nereids) Optimize the query rewrite by materialized view doc

* [doc](nereids) add more
  • Loading branch information
seawinde authored Feb 27, 2024
1 parent b2e869c commit 1b15db3
Show file tree
Hide file tree
Showing 4 changed files with 144 additions and 45 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -26,10 +26,13 @@ under the License.

## Overview

Doris's asynchronous materialized views employ a structure based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern
for transparent rewriting algorithms. Doris can analyze the structural information of the query SQL,
automatically identify suitable materialized views, and attempt transparent rewriting by expressing the
query SQL using the materialized views. By utilizing precomputed materialized view results,
Doris's asynchronous materialized views employ an algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern
structure information for transparent rewriting.

Doris can analyze the structural information of query SQL, automatically search for suitable materialized views,
and attempt transparent rewriting, utilizing the optimal materialized view to express the query SQL.

By utilizing precomputed materialized view results,
significant improvements in query performance and a reduction in computational costs can be achieved.

Using the three tables: lineitem, orders, and partsupp from TPC-H, let's describe the capability of directly querying
Expand Down Expand Up @@ -127,11 +130,13 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31';
## Transparent Rewriting Capability
### Join rewriting

JOIN rewriting refers to the ability to transparently rewrite a query when the tables used in the query and
the materialized view are the same. This rewriting can occur either by joining the materialized view
and the query inside the JOIN clause or by placing conditions in the WHERE clause outside of the JOIN.
Additionally, under certain conditions, when the types of JOINs in the query and the materialized view do not match,
rewriting can still take place.

Join rewriting refers to when the tables used in the query and the materialization are the same.
In this case, the optimizer will attempt transparent rewriting by either joining the input of the materialized
view with the query or placing the join in the outer layer of the query's WHERE clause.

This pattern of rewriting is supported for multi-table joins and supports inner and left join types.
Support for other types is continually expanding.

**Case 1:**

Expand Down Expand Up @@ -160,9 +165,10 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31';

**Case 2:**

JOIN Derivation (Coming soon)
When the types of JOINs in the query and the materialized view do not match, but the materialized view can provide
all the data required for the query, transparent rewriting can also occur by compensating predicates above the JOIN.
JOIN Derivation occurs when the join type between the query and the materialized view does not match.
In cases where the materialization can provide all the necessary data for the query, transparent rewriting can
still be achieved by compensating predicates outside the join through predicate push down.

For example:

Materialized view definition:
Expand Down Expand Up @@ -201,6 +207,11 @@ o_orderdate;
```

### Aggregate rewriting
In the definitions of both the query and the materialized view, the aggregated dimensions can either be consistent or inconsistent.
Filtering of results can be achieved by using fields from the dimensions in the WHERE clause.

The dimensions used in the materialized view need to encompass those used in the query,
and the metrics utilized in the query can be expressed using the metrics of the materialized view.

**Case 1**

Expand Down Expand Up @@ -245,8 +256,10 @@ o_comment;

**Case 2**

The following query can undergo transparent rewriting. The query and the materialized view use inconsistent
dimensions for aggregation, where the dimensions used by the materialized view include those used by the query.
The following query can be transparently rewritten: the query and the materialization use aggregated dimensions
that are inconsistent, but the dimensions used in the materialized view encompass those used in the query.
The query can filter results using fields from the dimensions.

The query will attempt to roll up using the functions after SELECT, such as the materialized view's
bitmap_union will eventually roll up into bitmap_union_count, maintaining consistency with the semantics of
the count(distinct) in the query.
Expand Down Expand Up @@ -377,17 +390,53 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10;
## Auxiliary Functions
**Data Consistency Issues After Transparent Rewriting**


The unit of `grace_period` is seconds, referring to the permissible time for inconsistency between the materialized
view and the data in the underlying base tables.

For example, setting `grace_period` to 0 means requiring the materialized view to be consistent with the base
table data before it can be used for transparent rewriting. As for external tables,
since changes in data cannot be perceived, the materialized view is used with them.
Regardless of whether the data in the external table is up-to-date or not, this materialized view can be used for
transparent rewriting. If the external table is configured with an HMS metadata source,
it becomes capable of perceiving data changes. Configuring the metadata source and enabling data change
perception functionality will be supported in subsequent iterations.

Setting `grace_period` to 10 means allowing a 10-second delay between the data in the materialized view and
the data in the base tables. If there is a delay of up to 10 seconds between the data in the materialized
view and the data in the base tables, the materialized view can still be used for transparent rewriting within
that time frame.

For internal tables in the materialized view, you can control the maximum delay allowed for the data used by
the transparent rewriting by setting the grace_period property.
the transparent rewriting by setting the `grace_period` property.
Refer to [CREATE-ASYNC-MATERIALIZED-VIEW](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md)

**Viewing and Debugging Transparent Rewrite Hit Information**

You can use the following statements to view the hit information of transparent rewriting for a materialized view. It will display a concise overview of the transparent rewriting process.
You can use the following statements to view the hit information of transparent rewriting for a materialized view.
It will display a concise overview of the transparent rewriting process.

`explain <query_sql>` The information returned is as follows, with the relevant information pertaining to materialized views extracted:
```text
| MaterializedView |
| MaterializedViewRewriteFail: |
| MaterializedViewRewriteSuccessButNotChose: |
| Names: |
| MaterializedViewRewriteSuccessAndChose: |
| Names: mv1
```

**MaterializedViewRewriteFail**: Lists transparent rewrite failures and summarizes the reasons.

**MaterializedViewRewriteSuccessButNotChose**: Transparent rewrite succeeded, but the final CBO did not choose the
materialized view names list.

**MaterializedViewRewriteSuccessAndChose**: Transparent rewrite succeeded, and the materialized view names list
chosen by the CBO.

`explain <query_sql>`

If you want to know the detailed information about materialized view candidates, rewriting, and the final selection process, you can execute the following statement. It will provide a detailed breakdown of the transparent rewriting process.
If you want to know the detailed information about materialized view candidates, rewriting, and the final selection process,
you can execute the following statement. It will provide a detailed breakdown of the transparent rewriting process.

`explain memo plan <query_sql>`

Expand All @@ -401,14 +450,20 @@ If you want to know the detailed information about materialized view candidates,


## Limitations
- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY statements, and
the input to JOIN cannot contain GROUP BY. Only INNER and LEFT OUTER JOIN types are currently supported; other
types of JOIN operations will be supported gradually.
- Materialized views based on External Tables do not guarantee strong consistency of query results.
- No support for rewriting non-deterministic functions, including rand, now, current_time, current_date, random, uuid, etc.
- No support for rewriting window functions.
- The definition of materialized views currently cannot use views and other materialized views.
- Currently, WHERE condition compensation supports cases where the materialized view has no WHERE clause, and
the query has a WHERE clause; or the materialized view has a WHERE clause, and the query's WHERE condition is a
superset of the materialized view's. Currently, range condition compensation is not yet supported,
such as the materialized view definition being a > 5, and the query being a > 10.
- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY clauses.
The input for JOIN can include simple GROUP BY (aggregation on a single table).
Supported types of JOIN operations include INNER and LEFT OUTER JOIN.
Support for other types of JOIN operations will be gradually added.

- Materialized views based on External Tables do not guarantee strong consistency in query results.

- The use of non-deterministic functions to build materialized views is not supported,
including rand, now, current_time, current_date, random, uuid, etc.

- Transparent rewriting does not support window functions and LIMIT.

- Currently, materialized view definitions cannot utilize views or other materialized views.

- Currently, WHERE clause compensation supports scenarios where the materialized view does not have a WHERE clause,
but the query does, or where the materialized view has a WHERE clause and the query's WHERE clause is a superset
of the materialized view's. Range condition compensation is not yet supported but will be added gradually.
Original file line number Diff line number Diff line change
Expand Up @@ -165,6 +165,18 @@ There are two types of partitioning methods for materialized views. If no partit
For example, if the base table is a range partition with a partition field of `create_time` and partitioning by day, and `partition by(ct) as select create_time as ct from t1` is specified when creating a materialized view,
then the materialized view will also be a range partition with a partition field of 'ct' and partitioning by day

The selection of partition fields and the definition of materialized views must meet the following constraints to be successfully created;
otherwise, an error "Unable to find a suitable base table for partitioning" will occur:

- At least one of the base tables used by the materialized view must be a partitioned table.
- Partitioned tables used by the materialized view must employ list or range partitioning strategies.
- The top-level partition column in the materialized view can only have one partition field.
- The SQL of the materialized view needs to use partition columns from the base table.
- If GROUP BY is used, the partition column fields must be after the GROUP BY.
- If window functions are used, the partition column fields must be after the PARTITION BY.
- Data changes should occur on partitioned tables. If they occur on non-partitioned tables, the materialized view needs to be fully rebuilt.
- Using the fields that generate nulls in the JOIN as partition fields in the materialized view prohibits partition incremental updates.

#### property
The materialized view can specify both the properties of the table and the properties unique to the materialized view.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,9 @@ under the License.
-->

## 概述
Doris 的异步物化视图采用了基于 SPJG(SELECT-PROJECT-JOIN-GROUP-BY)模式的结构信息来进行透明改写的算法
Doris 的异步物化视图采用了基于 SPJG(SELECT-PROJECT-JOIN-GROUP-BY)模式结构信息来进行透明改写的算法

Doris 可以分析查询 SQL 的结构信息,自动寻找满足要求的物化视图,并尝试进行透明改写,使用物化视图来表达查询SQL
Doris 可以分析查询 SQL 的结构信息,自动寻找满足要求的物化视图,并尝试进行透明改写,使用最优的物化视图来表达查询SQL

通过使用预计算的物化视图结果,可以大幅提高查询性能,减少计算成本。

Expand Down Expand Up @@ -126,9 +126,9 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31';

## 透明改写能力
### JOIN 改写
JOIN 改写指的是查询和物化使用的表相同,可以在物化视图和查询 JOIN 的内部输入或者 JOIN 的外部写 WHERE,可以进行改写
Join 改写指的是查询和物化使用的表相同,可以在物化视图和查询 Join 的输入或者 Join 的外层写 where,优化器对此 pattern 的查询会尝试进行透明改写

当查询和物化视图的 Join 的类型不同时,满足一定条件时,也可以进行改写
支持多表 Join,支持 Join 的类型为 inner,left。其他类型在不断拓展中

**用例1:**

Expand All @@ -155,9 +155,9 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31';

**用例2:**

JOIN衍生(Coming soon)
当查询和物化视图的 JOIN 的类型不一致时,但物化可以提供查询所需的所有数据时,通过在 JOIN 的外部补偿谓词,也可以进行透明改写,
举例如下,待支持。
JOIN衍生,当查询和物化视图的 JOIN 的类型不一致时,如果物化可以提供查询所需的所有数据时,通过在 JOIN 的外部补偿谓词,也可以进行透明改写,

举例如下

mv 定义:
```sql
Expand Down Expand Up @@ -195,6 +195,9 @@ o_orderdate;
```

### 聚合改写
查询和物化视图定义中,聚合的维度可以一致或者不一致,可以使用维度中的字段写 WHERE 对结果进行过滤。

物化视图使用的维度需要包含查询的维度,并且查询使用的指标可以使用物化视图的指标来表示。

**用例1**

Expand Down Expand Up @@ -236,11 +239,10 @@ o_comment;

**用例2**

如下查询可以进行透明改写,查询和物化使用聚合的维度不一致,物化视图使用的维度包含查询的维度。 可以使用维度中的字段进行过滤结果
如下查询可以进行透明改写,查询和物化使用聚合的维度不一致,物化视图使用的维度包含查询的维度。 查询可以使用维度中的字段对结果进行过滤

查询会尝试使用物化视图 SELECT 后的函数进行上卷,如物化视图的 `bitmap_union` 最后会上卷成 `bitmap_union_count`,和查询中

`count(distinct)` 的语义 保持一致。
`count(distinct)` 的语义保持一致。

mv 定义:
```sql
Expand Down Expand Up @@ -360,14 +362,34 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10;
## 辅助功能
**透明改写后数据一致性问题**

对于物化视图中的内表,可以通过设定 `grace_period`属性来控制透明改写使用的物化视图所允许数据最大的延迟时间。
`grace_period` 的单位是秒,指的是容许物化视图和所用基表数据不一致的时间。
比如 `grace_period` 设置成0,意味要求物化视图和基表数据保持一致,此物化视图才可用于透明改写;对于外表,因为无法感知数据变更,所以物化视图使用了外表,

无论外表的数据是不是最新的,都可以使用此物化视图用于透明改写,如果外表配置了 HMS 元数据源,是可以感知数据变更的,配置数据源和感知数据变更的功能会在后面迭代支持。

如果设置成10,意味物化视图和基表数据允许10s的延迟,如果物化视图的数据和基表的数据有延迟,如果在10s内,此物化视图都可以用于透明改写。

对于物化视图中的内表,可以通过设定 `grace_period` 属性来控制透明改写使用的物化视图所允许数据最大的延迟时间。
可查看 [CREATE-ASYNC-MATERIALIZED-VIEW](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md)

**查询透明改写命中情况查看和调试**

可通过如下语句查看物化视图的透明改写命中情况,会展示查询透明改写简要过程信息。

`explain <query_sql>`
`explain <query_sql>` 返回的信息如下,截取了物化视图相关的信息
```text
| MaterializedView |
| MaterializedViewRewriteFail: |
| MaterializedViewRewriteSuccessButNotChose: |
| Names: |
| MaterializedViewRewriteSuccessAndChose: |
| Names: mv1
```
**MaterializedViewRewriteFail**:列举透明改写失败及原因摘要。

**MaterializedViewRewriteSuccessButNotChose**:透明改写成功,但是最终CBO没有选择的物化视图名称列表。

**MaterializedViewRewriteSuccessAndChose**:透明改写成功,并且CBO选择的物化视图名称列表。

如果想知道物化视图候选,改写和最终选择情况的过程详细信息,可以执行如下语句,会展示透明改写过程详细的信息。

Expand All @@ -383,11 +405,11 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10;


## 限制
- 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,并且 JOIN 的输入不能包含 GROUP BY,其中JOIN的支持的类型为
- 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,JOIN 的输入可以包含简单的 GROUP BY(单表聚合),其中JOIN的支持的类型为
INNER 和 LEFT OUTER JOIN 其他类型的 JOIN 操作逐步支持。
- 基于 External Table 的物化视图不保证查询结果强一致。
- 不支持非确定性函数的改写,包括 rand、now、current_time、current_date、random、uuid等。
- 不支持窗口函数的改写
- 不支持使用非确定性函数来构建物化视图,包括 rand、now、current_time、current_date、random、uuid等。
- 不支持窗口函数和 LIMIT 的透明改写
- 物化视图的定义暂时不能使用视图和物化视图。
- 目前 WHERE 条件补偿,支持物化视图没有 WHERE,查询有 WHERE情况的条件补偿;或者物化视图有 WHERE 且查询的 WHERE 条件是物化视图的超集。
目前暂时还不支持,范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10。
目前暂时还不支持范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10,逐步支持
Loading

0 comments on commit 1b15db3

Please sign in to comment.