[Fix](Join) Fix the bug of outer join function under vectorization #9954

EmmyMiao87 · 2022-06-04T14:19:04Z

Proposed changes

Problem Summary

This PR mainly fixes the following problems:

1. Queries that combine an inline view with an outer join. After a tuple was added to the join operator, the position at which the tupleisnull function is evaluated became inconsistent with the row storage. The vectorized tupleisnull is now calculated in the HashJoinNode.computeOutputTuple() function.
2. Incorrect column nullable property. At present, once an outer join occurs, the columns on the null side are planned as nullable in the semantic parsing stage.

About 1:

For example:

select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1

Here, the nullable property of column k1 in the tmp inline view should be true.

In the vectorized code, the virtual tableRef of tmp is used to construct the output tuple of HashJoinNode (specifically, in the function HashJoinNode.computeOutputTuple()), so the column nullable property of this tableRef must be correct.
In the case above, tmp is right-joined with b, which makes tmp the null side, so the column attributes involved in tmp must be changed to nullable.

The non-vectorized code does not use the virtual tableRef tmp at all; it relies on the TupleIsNull function in outputsmp to ensure data correctness.
That is, column a of the original table test stays non-null, and this does not affect the correctness of the result.
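The null-side rule described above can be sketched in a few lines of Java. This is only a toy model: `NullSideSketch`, `Slot`, `JoinOp`, and `markNullSide` are hypothetical names and not the Doris FE classes, and `b.k1` is assumed to be declared non-null for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the null-side rule; all names here are hypothetical, not Doris FE APIs. */
final class NullSideSketch {
    enum JoinOp { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    static final class Slot {
        final String name;
        boolean nullable;

        Slot(String name, boolean nullable) {
            this.name = name;
            this.nullable = nullable;
        }
    }

    /** Marks the slots of every null-producing side of the join as nullable. */
    static void markNullSide(JoinOp op, List<Slot> left, List<Slot> right) {
        // RIGHT/FULL OUTER: unmatched right rows are padded with NULLs on the left side.
        if (op == JoinOp.RIGHT_OUTER || op == JoinOp.FULL_OUTER) {
            for (Slot s : left) {
                s.nullable = true;
            }
        }
        // LEFT/FULL OUTER: unmatched left rows are padded with NULLs on the right side.
        if (op == JoinOp.LEFT_OUTER || op == JoinOp.FULL_OUTER) {
            for (Slot s : right) {
                s.nullable = true;
            }
        }
    }

    /** Replays the example: (select a as k1 from test) tmp right join b. */
    static String rightJoinExample() {
        List<Slot> tmp = new ArrayList<>();
        tmp.add(new Slot("tmp.k1", false)); // column a of test is non-null
        List<Slot> b = new ArrayList<>();
        b.add(new Slot("b.k1", false));     // assumption: b.k1 is non-null
        markNullSide(JoinOp.RIGHT_OUTER, tmp, b);
        return tmp.get(0).name + " nullable=" + tmp.get(0).nullable
                + ", " + b.get(0).name + " nullable=" + b.get(0).nullable;
    }

    public static void main(String[] args) {
        System.out.println(rightJoinExample());
    }
}
```

In this model only the slots of tmp flip to nullable, which matches the expectation stated above for column k1 of the inline view.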

About 2:

The vectorized engine has very strict requirements on the nullable attribute.
An outer join changes the nullable attribute of the join columns, which in turn changes the nullable attribute of the columns in each upper operator, layer by layer.
However, FE has no mechanism to modify the nullable attribute in the upper operator tuples layer by layer after the analyzer has run.
So for now, we can only preset the attributes before the lower join as nullable in advance, in the analyzer stage, to avoid the problem.
(At the same time, BE also contains some evasive code to deal with the null to non-null problem.)
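Why the attribute has to be preset can be illustrated with a small Java sketch. It assumes, purely for illustration, that an upper operator copies the nullability of its input slot once, when its own slot is created; `PresetNullableSketch`, `Slot`, and `derive` are hypothetical names and do not correspond to FE code.

```java
/** Toy model of nullable propagation; all names are hypothetical, not Doris FE APIs. */
final class PresetNullableSketch {
    static final class Slot {
        final String name;
        boolean nullable;

        Slot(String name, boolean nullable) {
            this.name = name;
            this.nullable = nullable;
        }
    }

    /** An upper operator derives its slot and copies nullability at creation time only. */
    static Slot derive(Slot lower) {
        return new Slot(lower.name + "'", lower.nullable);
    }

    /** The lower slot is marked nullable after the upper slot already exists. */
    static boolean upperNullableWhenMarkedLate() {
        Slot lower = new Slot("a", false);
        Slot upper = derive(lower);
        lower.nullable = true;  // too late: nothing revisits the upper slot
        return upper.nullable;
    }

    /** The lower slot is preset as nullable before any upper slot is derived. */
    static boolean upperNullableWhenPreset() {
        Slot lower = new Slot("a", false);
        lower.nullable = true;  // preset in advance
        Slot upper = derive(lower);
        return upper.nullable;
    }

    public static void main(String[] args) {
        System.out.println("marked late: " + upperNullableWhenMarkedLate());
        System.out.println("preset:      " + upperNullableWhenPreset());
    }
}
```

In the first case the upper slot keeps a stale non-null attribute, which is the kind of mismatch the strict vectorized checks would reject; presetting avoids it without needing a layer-by-layer update pass.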

Checklist(Required)

Does it affect the original behavior: (Yes/No/I Don't know)
Has unit tests been added: (Yes/No/No Need)
Has document been added or modified: (Yes/No/No Need)
Does it need to update dependencies: (Yes/No)
Are there any changes that cannot be rolled back: (Yes/No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

morningman · 2022-06-04T15:23:03Z

Could you add some ut for this?

HappenLee · 2022-06-13T04:18:49Z

gensrc/thrift/PlanNodes.thrift

@@ -410,6 +410,10 @@ struct THashJoinNode {

  // hash output column
  6: optional list<Types.TSlotId> hash_output_slot_ids
+
+  7: optional list<Exprs.TExpr> srcExprList


use src_expr_list

morrySnow · 2022-06-20T10:59:55Z

fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java

@@ -893,7 +893,6 @@ private void filterDeletedRows(Analyzer analyzer) throws AnalysisException {
            SlotRef deleteSignSlot = new SlotRef(desc.getAliasAsName(), Column.DELETE_SIGN);
            deleteSignSlot.analyze(analyzer);
            deleteSignSlot.getDesc().setIsMaterialized(true);
-            deleteSignSlot.getDesc().setIsNullable(analyzer.isOuterMaterializedJoined(desc.getId()));


we also need to remove setIsNullable in Analyzer#registerColumnRef

morrySnow · 2022-06-20T11:04:14Z

gensrc/thrift/PlanNodes.thrift

+
+  7: optional list<Exprs.TExpr> srcExprList
+
+  8: optional Types.TTupleId voutput_tuple_id


move it to TPlanNode

morningman

LGTM

github-actions · 2022-06-24T11:10:55Z

PR approved by at least one committer and no changes requested.

github-actions · 2022-06-24T11:10:57Z

PR approved by anyone and no changes requested.

morningman

hold for a second, there is still some problem

EmmyMiao87 · 2022-06-27T04:01:59Z

The contents of this PR have been merged into the new PR #10437 and will be submitted together.

Hash join node adds three new attributes. The following SQL is used as an example to illustrate the meaning of these three attributes:

```
select t1.a from t1 left join t2 on t1.a=t2.b;
```

1. vOutputTupleDesc: Tuple2(a'')
2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>)
3. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')>

The slots in the intermediate tuple correspond to the slots in the output tuple one by one, through the evaluation of the left-child exprs in vSrcToOutputSMap.

This code mainly merges the contents of two PRs:

1. [fix](vectorized) Support outer join for vectorized exec engine (apache#10323)
2. [Fix](Join) Fix the bug of outer join function under vectorization apache#9954

The following is the description of the first PR.

In a vectorized scenario, the query plan generates a new tuple for the join node. This tuple mainly describes the output schema of the join node. Adding it mainly solves the problem that the input schema of the join node differs from its output schema. For example:

1. The case where a null-side column is converted to nullable because of the outer join.
2. The projection of the outer tuple.

The following is the description of the second PR.

This PR mainly fixes the following problems:

1. Queries that combine an inline view with an outer join. After a tuple was added to the join operator, the position at which the `tupleisnull` function is evaluated became inconsistent with the row storage. The vectorized `tupleisnull` is now calculated in the HashJoinNode.computeOutputTuple() function.
2. Incorrect column nullable property. At present, once an outer join occurs, the columns on the null side are planned as nullable in the semantic parsing stage.

For example:

```
select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1
```

Here, the nullable property of column k1 in the `tmp` inline view should be true.

In the vectorized code, the virtual `tableRef` of tmp is used to construct the output tuple of HashJoinNode (specifically, in the function HashJoinNode.computeOutputTuple()), so the **correctness** of the column nullable property of this tableRef is very important. In the case above, tmp is right-joined with b, which makes tmp the null side, so the column attributes involved in tmp must be changed to nullable.

The non-vectorized code does not use the virtual tableRef tmp at all; it relies on the `TupleIsNull` function in `outputsmp` to ensure data correctness. That is, column a of the original table test stays non-null, and this does not affect the correctness of the result.

The vectorized engine has very strict requirements on the nullable attribute. An outer join changes the nullable attribute of the join columns, which in turn changes the nullable attribute of the columns in each upper operator, layer by layer. However, FE has no mechanism to modify the nullable attribute in the upper operator tuples layer by layer after the analyzer has run. So for now, we can only preset the attributes before the lower join as nullable in advance, in the analyzer stage, to avoid the problem. (At the same time, BE also contains some evasive code to deal with the null to non-null problem.)

Co-authored-by: EmmyMiao87
Co-authored-by: HappenLee
Co-authored-by: morrySnow
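The relationship between the three hash join attributes described in this commit message can be sketched as follows. This is a simplified illustration, not the FE or BE implementation: `JoinOutputTupleSketch` and its methods are hypothetical, rows are modeled as plain maps, and each left-hand expr of the slot map is reduced to a direct slot reference.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the join output tuple; all names are hypothetical, not Doris APIs. */
final class JoinOutputTupleSketch {
    /** Stand-in for vSrcToOutputSMap: intermediate slot a' maps to output slot a''. */
    static Map<String, String> srcToOutput() {
        Map<String, String> smap = new LinkedHashMap<>();
        smap.put("a'", "a''");
        return smap;
    }

    /** Builds an output-tuple row (Tuple2) from an intermediate-tuple row (Tuple1). */
    static Map<String, Object> toOutputRow(Map<String, Object> intermediateRow) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : srcToOutput().entrySet()) {
            // Simplification: the source expr is just a lookup of the intermediate slot.
            out.put(e.getValue(), intermediateRow.get(e.getKey()));
        }
        return out;
    }

    /** A left row of t1 with no match in t2: b' is NULL in the intermediate tuple. */
    static String unmatchedLeftRowExample() {
        Map<String, Object> intermediate = new HashMap<>();
        intermediate.put("a'", 1);
        intermediate.put("b'", null); // nullable because t2 is the null side
        return toOutputRow(intermediate).toString();
    }

    public static void main(String[] args) {
        System.out.println(unmatchedLeftRowExample());
    }
}
```

The intermediate tuple carries both a' and the nullable b', while the output tuple keeps only a'', which is the projection described for `select t1.a from t1 left join t2 on t1.a=t2.b`.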


github-actions bot added the area/planner Issues or PRs related to the query planner label Jun 4, 2022

EmmyMiao87 added the area/vectorization label Jun 4, 2022

morningman added this to the v1.1 milestone Jun 4, 2022

morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Jun 4, 2022

HappenLee reviewed Jun 13, 2022

EmmyMiao87 force-pushed the outerjoin_tuple branch 3 times, most recently from a184356 to 12ce8c4 Compare June 20, 2022 09:49

morrySnow reviewed Jun 20, 2022

EmmyMiao87 force-pushed the outerjoin_tuple branch 2 times, most recently from 5d367fb to 4bf85f1 Compare June 21, 2022 07:59

morningman closed this Jun 22, 2022

morningman removed the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Jun 22, 2022

EmmyMiao87 reopened this Jun 23, 2022

EmmyMiao87 force-pushed the outerjoin_tuple branch 2 times, most recently from cf8fb62 to 1c4f4b4 Compare June 24, 2022 04:59

EmmyMiao87 changed the title ~~[Enhancement](Join) Construct output tuple in join node~~ [Fix](Join) Solve the problem of incorrectly evaluating column null properties when combining inline view and outer join. Jun 24, 2022

morningman approved these changes Jun 24, 2022

morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Jun 24, 2022

github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 24, 2022

github-actions bot added the reviewed label Jun 24, 2022

morningman requested changes Jun 26, 2022

github-actions bot removed the approved Indicates a PR has been approved by one committer. label Jun 26, 2022

EmmyMiao87 and others added 2 commits June 27, 2022 10:54

fix inline view

a03239c

[Bug] BE prevent core by nullable not suit in hash join node

0ff70cc

EmmyMiao87 force-pushed the outerjoin_tuple branch from ca5d702 to 0ff70cc Compare June 27, 2022 02:57

EmmyMiao87 changed the title ~~[Fix](Join) Solve the problem of incorrectly evaluating column null properties when combining inline view and outer join.~~ [Fix](Join) Fix the bug of outer join function under vectorization Jun 27, 2022

EmmyMiao87 mentioned this pull request Jun 27, 2022

[fix](vectorized) Support outer join for vectorized exec engine #10437

Closed

EmmyMiao87 closed this Jun 27, 2022

morningman removed the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Jun 29, 2022

EmmyMiao87 mentioned this pull request Jul 19, 2022

[enhancement](vec) Support outer join for vectorized exec engine #11005

Closed

morningman mentioned this pull request Jul 21, 2022

[enhancement](vec) Support outer join for vectorized exec engine #11068

Merged
