Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

put subquery's equal clause into join on clauses instead of filter cl… #3862

Merged
merged 2 commits into from
Oct 19, 2022

Conversation

HuSen8891
Copy link
Contributor

…auses

Which issue does this PR close?

Closes #3789

Rationale for this change

move subquery's equal clause into join on clauses instead of filter clauses.

What changes are included in this PR?

refine existing rule in datafusion/optimizer/src/scalar_subquery_to_join.rs to optimize subquery with equal clause to inner join.

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Oct 17, 2022
Aggregate: groupBy=[[]], aggr=[[MAX(orders.o_custkey)]] [MAX(orders.o_custkey):Int64;N]
Filter: orders.o_custkey = orders.o_custkey [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]"#;
Inner Join: customer.c_custkey = __sq_1.__value [c_custkey:Int64, c_name:Utf8, __value:Int64;N]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably supporting this via #3781 would be better? Not sure if inner join is faster here than a cross join (with scalar) + filter?

Inner Join: part.p_partkey = partsupp.ps_partkey
Filter: part.p_size = Int32(15) AND part.p_type LIKE Utf8("%BRASS")
TableScan: part projection=[p_partkey, p_mfgr, p_type, p_size]
Inner Join: part.p_partkey = __sq_1.ps_partkey, partsupp.ps_supplycost = __sq_1.__value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@Dandandan
Copy link
Contributor

Great result @AssHero

I think we'll have to do two things

  • make sure q2 is really faster (probably: yes)
  • I think uncorrelated subqueries might be slower after the rewrite, can we avoid doing this for them?

@andygrove
Copy link
Member

@avantgardnerio may be interested in reviewing this too

@andygrove
Copy link
Member

I just tested q2 @ sf=10 with this change and do not see a speedup unfortunately:

master

Query 2 iteration 0 took 42860.6 ms and returned 47107 rows

this PR

Query 2 iteration 0 took 43917.6 ms and returned 47107 rows

@Dandandan
Copy link
Contributor

Dandandan commented Oct 17, 2022

Looking at the q2 join reveals that there is not much benefit from the optimization for this query, the output sizes is already pretty small for this join (no join that "blows up"):

          HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "p_partkey", index: 0 }, Column { name: "ps_partkey", index: 0 })], 
               metrics=[output_rows=6351, input_batches=16, output_batches=16, input_rows=1183098, join_time=10.810036ms]

I think however, adding it to the join for correlated subqueries is still a "safer choice"

@HuSen8891
Copy link
Contributor Author

Great result @AssHero

I think we'll have to do two things

  • make sure q2 is really faster (probably: yes)
  • I think uncorrelated subqueries might be slower after the rewrite, can we avoid doing this for them?

We can just do this optimization for correlated subqueries. For uncorrelated subqueries, filter clause may be better than inner join.

@HuSen8891
Copy link
Contributor Author

Currently,we only put equal clause into join on clause for correlated subqueries.

@avantgardnerio
Copy link
Contributor

@avantgardnerio may be interested in reviewing this too

Seems ok 👍

@Dandandan Dandandan merged commit 13addce into apache:master Oct 19, 2022
@Dandandan
Copy link
Contributor

Thanks @AssHero

@HuSen8891 HuSen8891 deleted the subquery_optimize branch October 24, 2022 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize benchmark q2 subquery filter
4 participants