put subquery's equal clause into join on clauses instead of filter cl… #3862
Aggregate: groupBy=[[]], aggr=[[MAX(orders.o_custkey)]] [MAX(orders.o_custkey):Int64;N]
Filter: orders.o_custkey = orders.o_custkey [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]"#;
Inner Join: customer.c_custkey = __sq_1.__value [c_custkey:Int64, c_name:Utf8, __value:Int64;N]
Supporting this via #3781 would probably be better? I am not sure an inner join is faster here than a cross join (with a scalar) plus a filter.
Inner Join: part.p_partkey = partsupp.ps_partkey
Filter: part.p_size = Int32(15) AND part.p_type LIKE Utf8("%BRASS")
TableScan: part projection=[p_partkey, p_mfgr, p_type, p_size]
Inner Join: part.p_partkey = __sq_1.ps_partkey, partsupp.ps_supplycost = __sq_1.__value
💯
Great result @AssHero I think we'll have to do two things
@avantgardnerio may be interested in reviewing this too
I just tested q2 @ sf=10 with this change and do not see a speedup unfortunately:
master
this PR
Looking at the q2 join reveals that there is not much benefit from the optimization for this query; the output size is already pretty small for this join (no join that "blows up"):
I think however, adding it to the join for correlated subqueries is still a "safer choice"
We can do this optimization just for correlated subqueries. For uncorrelated subqueries, a filter clause may be better than an inner join.
Currently, we only put the equal clause into the join ON clause for correlated subqueries.
Seems ok 👍
Thanks @AssHero
…auses
Which issue does this PR close?
Closes #3789
Rationale for this change
Move the subquery's equal clause into the join ON clauses instead of the filter clauses.
What changes are included in this PR?
Refine the existing rule in datafusion/optimizer/src/scalar_subquery_to_join.rs to optimize a subquery with an equal clause into an inner join.
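The decision discussed in this PR can be illustrated with a small standalone sketch. It is not the code in scalar_subquery_to_join.rs and does not use any DataFusion types; `Op`, `Comparison`, `JoinPlan`, and `plan_subquery_join` are hypothetical names introduced only for illustration. The sketch assumes the correlated predicates of the subquery have already been extracted as join key pairs, and it only decides where the comparison against the subquery's value goes: into the join ON list when the subquery is correlated and the comparison is an equality, otherwise into a filter.

```rust
// Toy model only: these types are assumptions for illustration,
// not DataFusion's LogicalPlan or Expr.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Eq,
    Lt,
}

// A comparison between an outer column and the subquery's output column.
#[derive(Debug, Clone, PartialEq)]
struct Comparison {
    left: String,
    op: Op,
    right: String,
}

#[derive(Debug, PartialEq)]
struct JoinPlan {
    // Equi-join key pairs (the "Inner Join: a = b, c = d" part of a plan).
    on: Vec<(String, String)>,
    // Predicates that stay in a Filter above the join.
    filter: Vec<Comparison>,
}

// `correlated_keys` holds the key pairs pulled out of the subquery's
// correlated predicates; an empty list means the subquery is uncorrelated.
fn plan_subquery_join(
    correlated_keys: Vec<(String, String)>,
    subquery_cmp: Comparison,
) -> JoinPlan {
    let is_correlated = !correlated_keys.is_empty();
    let mut on = correlated_keys;
    let mut filter = Vec::new();

    if is_correlated && subquery_cmp.op == Op::Eq {
        // Correlated + equality: add the comparison as another join key.
        on.push((subquery_cmp.left, subquery_cmp.right));
    } else {
        // Uncorrelated, or not an equality: keep it as a filter.
        filter.push(subquery_cmp);
    }
    JoinPlan { on, filter }
}

fn main() {
    // Shape of the q2 plan fragment quoted in the review.
    let correlated = plan_subquery_join(
        vec![("part.p_partkey".to_string(), "__sq_1.ps_partkey".to_string())],
        Comparison {
            left: "partsupp.ps_supplycost".to_string(),
            op: Op::Eq,
            right: "__sq_1.__value".to_string(),
        },
    );
    println!("{:?}", correlated);

    // A non-equality comparison stays in the filter even when correlated.
    let non_eq = plan_subquery_join(
        vec![("part.p_partkey".to_string(), "__sq_1.ps_partkey".to_string())],
        Comparison {
            left: "partsupp.ps_supplycost".to_string(),
            op: Op::Lt,
            right: "__sq_1.__value".to_string(),
        },
    );
    println!("{:?}", non_eq);
}
```

For the correlated q2 case the sketch yields two join keys and no filter, matching the `Inner Join: part.p_partkey = __sq_1.ps_partkey, partsupp.ps_supplycost = __sq_1.__value` line above. For an uncorrelated subquery the comparison stays in the filter, in line with the conclusion of the review thread.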