-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Eliminate Self Joins #16023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Eliminate Self Joins #16023
Conversation
e2b09c9
to
1de198c
Compare
FYI @irenjj |
1de198c
to
baf02cf
Compare
2732b86
to
7835159
Compare
cc8ca4e
to
5c1f098
Compare
Current implementation fails in call to SELECT
b.id
FROM
employees a
JOIN employees b USING (id)
WHERE
b.department = 'HR' Since alias |
ab215e5
to
14cbb2a
Compare
This change will allow testing queries with logical dependencies such as unique or primary constraints.
`EliminateSelfJoin` optimization can remove subquery aliases when `TableScan`s are merged. This change updates the optimization rule to run top down fashion so that when an alias is eliminated expressions containing the removed alias is updated to match the new preserved one.
Combine `EliminateUniqueKeyedSelfJoin` and `EliminateAggregationSelfJoin` into single module for sharing code, i.e. `merge_table_scans`.
Add `eliminate_unique_keyed_self_join_multiple` integration test to verify `EliminateUniqueKeyedSelfJoin` runs multiple times and eliminates all self joins.
`EliminateAggregationSelfJoin::rewrite` first operates on `LogicalPlan::Projection` node, this change readds the `Projection` node with column aliases adjusted. `Aggregate` and `Window` nodes have different names for column aliases so `Projection`s expression has to be updated.
Renamed old `rewrite_expression` to `rewrite_logical_plan` and add `rewrite_expression` method.
14cbb2a
to
913371b
Compare
Change the node `EliminateUniqueKeyedSelfJoin` to be `LogicalPlan::Projection` similar to `EliminateAggregationSelfJoin`. This change allows expressions that contain reference an eliminated alias to be updated to the new one. For example `SELECT a.id, b.name ...` is updated to `SELECT a.id, a.name as b.name ...` if alias `b` is removed after the optimzation. Remove assertions that require `JOIN ON ...` to be column references as `JOIN on a.id / 2 = b.id / 2` is valid SQL for some reason. Make intermediate functions used in the module fallible with more descriptive.
0098853
to
32d7eb4
Compare
I will try to take a look at this this weekend @atahanyorganci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some comments will continue tomorrow. I think we should run some benchmarks for some of these cases
} | ||
|
||
/// Optimize self-join query by combining LHS and RHS of the join. Current implementation is | ||
/// very conservative. It only merges nodes if one of them is `TableScan`. It should be possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make an issue for this?
filters, | ||
fetch, | ||
) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return result here instead of unwrapping
} | ||
|
||
// TODO: equality of `inner` `apachearrow::datatypes::SchemaRef` doesn't mean equality of the tables | ||
fn is_table_scan_same(left: &TableScan, right: &TableScan) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also compare against, filter, projection, and source?
plan, | ||
renamed_alias, | ||
} = optimize(left_input, right_input)?; | ||
assert!(renamed_alias.is_none(), "Assert `renamed_alias` is `None` because nested `SubqueryAlias` shouldn't be possible"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we assert instead of propagating error?
_ => Ok(Transformed::no(plan)), | ||
}) | ||
.unwrap(); | ||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
let source_ref = Some(source); | ||
|
||
// If LHS column's alias isn't LHS's alias or table name then bail | ||
let left_ref = left_col.relation.as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right here, relation is returning Some(relation), if this is None
then we dont want that returning when both left_ref != left_alias && left_ref != source_ref
evaluate to true. I believe it should only check against the alias and source if it is not None (unqualified references should be allowed).
This PR implents two
EliminateAggregationSelfJoin
andEliminateUniqueKeyedSelfJoin
optimization rules. As the name suggests these optimizations eliminate self joins when the expression returned can be rewritten more efficently with a window function or removed entirely respectively.EliminateUniqueKeyedSelfJoin
In an inner join when joined column forms a unique index on the table a single row is returned.
High level overview of conditions that make this optimization possible is both sides of the join refer to same table, then joined columns when combined must form a unique index on the shared table. If any join filter is specified whether query produces a single row cannot be determined, as filter expression may return false.
Optimization Step
Sample optimizable query with
employees
table andid
as the primary key.Unoptimized logical plan.
The query can be optimized by removing duplicate
TableScan
andSubqueryAlias
nodes.##
EliminateAggregationSelfJoin
Following sample query with self-join can be used to gather analytics on product sales.
This query can more efficiently rewritten with a window expression.
Without going into why this works, the conditions where this optimization can be applied is
JOIN ON <join columns> AND <filter expression>
JOIN ON ...
should be the same asGROUP BY
These conditions conservatively allow self-join to be replaced with a window expression.
Closes: #14758