Skip to content

Eliminate Self Joins #16023

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

atahanyorganci
Copy link
Contributor

@atahanyorganci atahanyorganci commented May 11, 2025

This PR implents two EliminateAggregationSelfJoin and EliminateUniqueKeyedSelfJoin optimization rules. As the name suggests these optimizations eliminate self joins when the expression returned can be rewritten more efficently with a window function or removed entirely respectively.

EliminateUniqueKeyedSelfJoin

In an inner join when joined column forms a unique index on the table a single row is returned.

High level overview of conditions that make this optimization possible is both sides of the join refer to same table, then joined columns when combined must form a unique index on the shared table. If any join filter is specified whether query produces a single row cannot be determined, as filter expression may return false.

Optimization Step

Sample optimizable query with employees table and id as the primary key.

SELECT a.id
FROM employees a
JOIN employees b ON a.id = b.id
WHERE b.department = 'HR';

Unoptimized logical plan.

Projection: a.id
  Filter: b.department = Utf8("HR")
    Inner Join:  Filter: a.id = b.id
      SubqueryAlias: a
        TableScan: employees
      SubqueryAlias: b
        TableScan: employees

The query can be optimized by removing duplicate TableScan and SubqueryAlias nodes.

Projection: a.id
  Filter: a.department = Utf8("HR")
    SubqueryAlias: a
      TableScan: employees

## EliminateAggregationSelfJoin

Following sample query with self-join can be used to gather analytics on product sales.

SELECT a.user_id, a.purchase_date, SUM(b.amount) AS running_total
FROM purchases a
JOIN purchases b ON a.user_id = b.user_id AND b.purchase_date <= a.purchase_date
GROUP BY a.user_id, a.purchase_date;

This query can more efficiently rewritten with a window expression.

SELECT
  user_id,
  purchase_date,
  SUM(amount) OVER (PARTITION BY user_idORDER BY purchase_date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
  purchases;

Without going into why this works, the conditions where this optimization can be applied is

  • It should be inner join between same tables, a.k.a self-join
  • JOIN ON <join columns> AND <filter expression>
    • Joined columns should NOT form a unique index
    • Filter expression should be binary expression with a comparison operator and should refer to the same column
  • Columns in JOIN ON ... should be the same as GROUP BY

These conditions conservatively allow self-join to be replaced with a window expression.

Closes: #14758

@github-actions github-actions bot added the optimizer Optimizer rules label May 11, 2025
@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch from e2b09c9 to 1de198c Compare May 12, 2025 15:20
@alamb
Copy link
Contributor

alamb commented May 12, 2025

FYI @irenjj

@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch from 1de198c to baf02cf Compare May 12, 2025 19:57
@atahanyorganci atahanyorganci changed the title DRAFT: Eliminate Self Joins [WIP] Eliminate Self Joins May 14, 2025
@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch 2 times, most recently from 2732b86 to 7835159 Compare May 19, 2025 12:33
@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch 11 times, most recently from cc8ca4e to 5c1f098 Compare May 28, 2025 13:31
@atahanyorganci atahanyorganci changed the title [WIP] Eliminate Self Joins Eliminate Self Joins May 28, 2025
@atahanyorganci atahanyorganci marked this pull request as ready for review May 28, 2025 14:01
@atahanyorganci
Copy link
Contributor Author

Current implementation fails in call to assert_valid_optimization where schema of a optimization pass is compared against the previous state. Failure occurs when alias is replaced with an other. For example

SELECT
    b.id
FROM
    employees a
    JOIN employees b USING (id)
WHERE
    b.department = 'HR'

Since alias b won't exist after the optimization it is replaced with a. This causes TableReferences to not be the same and the invariant fails eventhough both of the alises refer to the same table.

@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch from ab215e5 to 14cbb2a Compare May 29, 2025 16:23
This change will allow testing queries with logical dependencies such as unique or primary constraints.
`EliminateSelfJoin` optimization can remove subquery aliases when `TableScan`s are merged. This change updates the optimization rule to run top down fashion so that when an alias is eliminated expressions containing the removed alias is updated to match the new preserved one.
Combine `EliminateUniqueKeyedSelfJoin` and `EliminateAggregationSelfJoin` into single module for sharing code, i.e. `merge_table_scans`.
Add `eliminate_unique_keyed_self_join_multiple` integration test to verify `EliminateUniqueKeyedSelfJoin` runs multiple times and eliminates all self joins.
`EliminateAggregationSelfJoin::rewrite` first operates on `LogicalPlan::Projection` node, this change readds the `Projection` node with column aliases adjusted.

`Aggregate` and `Window` nodes have different names for column aliases so `Projection`s expression has to be updated.
Renamed old `rewrite_expression` to `rewrite_logical_plan` and add `rewrite_expression` method.
@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch from 14cbb2a to 913371b Compare May 29, 2025 16:42
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 30, 2025
Change the node `EliminateUniqueKeyedSelfJoin` to be `LogicalPlan::Projection` similar to `EliminateAggregationSelfJoin`. This change allows expressions that contain reference an eliminated alias to be updated to the new one. For example `SELECT a.id, b.name ...` is updated to `SELECT a.id, a.name as b.name ...` if alias `b` is removed after the optimzation.

Remove assertions that require `JOIN ON ...` to be column references as `JOIN on a.id / 2 = b.id / 2` is valid SQL for some reason.

Make intermediate functions used in the module fallible with more descriptive.
@atahanyorganci atahanyorganci force-pushed the feature/eliminate-self-join branch from 0098853 to 32d7eb4 Compare May 30, 2025 13:18
@jonathanc-n
Copy link
Contributor

I will try to take a look at this this weekend @atahanyorganci

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments will continue tomorrow. I think we should run some benchmarks for some of these cases

}

/// Optimize self-join query by combining LHS and RHS of the join. Current implementation is
/// very conservative. It only merges nodes if one of them is `TableScan`. It should be possible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make an issue for this?

filters,
fetch,
)
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return result here instead of unwrapping

}

// TODO: equality of `inner` `apachearrow::datatypes::SchemaRef` doesn't mean equality of the tables
fn is_table_scan_same(left: &TableScan, right: &TableScan) -> bool {
Copy link
Contributor

@jonathanc-n jonathanc-n Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also compare against, filter, projection, and source?

plan,
renamed_alias,
} = optimize(left_input, right_input)?;
assert!(renamed_alias.is_none(), "Assert `renamed_alias` is `None` because nested `SubqueryAlias` shouldn't be possible");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we assert instead of propagating error?

_ => Ok(Transformed::no(plan)),
})
.unwrap();
assert!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

let source_ref = Some(source);

// If LHS column's alias isn't LHS's alias or table name then bail
let left_ref = left_col.relation.as_ref();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right here, relation is returning Some(relation), if this is None then we dont want that returning when both left_ref != left_alias && left_ref != source_ref evaluate to true. I believe it should only check against the alias and source if it is not None (unqualified references should be allowed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Self Join Eliminate
3 participants