-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Eliminate all redundant aggregations #17139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,12 +26,12 @@ use std::sync::Arc; | |
|
|
||
| use datafusion_common::{ | ||
| get_required_group_by_exprs_indices, internal_datafusion_err, internal_err, Column, | ||
| HashMap, JoinType, Result, | ||
| DFSchema, HashMap, JoinType, Result, | ||
| }; | ||
| use datafusion_expr::expr::Alias; | ||
| use datafusion_expr::{ | ||
| logical_plan::LogicalPlan, Aggregate, Distinct, Expr, Projection, TableScan, Unnest, | ||
| Window, | ||
| logical_plan::LogicalPlan, Aggregate, Distinct, EmptyRelation, Expr, Projection, | ||
| TableScan, Unnest, Window, | ||
| }; | ||
|
|
||
| use crate::optimize_projections::required_indices::RequiredIndices; | ||
|
|
@@ -153,23 +153,16 @@ fn optimize_projections( | |
|
|
||
| // Only use the absolutely necessary aggregate expressions required | ||
| // by the parent: | ||
| let mut new_aggr_expr = aggregate_reqs.get_at_indices(&aggregate.aggr_expr); | ||
|
|
||
| // Aggregations always need at least one aggregate expression. | ||
| // With a nested count, we don't require any column as input, but | ||
| // still need to create a correct aggregate, which may be optimized | ||
| // out later. As an example, consider the following query: | ||
| // | ||
| // SELECT count(*) FROM (SELECT count(*) FROM [...]) | ||
| // | ||
| // which always returns 1. | ||
| if new_aggr_expr.is_empty() | ||
| && new_group_bys.is_empty() | ||
| && !aggregate.aggr_expr.is_empty() | ||
| { | ||
| // take the old, first aggregate expression | ||
| new_aggr_expr = aggregate.aggr_expr; | ||
| new_aggr_expr.resize_with(1, || unreachable!()); | ||
| let new_aggr_expr = aggregate_reqs.get_at_indices(&aggregate.aggr_expr); | ||
|
|
||
| if new_group_bys.is_empty() && new_aggr_expr.is_empty() { | ||
| // Global aggregation with no aggregate functions always produces 1 row and no columns. | ||
| return Ok(Transformed::yes(LogicalPlan::EmptyRelation( | ||
| EmptyRelation { | ||
| produce_one_row: true, | ||
| schema: Arc::new(DFSchema::empty()), | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not related to this PR, but why we need schema for We probably can simplify using this relationship
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. UPD: actually the description is not accurate, EmptyRelation used with non empty schema in joins and some other places, we need to update the description
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here we don't produce any symbols, but maybe we use A different question -- why do we have both,
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Both structs have much in common, but Although
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why not? |
||
| }, | ||
| ))); | ||
| } | ||
|
|
||
| let all_exprs_iter = new_group_bys.iter().chain(new_aggr_expr.iter()); | ||
|
|
@@ -1146,9 +1139,7 @@ mod tests { | |
| plan, | ||
| @r" | ||
| Aggregate: groupBy=[[]], aggr=[[count(Int32(1))]] | ||
| Projection: | ||
| Aggregate: groupBy=[[]], aggr=[[count(Int32(1))]] | ||
| TableScan: ?table? projection=[] | ||
| EmptyRelation | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. table scan gets replaced with 1-row VALUES (that would be more visible if we merge #17145) we could further eliminate |
||
| " | ||
| ) | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| statement ok | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I recommend we name this file more specifically, or put it in an existing function, perhaps aggregate.slt ? Not a blocker, just a suggestion
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This showed up as a regression so i named this file as a regression test.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, maybe that would be better. But if there is already precedent for regression cases perhaps we can leave this PR as is and move it in a subsequent PR |
||
| CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER) | ||
|
|
||
| statement ok | ||
| INSERT INTO tab1 VALUES(51,14,96) | ||
|
|
||
| query R | ||
| SELECT NULL * AVG(DISTINCT 4) + SUM(col1) AS col0 FROM tab1 | ||
| ---- | ||
| NULL | ||
|
|
||
| query TT | ||
| EXPLAIN SELECT NULL * AVG(DISTINCT 4) + SUM(col1) AS col0 FROM tab1 | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: Float64(NULL) AS col0 | ||
| 02)--EmptyRelation | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[NULL as col0] | ||
| 02)--PlaceholderRowExec | ||
|
|
||
| # Similar, with a few more arithmetic operations | ||
| query R | ||
| SELECT + CAST ( NULL AS INTEGER ) * + + AVG ( DISTINCT 4 ) + - SUM ( ALL + col1 ) AS col0 FROM tab1 | ||
| ---- | ||
| NULL | ||
|
|
||
| query TT | ||
| EXPLAIN SELECT + CAST ( NULL AS INTEGER ) * + + AVG ( DISTINCT 4 ) + - SUM ( ALL + col1 ) AS col0 FROM tab1 | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: Float64(NULL) AS col0 | ||
| 02)--EmptyRelation | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[NULL as col0] | ||
| 02)--PlaceholderRowExec | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -73,4 +73,3 @@ query I | |
| SELECT getbit(11, NULL); | ||
| ---- | ||
| NULL | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1453,9 +1453,7 @@ logical_plan | |
| 01)LeftSemi Join: | ||
| 02)--TableScan: t1 projection=[a] | ||
| 03)--SubqueryAlias: __correlated_sq_1 | ||
| 04)----Projection: | ||
| 05)------Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] | ||
| 06)--------TableScan: t2 projection=[] | ||
| 04)----EmptyRelation | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. seems like an improvement to me |
||
|
|
||
| statement count 0 | ||
| drop table t1; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a good rule to have, though perhaps it would be eaiser to find if it were with related rules for aggregates, for example
https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/eliminate_group_by_constant.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO here is a natural place to place this logic. We're pruning and we need to somehow handle the case where we prune the last aggregate function. The code should do something reasonable and now it does.
alternative is to let this place create Agg with empty group by and no aggregate functions (I had so initially), and then have a separate rule that finds such trivial Aggs and replaces with VALUES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping it here just makes sense