Describe the bug
DataFusion turns aggregation computations from a LogicalPlan node into column references in the higher level plans. To illustrate, consider the following plan:
Projection: item.i_category, count(*)
Sort: count(Int64(1)) AS count(*) AS count(*) ASC NULLS LAST
Projection: item.i_category, count(Int64(1)) AS count(*), count(Int64(1))
Aggregate: groupBy=[[item.i_category]], aggr=[[count(Int64(1))]]
TableScan: item projection=[i_category]
The Projection node immediately above the Aggregate is projecting both count(Int64(1)) AS count(*) AS count(*) and count(Int64(1)). The reference to count(Int64(1)) in both expressions is denoted by a Column reference with the name literally set to count(Int64(1)). This is because from DataFusion's point of view, once it computes the aggregation, it no longer cares about how that column came to be. So all of the layers above only need to refer to the name that DataFusion assigned that column.
This presents a problem to the unparser, because we're tasked with translating this logical plan back into a SQL statement that can be executed.
The DataFusion unparser currently handles this by looking for any aggregate nodes in the LogicalPlan, and using the aggr references to find the underlying expression that computed the column with a name like count(Int64(1)). Once we have the underlying expression, we can send that to the expression unparser to get the correct SQL that produced it.
This approach works well, but in the handling for the ORDER BY clause specifically, DataFusion wasn't being general enough. It was expecting either a literal column reference (or an aliased column reference) and then invoking the aggregation unprojection logic outlined above. But ORDER BY can have arbitrary expressions. The above double aliasing is one example - its a bit strange, but logically its a valid expression.
Another example is from TPCDS Q36 (simplified to just show the relevant part):
select
i_category,
i_class,
grouping(i_category) + grouping(i_class) as lochierarchy
from
store_sales,
item
group by
rollup (i_category, i_class)
order by
grouping(i_category) + grouping(i_class) desc,
case
when grouping(i_category) + grouping(i_class) = 0 then i_category
end
LIMIT
100;
This example has both a binary expression (grouping(i_category) + grouping(i_class) desc) and a case statement case when grouping(i_category) + grouping(i_class) = 0 then i_category end.
These are both valid SQL expressions for ORDER BY clauses, and the column references were embedded inside other expressions. But the previous unparsing logic for sorts was only expecting simple column/single aliased column references.
The reason this worked previously was because DataFusion itself had a bug that didn't handle arbitrary order by expressions - so it was a bit of a happy coincidence that it worked in the unparser. That bug was fixed in #14486 and allowed DataFusion itself to handle arbitrary expressions in the ORDER BY clause. The reason it was a happy coincidence is because the bug in DataFusion was not triggered when the LogicalPlan was created, only when it was executed. But for the unparser usecase, we weren't executing it in DataFusion - we were just using it to unparse the LogicalPlan back to SQL.
To Reproduce
No response
Expected behavior
No response
Additional context
No response
Describe the bug
DataFusion turns aggregation computations from a LogicalPlan node into column references in the higher level plans. To illustrate, consider the following plan:
The Projection node immediately above the Aggregate is projecting both
count(Int64(1)) AS count(*) AS count(*)andcount(Int64(1)). The reference tocount(Int64(1))in both expressions is denoted by a Column reference with the name literally set tocount(Int64(1)). This is because from DataFusion's point of view, once it computes the aggregation, it no longer cares about how that column came to be. So all of the layers above only need to refer to the name that DataFusion assigned that column.This presents a problem to the unparser, because we're tasked with translating this logical plan back into a SQL statement that can be executed.
The DataFusion unparser currently handles this by looking for any aggregate nodes in the LogicalPlan, and using the
aggrreferences to find the underlying expression that computed the column with a name likecount(Int64(1)). Once we have the underlying expression, we can send that to the expression unparser to get the correct SQL that produced it.This approach works well, but in the handling for the
ORDER BYclause specifically, DataFusion wasn't being general enough. It was expecting either a literal column reference (or an aliased column reference) and then invoking the aggregation unprojection logic outlined above. ButORDER BYcan have arbitrary expressions. The above double aliasing is one example - its a bit strange, but logically its a valid expression.Another example is from TPCDS Q36 (simplified to just show the relevant part):
This example has both a binary expression (
grouping(i_category) + grouping(i_class) desc) and a case statementcase when grouping(i_category) + grouping(i_class) = 0 then i_category end.These are both valid SQL expressions for ORDER BY clauses, and the column references were embedded inside other expressions. But the previous unparsing logic for sorts was only expecting simple column/single aliased column references.
The reason this worked previously was because DataFusion itself had a bug that didn't handle arbitrary order by expressions - so it was a bit of a happy coincidence that it worked in the unparser. That bug was fixed in #14486 and allowed DataFusion itself to handle arbitrary expressions in the ORDER BY clause. The reason it was a happy coincidence is because the bug in DataFusion was not triggered when the LogicalPlan was created, only when it was executed. But for the unparser usecase, we weren't executing it in DataFusion - we were just using it to unparse the LogicalPlan back to SQL.
To Reproduce
No response
Expected behavior
No response
Additional context
No response