[SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes #35779


Closed (16 commits)

Conversation

@wangyum (Member) commented Mar 9, 2022

What changes were proposed in this pull request?

  1. This PR adds a new logical plan visitor, named DistinctKeyVisitor, to find out all the distinct attributes in the current logical plan. For example:

    spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
    spark.sql("SELECT a, b, a % 10, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys

    The output is: {a#1, b#2}.

  2. This PR enhances RemoveRedundantAggregates to remove an aggregation if it is group-only (agg.groupOnly) and the child can already guarantee distinct output (see the sketch after the plans below). For example:

    set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
    create table t1 using parquet as select id a, id as b from range(10);
    create table t2 using parquet as select id as a, id as b from range(8);
    select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;

    Before this PR:

    == Optimized Logical Plan ==
    Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    +- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
       :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
       :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
       :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
       +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
          +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
             +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
    

    After this PR:

    == Optimized Logical Plan ==
    Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
    :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
    :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
    +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
       +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
          +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
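
    A hedged sketch of the new rule case, reassembled from the diff hunks quoted later in this conversation. The standalone object name and the transformUp wiring are assumptions for illustration; the case clause itself mirrors the PR's new RemoveRedundantAggregates case:

      import org.apache.spark.sql.catalyst.expressions.ExpressionSet
      import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}
      import org.apache.spark.sql.catalyst.rules.Rule

      // Hypothetical standalone sketch: a group-only Aggregate whose grouping keys are
      // already guaranteed distinct by the child (distinctKeys is the property added by
      // this PR's DistinctKeyVisitor) collapses into a simple Project.
      object RemoveRedundantAggregateSketch extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
          case agg @ Aggregate(groupingExps, _, child)
              if agg.groupOnly && child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>
            Project(agg.aggregateExpressions, child)
        }
      }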
    

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and TPC-DS benchmark test.

SQL     Before this PR (seconds)    After this PR (seconds)
q14a    206                         193
q38     59                          41
q87     127                         113

@@ -47,6 +47,17 @@ object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
      } else {
        newAggregate
      }

    case agg @ Aggregate(groupingExps, _, child)
        if agg.groupOnly && child.deterministic &&
Contributor

I still don't get why child.deterministic is required. If the child output is completely random, then it should not report any distinct keys.

Member Author

On second thought, it is not required.

        Project(agg.aggregateExpressions, child)

    case agg @ Aggregate(groupingExps, aggregateExps, child)
        if aggregateExps.forall(a => a.isInstanceOf[Alias] && a.children.forall(_.foldable)) &&
Contributor

isn't it just a.foldable?

Member Author

No. Alias isn't foldable, but its child is foldable.

  private def projectDistinctKeys(
      keys: Set[ExpressionSet], projectList: Seq[NamedExpression]): Set[ExpressionSet] = {
    val outputSet = ExpressionSet(projectList.map(_.toAttribute))
    val distinctKeys = keys.filter(_.subsetOf(outputSet))
Contributor

I think we should only do this filter in if (aliases.isEmpty). The distinct key can be a + b and the project list may have a + b AS c, then the result distinct key should be c.

Contributor

Then do the filter again at L49. It's incorrect to do this filter so early.

Member Author

OK
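
To make the point above concrete, here is a hedged sketch of an alias-aware version of projectDistinctKeys (hypothetical helper, not the committed code):

    import org.apache.spark.sql.catalyst.expressions.{Alias, ExpressionSet, NamedExpression}

    // Hypothetical sketch: rewrite each key through the project's aliases before the
    // subset check, so a distinct key {a + b} survives a project list containing
    // `a + b AS c` and becomes {c}.
    def projectDistinctKeysSketch(
        keys: Set[ExpressionSet], projectList: Seq[NamedExpression]): Set[ExpressionSet] = {
      val aliasMap = projectList.collect {
        case a: Alias => a.child.canonicalized -> a.toAttribute
      }.toMap
      val outputSet = ExpressionSet(projectList.map(_.toAttribute))
      keys
        .map(key => ExpressionSet(key.toSeq.map(e => aliasMap.getOrElse(e.canonicalized, e))))
        .filter(_.subsetOf(outputSet))
    }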

  override def visitDistinct(p: Distinct): Set[ExpressionSet] = Set(ExpressionSet(p.output))

  override def visitExcept(p: Except): Set[ExpressionSet] =
    if (!p.isAll && p.deterministic) Set(ExpressionSet(p.output)) else default(p)
Contributor

Why is p.deterministic required?

Member Author

Removed it.

  }

  override def visitIntersect(p: Intersect): Set[ExpressionSet] = {
    if (!p.isAll && p.deterministic) Set(ExpressionSet(p.output)) else default(p)
Contributor

ditto

Member Author

Removed it.


  override def visitProject(p: Project): Set[ExpressionSet] = {
    if (p.child.distinctKeys.nonEmpty) {
      projectDistinctKeys(p.child.distinctKeys.map(ExpressionSet(_)), p.projectList)
Contributor

isn't p.child.distinctKeys already a Set[ExpressionSet]?

Member Author

Yes.
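
Given that exchange, the override presumably simplifies to something like the following (hedged sketch; the else branch is assumed here to mirror the other visitors):

    override def visitProject(p: Project): Set[ExpressionSet] = {
      if (p.child.distinctKeys.nonEmpty) {
        // distinctKeys is already a Set[ExpressionSet], so no extra wrapping is needed.
        projectDistinctKeys(p.child.distinctKeys, p.projectList)
      } else {
        default(p)
      }
    }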

@@ -153,8 +150,14 @@ class RemoveRedundantAggregatesSuite extends PlanTest {
    comparePlans(optimized, expected)
  }

  test("Remove redundant aggregate - upper has contains foldable expressions") {
    val originalQuery = x.groupBy('a, 'b)('a, 'b).groupBy('a)('a, TrueLiteral).analyze
Contributor

hmm, how does this work? groupBy('a)('a, TrueLiteral) is not group only and not literal-only.

Member Author

RemoveRedundantAggregates already supported this case before this PR.

Contributor

ah this works because these two aggregates are adjacent. If they are not, we have a problem.

I'm thinking that we should refine Aggregate.groupOnly: .groupBy('a)('a, TrueLiteral) is also group only.

Member Author

Opened #35795 to address this.
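
A hedged sketch of the refinement discussed above (hypothetical helper; #35795 carries the actual change):

    import org.apache.spark.sql.catalyst.expressions.Alias
    import org.apache.spark.sql.catalyst.plans.logical.Aggregate

    // Hypothetical: treat an Aggregate as group-only when every aggregate expression is
    // either one of the grouping expressions or foldable, so that
    // groupBy('a)('a, TrueLiteral) also counts as group-only.
    def isGroupOnlySketch(agg: Aggregate): Boolean = agg.aggregateExpressions.forall {
      case Alias(child, _) =>
        child.foldable || agg.groupingExpressions.exists(_.semanticEquals(child))
      case e =>
        e.foldable || agg.groupingExpressions.exists(_.semanticEquals(e))
    }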

test("SPARK-36194: Negative case: Remove aggregation from contains non-deterministic") {
val query = relation
.groupBy('a)('a, (count('b) + rand(0)).as("cnt"))
.groupBy('a, 'cnt)('a, 'cnt)
Contributor

what can go wrong if we optimize this case?

Member Author

We can optimize this case.

@@ -1,54 +1,51 @@
== Physical Plan ==
Contributor

Can you summarize the plan changes? Do we have fewer shuffles now?

Member Author

q14a/q14b reduce 1 Exchange and 2 HashAggregates.
q38 reduces 3 Exchanges and 4 HashAggregates.

@cloud-fan (Contributor) left a comment

LGTM except for some minor comments

@cloud-fan (Contributor)

Do we see regressions in any TPCDS queries?

@wangyum (Member Author) commented Mar 10, 2022

Do we see regressions in any TPCDS queries?

There is no regression.

        if agg.groupOnly && child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>
          Project(agg.aggregateExpressions, child)

    case agg @ Aggregate(groupingExps, aggregateExps, child)
Contributor

Do we still need this case? agg.groupOnly should cover it.

Member Author

You're right, we do not need it.

@wangyum wangyum closed this in c16a66a Mar 14, 2022
@wangyum (Member Author) commented Mar 14, 2022

Merged to master.

@wangyum wangyum deleted the SPARK-36194 branch March 14, 2022 13:59
wangyum pushed a commit that referenced this pull request Apr 8, 2022
### What changes were proposed in this pull request?

This PR is a followup of #35779, to propagate distinct keys more precisely in 2 cases:
1. For `LIMIT 1`, each output attribute is a distinct key, not the entire tuple.
2. For aggregate, we can still propagate distinct keys from child.

### Why are the changes needed?

make the optimization cover more cases

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36100 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
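
A hedged sketch of the first follow-up case described above (hypothetical helper, not the exact code in #36100): a plan that produces at most one row makes each individual output attribute a distinct key, not just the whole output tuple.

    import org.apache.spark.sql.catalyst.expressions.ExpressionSet
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Hypothetical sketch: with LIMIT 1 (maxRows == 1) every single output attribute
    // is trivially distinct on its own.
    def limitOneDistinctKeys(plan: LogicalPlan): Set[ExpressionSet] =
      if (plan.maxRows.contains(1L)) {
        plan.output.map(attr => ExpressionSet(Seq(attr))).toSet
      } else {
        Set.empty[ExpressionSet]
      }
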
wangyum pushed a commit that referenced this pull request Apr 8, 2022

Same follow-up commit as above, cherry picked from commit fbe82fb.
Signed-off-by: Yuming Wang <yumwang@ebay.com>