add more comments

peter-toth · peter-toth · commit 02e3a68dabe6 · 2023-08-02T13:54:19.000+02:00
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/merge/MergeScalarSubqueries.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/merge/MergeScalarSubqueries.scala
@@ -251,7 +251,7 @@ object MergeScalarSubqueries extends Rule[LogicalPlan] {
   // - the merged plan,
   // - the attribute mapping from the new to the merged version,
   // - optional filters of both plans that need to be propagated and merged in an ancestor
-  // `Aggregate` node if possible.
+  //   `Aggregate` node if possible.
   //
   // Please note that merging arbitrary plans can be complicated, the current version supports only
   // some of the most important nodes.
@@ -348,24 +348,130 @@ object MergeScalarSubqueries extends Rule[LogicalPlan] {
             case _ => None
           }
 
-        // If `Filter`s are not exactly the same we can still try propagating up their differing
-        // condition because in some cases we will be able to merge them in an `Aggregate` parent
-        // node.
-        // E.g.:
-        //   SELECT avg(a) FROM t WHERE c = 1
+        // If `Filter` conditions are not exactly the same we can still try propagating up their
+        // differing condition because in some cases we will be able to merge them in an `Aggregate`
+        // parent node. E.g. we can merge:
+        //
+        // SELECT avg(a) FROM t WHERE c = 1
+        //
         // and:
-        //   SELECT sum(b) FROM t WHERE c = 2
-        // can be merged to:
-        // SELECT namedStruct(
-        //   'a', avg(a) FILTER (WHERE c = 1),
-        //   'b', sum(b) FILTER (WHERE c = 2)) AS mergedValue
+        //
+        // SELECT sum(b) FROM t WHERE c = 2
+        //
+        // into:
+        //
+        // SELECT
+        //   avg(a) FILTER (WHERE c = 1),
+        //   sum(b) FILTER (WHERE c = 2)
         // FORM t
         // WHERE c = 1 OR c = 2
         //
-        // Please note that depending on where the different `Filter`s reside in the plan and on
-        // which column the predicates are defined, we need to check the physical plan to make sure
-        // if `c` is not a partitioning or bucketing column and `c` is not present in pushed down
-        // filters. Otherwise the merged query can suffer performance degradation.
+        // But there are some special cases we need to consider:
+        //
+        // - The plans to be merged might contain multiple adjacent `Filter` nodes and the parent
+        //   `Filter` nodes should incorporate the propagated filters from child ones during merge.
+        //
+        //   E.g. adjacent filters can appear in plans when some of the optimization rules (like
+        //   `PushDownPredicates`) are disabled.
+        //
+        //   Let's consider we want to merge query 1:
+        //
+        //   SELECT avg(a)
+        //   FROM (
+        //     SELECT * FROM t WHERE c1 = 1
+        //   ) t
+        //   WHERE c2 = 1
+        //
+        //   and query 2:
+        //
+        //   SELECT sum(b)
+        //   FROM (
+        //     SELECT * FROM t WHERE c1 = 2
+        //   ) t
+        //   WHERE c2 = 2
+        //
+        //   then the optimal merged query is:
+        //
+        //   SELECT
+        //     avg(a) FILTER (WHERE c2 = 1 AND c1 = 1),
+        //     sum(b) FILTER (WHERE c2 = 2 AND c1 = 2)
+        //   FROM (
+        //     SELECT * FROM t WHERE c1 = 1 OR c1 = 2
+        //   ) t
+        //   WHERE (c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2)
+        //
+        //   This is because the `WHERE (c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2)` parent `Filter`
+        //   condition is more selective than a simple `WHERE c2 = 1 OR c2 = 2` would be as the
+        //   simple condition would let through rows containing c1 = 1 and c2 = 2, which none of the
+        //   original queries do.
+        //
+        // - When we are merging plans to already merged plans the propagated filter conditions
+        //   could grow quickly, which we can avoid by tagging the already propagated filters.
+        //
+        //   E.g. if we merged the previous optimal merged query and query 3:
+        //
+        //   SELECT max(b)
+        //   FROM (
+        //     SELECT * FROM t WHERE c1 = 3
+        //   ) t
+        //   WHERE c2 = 3
+        //
+        //   then a new double-merged query would look like this:
+        //
+        //   SELECT
+        //     avg(a) FILTER (WHERE
+        //       (c2 = 1 AND c1 = 1) AND
+        //         ((c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2) AND (c1 = 1 OR c1 = 2))
+        //     ),
+        //     sum(b) FILTER (WHERE
+        //       (c2 = 2 AND c1 = 2) AND
+        //         ((c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2) AND (c1 = 1 OR c1 = 2))
+        //     ),
+        //     max(b) FILTER (WHERE c2 = 3 AND c1 = 3)
+        //   FROM (
+        //     SELECT * FROM t WHERE (c1 = 1 OR c1 = 2) OR c1 = 3
+        //   ) t
+        //   WHERE
+        //     ((c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2) AND (c1 = 1 OR c1 = 2)) OR
+        //       (c2 = 3 AND c1 = 3)
+        //
+        //   which is not optimal and contains unnecessarily complex conditions.
+        //
+        //   Please note that `BooleanSimplification` and other rules could help simplify filter
+        //   conditions, but when we merge a large number of queries in this rule, the plan size can
+        //   increase exponentially and can cause memory issues before `BooleanSimplification` could
+        //   run.
+        //
+        //   But we can avoid that complexity if we tag already propagated filter conditions with a
+        //   simple `PropagatedFilter` wrapper during merge.
+        //   E.g. the actual merged query of query 1 and query 2 produced by this rule looks like
+        //   this:
+        //
+        //   SELECT
+        //     avg(a) FILTER (WHERE c2 = 1 AND c1 = 1),
+        //     sum(b) FILTER (WHERE c2 = 2 AND c1 = 2)
+        //   FROM (
+        //     SELECT * FROM t WHERE PropagatedFilter(c1 = 1 OR c1 = 2)
+        //   ) t
+        //   WHERE PropagatedFilter((c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2))
+        //
+        //   And so when we merge query 3 we know that filter conditions tagged with
+        //   `PropagatedFilter` can be ignored during filter propagation and thus we get a much
+        //   simpler double-merged query:
+        //
+        //   SELECT
+        //     avg(a) FILTER (WHERE c2 = 1 AND c1 = 1),
+        //     sum(b) FILTER (WHERE c2 = 2 AND c1 = 2),
+        //     max(b) FILTER (WHERE c2 = 3 AND c1 = 3)
+        //   FROM (
+        //     SELECT * FROM t WHERE PropagatedFilter(PropagatedFilter(c1 = 1 OR c1 = 2) OR c1 = 3)
+        //   ) t
+        //   WHERE
+        //     PropagatedFilter(
+        //       PropagatedFilter((c2 = 1 AND c1 = 1) OR (c2 = 2 AND c1 = 2)) OR
+        //       (c2 = 3 AND c1 = 3))
+        //
+        //   At the end of the rule we remove the `PropagatedFilter` wrappers.
         case (_, np: Filter, cp: Filter) =>
           tryMergePlans(np.child, cp.child, scanCheck).flatMap {
             case (mergedChild, outputMap, newChildFilter, mergedChildFilter) =>
@@ -459,29 +565,27 @@ object MergeScalarSubqueries extends Rule[LogicalPlan] {
   }
 
   /**
-   * - When we merge projection nodes (`Project` and `Aggregate`) we need to merge the named
-   * expression list coming from the new plan node into the expressions of the projection node of
-   * the merged child plan and return a merged list of expressions that will be placed into the
-   * merged projection node.
+   * Merges named expression lists of `Project` or `Aggregate` nodes of the new plan into the named
+   * expression list of a similar node of the cached plan.
+   *
    * - Before we can merge the new expressions, we need to take into account the propagated
-   * attribute mapping that describes the transformation from the input attributes the new plan's
-   * projection node to the input attributes of the merged child plan's projection node.
-   * - While merging the new expressions we need to build a new attribute mapping that describes
-   * the transformation from the output attributes of the new expressions to the output attributes
-   * of the merged list of expression.
-   * - If any filters are propagated from `Filter` nodes below, we need to transform the expressions
-   * to named expressions and merge them into the cached expressions as we did with new expressions.
+   * attribute mapping that describes the transformation from the input attributes of the new plan
+   * node to the output attributes of the already merged child plan node.
+   * - While merging the new expressions we need to build a new attribute mapping to propagate.
+   * - If any filters are propagated from `Filter` nodes below then we could add all the referenced
+   * attributes of filter conditions to the merged expression list, but it is better if we alias
+   * whole filter conditions and propagate only the new boolean attributes.
    *
-   * @param newExpressions the expressions of the new plan's projection node
-   * @param outputMap the propagated attribute mapping
-   * @param cachedExpressions the expressions of the cached plan's projection node
-   * @param newChildFilter the propagated filters from `Filter` nodes of the new plan
+   * @param newExpressions    the expression list of the new plan node
+   * @param outputMap         the propagated attribute mapping
+   * @param cachedExpressions the expression list of the cached plan node
+   * @param newChildFilter    the propagated filters from `Filter` nodes of the new plan
    * @param mergedChildFilter the propagated filters from `Filter` nodes of the merged child plan
    * @return A tuple of:
    *         - the merged expression list,
    *         - the new attribute mapping to propagate,
-   *         - the output attributes of the merged newChildFilter to propagate,
-   *         - the output attributes of the merged mergedChildFilter to propagate,
+   *         - the output attribute of the merged newChildFilter to propagate,
+   *         - the output attribute of the merged mergedChildFilter to propagate
    */
   private def mergeNamedExpressions(
       newExpressions: Seq[NamedExpression],