[SPARK-12213] [SQL] use multiple partitions for single distinct query #10228
Conversation
Test build #47442 has started for PR 10228 at commit
Test build #47443 has started for PR 10228 at commit
oh, just realized that the plan for a query like
Ideally, we should still use four aggregate operators like the one shown below but without the overhead of using Expand.
We could move the planning of distinct queries entirely to the DistinctAggregateRewriter. This would require us to merge the non-distinct aggregate paths and the first distinct group aggregate path, so we could avoid the Expand in the case of a single distinct column group. This is quite a bit of work; I don't know if it is worth the effort.
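For context, a minimal standalone sketch (plain Scala with made-up types, not Spark's actual Expand operator) of the overhead being discussed: an Expand-style rewrite re-emits every input row once per projection list, nulling out the columns a projection does not need, so using it for a single distinct column group roughly doubles the rows fed into the first aggregate.

```scala
// Simplified, hypothetical model of an Expand-style rewrite.
case class InputRow(a: Option[Int], b: Option[Int])

// Each input row is re-emitted once per projection.
def expand(input: Seq[InputRow], projections: Seq[InputRow => InputRow]): Seq[InputRow] =
  input.flatMap(row => projections.map(p => p(row)))

val input = Seq(InputRow(Some(1), Some(10)), InputRow(Some(2), Some(20)))

// One projection keeps only the distinct column `a`; the other keeps only `b`
// for the non-distinct aggregates.
val expanded = expand(input, Seq(
  row => InputRow(row.a, None),
  row => InputRow(None, row.b)
))

// Twice as many rows flow into the first aggregate -- the overhead being avoided.
assert(expanded.size == 2 * input.size)
```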
Test build #47533 has finished for PR 10228 at commit
val aggregationBufferSchema = aggregateFunctions.flatMap(_.aggBufferAttributes)
val modes = aggregateExpressions.map(_.mode).distinct
if (aggregateExpressions.nonEmpty) {
val inputAggregationBufferSchema = if (initialInputBufferOffset == 0) {
Why doesn't this just check groupingKeyAttributes.nonEmpty?
Test build #47536 has finished for PR 10228 at commit
@hvanhovell The difficulty of doing this in DistinctAggregateRewriter is that DistinctAggregateRewriter would generate two logical plans, but some aggregate functions have different updateExpression and mergeExpression, so they could not work as update-merge-update-final; they have to work as update-merge-merge-final.
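To illustrate the update/merge distinction, here is a minimal self-contained sketch (plain Scala, not Spark's DeclarativeAggregate/ImperativeAggregate API) of an average whose update and merge logic differ, which is why the stages cannot be swapped:

```scala
// Average keeps (sum, count) in its buffer: update consumes an input value and
// bumps the count by one, while merge adds both sums and both counts.
final case class AvgBuffer(sum: Double, count: Long)

object Avg {
  val zero: AvgBuffer = AvgBuffer(0.0, 0L)

  // Applied while reading input rows (the "update" stage).
  def update(buf: AvgBuffer, value: Double): AvgBuffer =
    AvgBuffer(buf.sum + value, buf.count + 1)

  // Applied when combining partial buffers (the "merge" stages).
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer =
    AvgBuffer(a.sum + b.sum, a.count + b.count)

  def result(buf: AvgBuffer): Double = buf.sum / buf.count
}

// Running an "update" step where a "merge" step is required would treat a
// partial buffer as a single input row and count it as one, producing a wrong
// average -- hence update-merge-merge-final rather than update-merge-update-final.
```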
@davies don't get me wrong. I think this PR is an improvement over the current situation (it never crossed my mind to change partitioning when I was working on that part of the code), and it should be added. I am also not too keen on changing the MultipleDistinctRewriter, given the time it would take and the objections you've raised. The only thing that bugs me is that we currently rewrite distinct aggregates in two places, and I was thinking (out loud) about a potential solution.
@hvanhovell If we can figure out a better solution, it is definitely welcome.
Test build #47544 has finished for PR 10228 at commit
Test build #47559 has finished for PR 10228 at commit
Branch force-pushed from 8262ad8 to 740e725.
Test build #47570 has finished for PR 10228 at commit
Branch force-pushed from 740e725 to 71e0b1c.
Test build #2211 has finished for PR 10228 at commit
Test build #47601 has finished for PR 10228 at commit
Test build #2212 has finished for PR 10228 at commit
protected val allAggregateFunctions: Array[AggregateFunction] = {
protected def initializeAggregateFunctions(
    expressions: Seq[AggregateExpression],
    startingInputBufferOffset: Int): Array[AggregateFunction] = {
format
case other =>
  throw new IllegalStateException(
    s"${aggregationMode} should not be passed into TungstenAggregationIterator.")
override def generateResultProjection(): (UnsafeRow, MutableRow) => UnsafeRow = {
override protected
@hvanhovell With this change, we will use the planner rule to handle single distinct aggregations and use the rewriter to handle multiple distinct aggregations, which is the same as when you originally introduced the rewriter. I think the compilation logic after this change is better than our current logic (having two different rules that handle the same case). What do you think?
@davies I only left a few minor comments. Overall, it is very cool!
@yhuai I think having the two clearly separated paths (this PR) is an improvement over the current situation. I also admit that I am responsible for introducing the second path. Your comment on having four aggregate steps without the exchange triggered me, and I was thinking out loud about how we could do this using the rewriting rule (the removal of one of the paths would have been a bonus).
i += 1
val joinedRow = new JoinedRow
if (aggregateExpressions.nonEmpty) {
val mergeExpressions = aggregateFunctions.zipWithIndex.flatMap {
zip(aggregateExpressions)?
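A tiny standalone illustration of the suggestion (hypothetical stand-in values, not the PR's actual sequences): zipping the two sequences directly yields the same pairs as carrying an index via zipWithIndex and looking the expression up by position.

```scala
// Hypothetical stand-ins for aggregateFunctions and aggregateExpressions.
val aggregateFunctions = Seq("sumFn", "countFn")
val aggregateExpressions = Seq("sum(a)", "count(b)")

// Current style in the snippet above: carry the index and look the expression up.
val viaIndex = aggregateFunctions.zipWithIndex.flatMap { case (fn, i) =>
  Seq((fn, aggregateExpressions(i)))
}

// Suggested style: zip the two sequences directly.
val viaZip = aggregateFunctions.zip(aggregateExpressions).flatMap { case (fn, expr) =>
  Seq((fn, expr))
}

assert(viaIndex == viaZip)
```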
Test build #47622 has finished for PR 10228 at commit
@hvanhovell Yea, it's great to think about how to use a single rule to handle aggregation queries with distinct after we have this improvement. The logical rewriter rules are probably a good place because rewriting logical plans is easier. If that is the right approach, we can make some changes to our physical planner so that it respects the aggregation mode of an agg expression in a logical agg operator (right now, our physical planner always ignores the mode). Then, when we create the physical plan, we can understand that, for example, a logical agg operator is used to merge aggregation buffers.
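A rough sketch (made-up types, not Spark's planner code or its actual AggregateMode) of what "respecting the mode" during physical planning could look like: the planner would branch on the mode carried by each aggregate expression instead of always assuming a fresh partial/final pair.

```scala
// Hypothetical modes mirroring the partial / partial-merge / final stages
// discussed above.
sealed trait AggMode
case object Partial extends AggMode
case object PartialMerge extends AggMode
case object Final extends AggMode

// A mode-aware planner picks the physical behaviour per mode rather than ignoring it.
def physicalStep(mode: AggMode): String = mode match {
  case Partial      => "read input rows and build partial buffers"
  case PartialMerge => "merge incoming partial buffers"
  case Final        => "merge buffers and emit final result values"
}
```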
@hvanhovell How about we merge this first and then take a look at how to use a single rule to handle aggregation queries with distinct?
@yhuai LGTM. Yea, let's merge this one. I'll create a ticket for the distinct rules.
Cool. I am merging this one to master.
Currently, we could generate two different plans for a query with a single distinct aggregation (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other works better on high-cardinality columns (the default one). This PR changes the planner to generate a single plan (three aggregations and two exchanges) that works well in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6).

For a query like `SELECT COUNT(DISTINCT a) FROM table`, the plan will be:

```
AGG-4 (count distinct)
  Shuffle to a single reducer
    Partial-AGG-3 (count distinct, no grouping)
      Partial-AGG-2 (grouping on a)
        Shuffle by a
          Partial-AGG-1 (grouping on a)
```

This PR also includes a large refactoring of aggregation (reducing it by 500+ lines of code).

cc @yhuai @nongli @marmbrus

Author: Davies Liu <davies@databricks.com>

Closes apache#10228 from davies/single_distinct
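As a usage note, one way to inspect the resulting plan (assuming a Spark shell of that era where a sqlContext is in scope; the table and column names here are illustrative, not from the PR):

```scala
// Illustrative only: register some data and look at the physical plan.
// In Spark 1.6-era code this is a SQLContext; later versions expose the same
// methods on SparkSession.
import sqlContext.implicits._

val df = Seq(1, 1, 2, 3).toDF("a")
df.registerTempTable("table")

// The printed plan should show the single plan shape described above,
// regardless of the cardinality of column `a`.
sqlContext.sql("SELECT COUNT(DISTINCT a) FROM table").explain()
```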