[SPARK-32056][SQL] Coalesce partitions for repartition by expressions when AQE is enabled #28900

viirya · 2020-06-23T02:54:57Z

What changes were proposed in this pull request?

This patch proposes to coalesce partitions when repartitioning by expressions without a specified number of partitions, when AQE is enabled.

Why are the changes needed?

When repartitioning by partition expressions, users may or may not specify the number of partitions. If the number of partitions is specified, we should not coalesce partitions, because that would break the user's expectation. If it is not specified, AQE should be able to coalesce partitions as it does for other shuffles.
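The rule described above can be sketched in plain Scala, without Spark. This is only an illustrative model: `CoalesceRule` and `mayCoalesce` are hypothetical names, not Spark APIs, and the real change is wired through the planner rather than a standalone function.

```scala
// Spark-free sketch; every name here is hypothetical, not a Spark API.
object CoalesceRule {
  // numPartitions = None models repartition(exprs) with no explicit count.
  // numPartitions = Some(n) models repartition(n, exprs).
  // Coalescing is allowed only when AQE is on and the user fixed no count.
  def mayCoalesce(aqeEnabled: Boolean, numPartitions: Option[Int]): Boolean =
    aqeEnabled && numPartitions.isEmpty
}
```

Under this model, `CoalesceRule.mayCoalesce(true, None)` is true, while an explicit count such as `Some(10)` always yields false, which matches the expectation that a user-chosen partition number is preserved.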

Does this PR introduce any user-facing change?

Yes. After this change, if users don't specify the number of partitions when repartitioning data by expressions, AQE will coalesce partitions.

How was this patch tested?

Added unit test.

cloud-fan · 2020-06-23T04:39:27Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-  def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] = {
+  private def repartitionByExpression(
+      numPartitions: Option[Int],
+      partitionExprs: Column*): Dataset[T] = {


For an internal method, we don't need to use a var-length parameter list.
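A minimal sketch of this suggestion, assuming the public overloads keep their var-length signatures and only the private helper switches to a plain `Seq`. The object name `RepartitionApi`, the `String` stand-in for `Column`, and the string return value are hypothetical simplifications so the example runs without Spark.

```scala
// Hypothetical, Spark-free illustration; String stands in for Column.
object RepartitionApi {
  // Public entry points keep the var-length parameter list for callers.
  def repartition(exprs: String*): String =
    repartitionByExpression(None, exprs)

  def repartition(numPartitions: Int, exprs: String*): String =
    repartitionByExpression(Some(numPartitions), exprs)

  // The internal helper takes a Seq, so no `: _*` expansion is needed
  // when forwarding the arguments from the public overloads.
  private def repartitionByExpression(
      numPartitions: Option[Int],
      exprs: Seq[String]): String =
    numPartitions.map(_.toString).getOrElse("default") + ":" + exprs.mkString(",")
}
```

For example, `RepartitionApi.repartition("a", "b")` returns `"default:a,b"` and `RepartitionApi.repartition(10, "a")` returns `"10:a"`.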

cloud-fan · 2020-06-23T04:40:08Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+
+  private def repartitionByRange(
+      numPartitions: Option[Int],
+      partitionExprs: Column*): Dataset[T] = {


cloud-fan · 2020-06-23T04:41:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

        exchange.ShuffleExchangeExec(
-          r.partitioning, planLater(r.child), canChangeNumPartitions = false) :: Nil
+          r.partitioning, planLater(r.child), canChangeNumPartitions = canChangeNumParts) :: Nil


Now that we have a variable name, we can just write: r.partitioning, planLater(r.child), canChangeNumParts

cloud-fan · 2020-06-23T04:43:04Z

cc @maryannxue @JkSelf @koertkuipers

cloud-fan · 2020-06-23T05:57:26Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

        SQLConf.SHUFFLE_PARTITIONS.key -> "6",
        SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "7") {
-        val partitionsNum = spark.range(10).repartition($"id").rdd.collectPartitions().length
+        val df = spark.range(10).repartition($"id")


Can we test repartition(numPartitions) in this test case and make sure the partition number doesn't change? Your new test case already tests repartition by key/range.

SparkQA · 2020-06-23T07:05:02Z

Test build #124378 has finished for PR 28900 at commit 0a9223f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-23T07:05:02Z

Test build #124391 has finished for PR 28900 at commit 8e39ed7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-23T07:05:02Z

Test build #124387 has finished for PR 28900 at commit 43c4726.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-06-23T07:43:13Z

retest this please

cloud-fan · 2020-06-23T08:15:30Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

+        }
+
+        val partitionsNum2 = df2.rdd.collectPartitions().length
+        assert(partitionsNum2 == 10)


nit: assert(df2.rdd.collectPartitions().length == 10)

SparkQA · 2020-06-23T13:24:53Z

Test build #124401 has finished for PR 28900 at commit 8e39ed7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-23T18:40:58Z

Test build #124424 has finished for PR 28900 at commit 4b9b0e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ulysses-you · 2020-06-24T03:43:48Z

Can we add the feature in ResolveCoalesceHints ? Hint can call repartition with default shuffle number.

viirya · 2020-06-24T04:46:47Z

> Can we add the feature in ResolveCoalesceHints ? Hint can call repartition with default shuffle number.

Do you mean something like SELECT /*+ COALESCE() */ ... ? When no partition number is specified, let it be the default partition number, so that AQE can coalesce it if enabled?

Seems currently the COALESCE hint doesn't allow default partition number usage. I'm not sure the reason about it.

cloud-fan · 2020-06-24T05:05:27Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

-        val partitionsNum = spark.range(10).repartition($"id").rdd.collectPartitions().length
+        val df1 = spark.range(10).repartition($"id")
+        val df2 = spark.range(10).repartition(10, $"id")
+        val df3 = spark.range(10).repartition(10)


repartitionByRange also takes numPartitions. Can we test it as well and check it doesn't coalesce?

ulysses-you · 2020-06-24T05:17:04Z

> Seems currently the COALESCE hint doesn't allow default partition number usage. I'm not sure the reason about it.

I mean the repartition, such as this sql select /*+ repartition(col) */ * from test.

cloud-fan · 2020-06-24T06:09:08Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

@@ -1026,13 +1026,79 @@ class AdaptiveQueryExecSuite
    Seq(true, false).foreach { enableAQE =>


We can merge this test case into your two newly added test cases.

I.e., one test covers repartition and verifies both the initial partition number and the coalesced partition number. The other test does the same for repartitionByRange.

Yeah, merged them.

viirya · 2020-06-24T06:09:24Z

> I mean the repartition, such as this sql select /*+ repartition(col) */ * from test.

Sounds reasonable to me. @cloud-fan WDYT?

cloud-fan · 2020-06-24T06:57:06Z

Yea, /*+ repartition(col) */ should also be supported by AQE

SparkQA · 2020-06-24T07:05:02Z

Test build #124461 has finished for PR 28900 at commit 7ceaebc.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-24T07:05:02Z

Test build #124467 has finished for PR 28900 at commit df6a035.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-06-24T08:54:18Z

LGTM. We can support /*+ repartition(col) */ with a followup PR.

SparkQA · 2020-06-24T21:44:15Z

Test build #124491 has finished for PR 28900 at commit 1ae1a87.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

manuzhang · 2020-06-29T11:15:53Z

@viirya Can we support DISTRIBUTE BY in SQL as well?

cloud-fan · 2020-06-29T11:33:29Z

thanks, merging to master!

cloud-fan · 2020-06-29T11:34:19Z

@viirya please send a new PR to fix the SQL side, thanks!

viirya · 2020-06-29T16:12:57Z

@cloud-fan Thanks, will do it.

…nt and sql when AQE is enabled

### What changes were proposed in this pull request?

As a follow-up to #28900, this patch extends coalescing partitions to repartitioning using hints and SQL syntax without specifying the number of partitions, when AQE is enabled.

### Why are the changes needed?

When repartitioning using hints and SQL syntax, we should follow the shuffling behavior of repartition by expression/range and coalesce partitions when AQE is enabled.

### Does this PR introduce _any_ user-facing change?

Yes. After this change, if users don't specify the number of partitions when repartitioning using the `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE BY`/`CLUSTER BY`, AQE will coalesce partitions.

### How was this patch tested?

Unit tests.

Closes #28952 from viirya/SPARK-32056-sql.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… when AQE is enabled

This patch proposes to coalesce partitions when repartitioning by expressions without a specified number of partitions, when AQE is enabled.

When repartitioning by partition expressions, users may or may not specify the number of partitions. If the number of partitions is specified, we should not coalesce partitions, because that would break the user's expectation. If it is not specified, AQE should be able to coalesce partitions as it does for other shuffles.

Yes. After this change, if users don't specify the number of partitions when repartitioning data by expressions, AQE will coalesce partitions.

Added unit test.

Closes apache#28900 from viirya/SPARK-32056.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


Coalesce partitions for repartition by key when AQE is enabled.

0a9223f

probot-autolabeler bot added the SQL label Jun 23, 2020

cloud-fan reviewed Jun 23, 2020

Address comments.

43c4726

cloud-fan reviewed Jun 23, 2020

Add test.

8e39ed7

cloud-fan reviewed Jun 23, 2020

For comments.

4b9b0e8

cloud-fan reviewed Jun 24, 2020

For comments.

7ceaebc

cloud-fan reviewed Jun 24, 2020

Refine test cases.

df6a035

cloud-fan reviewed Jun 24, 2020

cloud-fan reviewed Jun 24, 2020

Modify test name.

1ae1a87

cloud-fan closed this in 4204a63 Jun 29, 2020

viirya mentioned this pull request Jun 30, 2020

[SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled #28952

Closed

MGHawes mentioned this pull request May 16, 2021

Mh/cherry pick spark 33494 palantir/spark#762

Closed

LorenzoMartini mentioned this pull request May 18, 2021

Cherry Pick [SPARK-31220][SPARK-32056][SPARK-33494] palantir/spark#764

Merged

viirya deleted the SPARK-32056 branch December 27, 2023 18:28
