[SPARK-29544] [SQL] optimize skewed partition based on data size #26434
Conversation
@cloud-fan Please help me review. Thanks for your help.

Test build #113434 has finished for PR 26434 at commit

Test build #113435 has finished for PR 26434 at commit

Test build #113444 has finished for PR 26434 at commit

Please help to retest. Thanks.

retest this please

Test build #113484 has finished for PR 26434 at commit
```scala
 * Get a reader for the specific partitionIndex in map output statistics that are
 * produced by range mappers. Called on executors by reduce tasks.
 */
def getReaderForRangeMapper[K, C](
```
We can probably merge this method with `getReaderForOneMapper`; one mapper is just a special case of range mappers.
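A minimal sketch of the merged-API idea, using a hypothetical trait and method names (this is not the actual MapOutputTracker/ShuffleManager interface):

```scala
import org.apache.spark.TaskContext

// Hypothetical sketch: one range-based reader method; reading a single mapper
// is just the special case of the range [mapIndex, mapIndex + 1).
trait RangeShuffleReaderSupport {
  def getReaderForMapperRange[K, C](
      shuffleId: Int,
      startMapIndex: Int,  // inclusive
      endMapIndex: Int,    // exclusive
      partitionIndex: Int,
      context: TaskContext): Iterator[Product2[K, C]]

  // The existing single-mapper call collapses into a one-element range.
  def getReaderForOneMapper[K, C](
      shuffleId: Int,
      mapIndex: Int,
      partitionIndex: Int,
      context: TaskContext): Iterator[Product2[K, C]] =
    getReaderForMapperRange(shuffleId, mapIndex, mapIndex + 1, partitionIndex, context)
}
```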
```scala
  }
}

case class SkewedShuffleReaderExec(
```
This is different from local/coalesced shuffle reader as it reads only one reduce partition. Maybe better to call it PostShufflePartitionReader
```scala
val size = metrics.bytesByPartitionId(partitionId)
val factor = size / medianSize
val numMappers = getShuffleStage(stage).
  plan.shuffleDependency.rdd.partitions.length
```
We can easily get the data size of each mapper; shall we split mapper ranges based on data size?
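For illustration, a small stand-alone helper showing what a size-based split could look like; the function name and the `targetSize` parameter are assumptions, not code from this PR:

```scala
import scala.collection.mutable.ArrayBuffer

// Greedily group contiguous mapper indices so that each group's total output size
// stays close to `targetSize`. Returns (start, end) pairs with the end exclusive.
def splitMapperRangesBySize(mapperSizes: Array[Long], targetSize: Long): Seq[(Int, Int)] = {
  val ranges = ArrayBuffer.empty[(Int, Int)]
  var start = 0
  var acc = 0L
  var i = 0
  while (i < mapperSizes.length) {
    if (acc > 0 && acc + mapperSizes(i) > targetSize) {
      ranges += ((start, i))  // close the current range before it exceeds the target
      start = i
      acc = 0L
    }
    acc += mapperSizes(i)
    i += 1
  }
  ranges += ((start, mapperSizes.length))
  ranges.toSeq
}
```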
```scala
@@ -279,6 +279,24 @@ private[sql] trait SQLTestData { self =>
    df
  }

  protected lazy val skewData1: DataFrame = {
```
Let's define the data where it is used. This file should only contain dataframes that are used by multiple test suites.
Test build #113622 has finished for PR 26434 at commit
```scala
 * and the second item is a sequence of (shuffle block id, shuffle block size, map index)
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
def convertMapStatuses(
```
Can we merge the two `convertMapStatuses` methods?
```diff
@@ -116,7 +116,8 @@ class CoalescedPartitioner(val parent: Partitioner, val partitionStartIndices: A
 class ShuffledRowRDD(
     var dependency: ShuffleDependency[Int, InternalRow, InternalRow],
     metrics: Map[String, SQLMetric],
-    specifiedPartitionStartIndices: Option[Array[Int]] = None)
+    specifiedPartitionStartIndices: Option[Array[Int]] = None,
```
How about `Option[Array[(Int, Int)]]`?
Maybe separate definitions for `specifiedPartitionStartIndices` and `specifiedPartitionEndIndices` would be clearer?
`Option[Array[(Int, Int)]]` is more type-safe. It eliminates problems like:
- `specifiedPartitionStartIndices` is specified but `specifiedPartitionEndIndices` is not.
- the two arrays have different lengths.
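A tiny sketch of the two signatures being compared; the class names are illustrative, only the parameter shapes matter:

```scala
// Two parallel optional arrays: either one can be passed without the other,
// and their lengths can silently differ.
class WithSeparateIndices(
    specifiedPartitionStartIndices: Option[Array[Int]] = None,
    specifiedPartitionEndIndices: Option[Array[Int]] = None)

// One optional array of (start, end) pairs: a start index can never exist
// without its matching end index, and the lengths always agree.
class WithPairedIndices(
    specifiedPartitionRanges: Option[Array[(Int, Int)]] = None)
```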
```scala
          partitionId, rightMapIdStartIndices(j), rightEndMapId)

        subJoins +=
          SortMergeJoinExec(leftKeys, rightKeys, joinType, condition,
```
Just for brainstorming: here we are joining 2 partitions, not 2 RDDs, so there will be no shuffle. Is it better to run a hash join than SMJ? cc @maryannxue
Furthermore, if only one side is skewed, maybe it's better to plan a broadcast hash join instead of a union of many SMJs?
Force-pushed from bbf585d to 18cdcd9
@cloud-fan I fixed the conflicts and resolved the comments. Please help review again. Thanks for your help.
```scala
var postMapPartitionSize: Long = mapPartitionSize(i)
partitionStartIndices += i
while (i < numMappers && i + 1 < numMappers) {
  val nextIndex = if (i + 1 < numMappers) {
```
Isn't this always true?
@manuzhang Thanks for your review. After an offline discussion with Wenchen, we decided to remove this method and split the skewed partition by the number of mappers.
```scala
      i + 1
    } else numMappers - 1

    if (postMapPartitionSize + mapPartitionSize(nextIndex) > advisoryTargetPostShuffleInputSize) {
```
What if this never comes true when `adaptiveSkewedSizeThreshold` is smaller than `targetPostShuffleInputSize`? I'm also wondering whether `targetPostShuffleInputSize` can be reused for the threshold.
Test build #114751 has finished for PR 26434 at commit
```scala
        }
      case None =>
        Iterator.empty
    }
  }
```
|
nit: unnecessary change
```scala
@@ -410,6 +410,27 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val ADAPTIVE_EXECUTION_SKEWED_JOIN_ENABLED = buildConf("spark.sql.adaptive.skewedJoin.enabled")
```
nit: spark.sql.adaptive.optimizeSkewedJoin.enabled
```scala
buildConf("spark.sql.adaptive.skewedPartitionFactor")
  .doc("A partition is considered as a skewed partition if its size is larger than" +
    " this factor multiple the median partition size and also larger than " +
    "spark.sql.adaptive.skewedPartitionSizeThreshold.")
```
nit: use `${ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD.key}` instead of hardcoding the config name.
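For context, the check described by the config doc quoted above boils down to something like the following sketch (parameter names are illustrative):

```scala
// A partition is considered skewed only if it is both larger than `factor` times
// the median partition size and larger than the absolute size threshold.
def isSkewed(size: Long, medianSize: Long, factor: Int, sizeThreshold: Long): Boolean =
  size > medianSize * factor && size > sizeThreshold
```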
```scala
def adaptiveSkewedFactor: Int = getConf(ADAPTIVE_EXECUTION_SKEWED_PARTITION_FACTOR)

def adaptiveSkewedSizeThreshold: Long =
```
nit: we don't need these 3 methods if they are only called 1 or 2 times.
```scala
val shuffleStageCheck = ShuffleQueryStageExec.isShuffleQueryStageExec(leftStage) &&
  ShuffleQueryStageExec.isShuffleQueryStageExec(rightStage)
val statisticsReady: Boolean = if (shuffleStageCheck) {
  getStatistics(leftStage) != null && getStatistics(rightStage) != null
```
how can the stats be null?
It seems it cannot be null. I had a wrong understanding that the leftStage and rightStage might not be done simultaneously.
```scala
  joinTypeSupported && statisticsReady
}

private def supportSplitOnLeftPartition(joinType: JoinType) = joinType != RightOuter
```
nit: it's more robust to list the supported join types.
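A possible shape for that suggestion, using the join types this PR states it supports (sketch only; the method name comes from the quoted snippet, the `private` modifier is dropped so it stands alone):

```scala
import org.apache.spark.sql.catalyst.plans._

// Explicitly list the join types whose left side may be split, instead of
// excluding RightOuter; unknown or new join types then default to "not supported".
def supportSplitOnLeftPartition(joinType: JoinType): Boolean = joinType match {
  case Inner | Cross | LeftSemi | LeftAnti | LeftOuter => true
  case _ => false
}
```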
@cloud-fan I have updated per the comments from the online and offline discussions. Please help me review again. Thanks.
Test build #4991 has finished for PR 26434 at commit

Test build #4992 has finished for PR 26434 at commit

retest this please

Test build #116654 has finished for PR 26434 at commit
create partial shuffle reader
Test build #116695 has finished for PR 26434 at commit
thanks, merging to master!
```scala
val shuffleId = stage.shuffle.shuffleDependency.shuffleHandle.shuffleId
val mapPartitionSizes = getMapSizesForReduceId(shuffleId, partitionId)
val maxSplits = math.min(conf.getConf(
  SQLConf.ADAPTIVE_EXECUTION_SKEWED_PARTITION_MAX_SPLITS), mapPartitionSizes.length)
```
Why do we need this config? It seems a bit weird that we try to use the actual size everywhere else, and we are not doing it here. I think that just using `SQLConf.ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD` for the target partition size by itself should yield good results.
According to feedback from @manuzhang's environment, in some use cases the split number may be very large (more than 1000). There would be too many SMJs, and it may take a long time to launch the job after optimizing the skewed join. The UI would also be choked by the huge graph. So we added this configuration as an upper limit on the split number.
```scala
val partitionStartIndices = ArrayBuffer[Int]()
var postMapPartitionSize = mapPartitionSizes(0)
partitionStartIndices += 0
partitionIndices.drop(1).foreach { nextPartitionIndex =>
```
In some cases writing a while loop is actually easier to understand, e.g.:
```scala
var postMapPartitionSize = advisoryPartitionSize + 1
var nextPartitionIndex = 0
while (nextPartitionIndex < mapPartitionSizes.length) {
  val nextMapPartitionSize = mapPartitionSizes(nextPartitionIndex)
  if (postMapPartitionSize + nextMapPartitionSize > advisoryPartitionSize) {
    partitionStartIndices += nextPartitionIndex
    postMapPartitionSize = 0
  }
  postMapPartitionSize += nextMapPartitionSize
  nextPartitionIndex += 1
}
```
Ok. I will update later.
```scala
  }
}

if (partitionStartIndices.size > maxSplits) {
```
This makes the last partition larger right? Isn't that adding some skew?
Yes, it may cause the last partition to be skewed. This approach cannot handle the extremely skewed use case.
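One possible way to apply the cap, consistent with the observation above that the last split absorbs the remainder (variable names follow the quoted snippet; the exact code in the PR may differ):

```scala
// Keep only the first `maxSplits` start indices; the final range then extends to
// the last mapper, so the last split can grow larger than the target size.
val cappedStartIndices =
  if (partitionStartIndices.size > maxSplits) partitionStartIndices.take(maxSplits)
  else partitionStartIndices
```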
```scala
 * This RDD takes a [[ShuffleDependency]] (`dependency`), a partitionIndex
 * and the range of startMapIndex to endMapIndex.
 */
class SkewedShuffledRowRDD(
```
Why do we need a separate implementation of the shuffled row RDD? I am wondering if we can combine them all, and have a couple of partition implementations depending on which (mapper/reducer) coordinate we need to read from.
BTW bifurcation of the query plan (part hash join, part smj) is slightly orthogonal to this.
This sounds like a good idea.
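A rough sketch of the combined-RDD idea being discussed here; the sealed trait and case class names below are illustrative, not the code in this PR:

```scala
// One general shuffled-row RDD could carry a per-partition spec describing
// which mapper/reducer coordinates to read.
sealed trait ShufflePartitionSpec

// Read reduce partitions [startReducerIndex, endReducerIndex) from all mappers
// (the coalesced-partitions case).
case class CoalescedPartitionSpec(startReducerIndex: Int, endReducerIndex: Int)
  extends ShufflePartitionSpec

// Read a single reduce partition, but only from mappers [startMapIndex, endMapIndex)
// (the skewed-partition split case).
case class PartialReducerPartitionSpec(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int)
  extends ShufflePartitionSpec

// Read every reduce partition produced by a single mapper (the local-reader case).
case class PartialMapperPartitionSpec(mapIndex: Int, startReducerIndex: Int, endReducerIndex: Int)
  extends ShufflePartitionSpec
```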
```scala
private def medianSize(stats: MapOutputStatistics): Long = {
  val numPartitions = stats.bytesByPartitionId.length
  val bytes = stats.bytesByPartitionId.sorted
  if (bytes(numPartitions / 2) > 0) bytes(numPartitions / 2) else 1
```
`math.max(bytes(numPartitions / 2), 1)`?

OCD: you could argue that this median calculation is incorrect for an even number of elements. In that case it should be `(bytes(numPartitions / 2 - 1) + bytes(numPartitions / 2)) / 2`.
Good catch. I will update it later.
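Putting both suggestions together, a corrected median could look like this sketch (operating directly on the per-partition byte sizes; assumes at least one partition):

```scala
// Median of the sorted per-partition sizes, averaging the two middle elements
// when the count is even, and never returning less than 1 so that later
// divisions by the median stay safe.
def medianSize(bytesByPartitionId: Array[Long]): Long = {
  val bytes = bytesByPartitionId.sorted
  val n = bytes.length
  val median =
    if (n % 2 == 0) (bytes(n / 2 - 1) + bytes(n / 2)) / 2
    else bytes(n / 2)
  math.max(median, 1L)
}
```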
```scala
val leftMedSize = medianSize(leftStats)
val rightMedSize = medianSize(rightStats)
val leftSizeInfo = s"median size: $leftMedSize, max size: ${leftStats.bytesByPartitionId.max}"
```
NIT you can avoid materializing the string if debug logging is not enabled by making these defs.
Ok. I will update later.
```scala
val shuffleStages = collectShuffleStages(plan)

if (shuffleStages.length == 2) {
  // Currently we only support handling skewed join for 2 table join.
```
Is this being planned?
Is it that we want to avoid things like:
```
SMJ
  SMJ (w/ or w/o Sort)
    Shuffle (w/ or w/o Sort)
    Shuffle (w/ or w/o Sort)
  Shuffle (w/ or w/o Sort)
```
i.e., an SMJ that contains a child SMJ without a Shuffle on top, so we don't want to optimize the child SMJ because it changes the outputPartitioning? If so, can we make that clearer in the comment?
and since it affects the outputPartitioning, we need to check if it breaks the operators above.
There are scenarios where we can do this for multiple joins, right? `INNER`, for example.
It may be too complex to optimize multiple joins, so we optimize the 2-table join first. We can further optimize multiple joins later.
@maryannxue Yes, we need to consider the effect of outputPartitioning. I will update it in the following PRs.
```scala
  stage.shuffle.shuffleDependency.rdd.partitions.length
}

def handleSkewJoin(plan: SparkPlan): SparkPlan = plan.transformUp {
```
In general it would be good to have some documentation for this method. What you are doing here is not completely trivial :). The same applies for the code inside the method.
Can the design doc in the "Idea" section explain this?
Can we write a summary and put it in a code comment?
Yes. I will add it later.
```scala
}
logDebug(s"number of skewed partitions is ${skewedPartitions.size}")
if (skewedPartitions.nonEmpty) {
  val optimizedSmj = smj.transformDown {
```
You can just rewrite the SMJ directly, right? The alternative is that you use transformDown and rewrite the ShuffleQueryStageExecs directly.
Yes. I will update later.
```scala
@@ -87,6 +87,10 @@ case class AdaptiveSparkPlanExec(
   // optimizations should be stage-independent.
   @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(
     ReuseAdaptiveSubquery(conf, context.subqueryCache),
     // Here the 'OptimizeSkewedPartitions' rule should be executed
```
comment out-of-date
OK. I will update later.
```scala
 * @param startMapIndex The start map index.
 * @param endMapIndex The end map index.
 */
case class SkewedPartitionReaderExec(
```
Like with the RDDs: in general I would be in favor of creating one reader node that can deal with the different kinds of shuffle reads. That avoids a sprawl of readers, and it also allows us to create a much simpler plan if we can just use 1 reader with a join instead of using a union.
Yes. We also plan to further optimize the skew reader to make the plan simpler later.
```scala
@@ -579,6 +579,153 @@ class AdaptiveQueryExecSuite
    }
  }

  test("SPARK-29544: adaptive skew join with different join types") {
```
Can we do something to increase coverage for different join types?
```scala
def handleSkewJoin(plan: SparkPlan): SparkPlan = plan.transformUp {
  case smj @ SortMergeJoinExec(leftKeys, rightKeys, joinType, condition,
      s1 @ SortExec(_, _, left: ShuffleQueryStageExec, _),
```
We don't necessarily have a `Sort` in SMJ.
Yes. We may also be able to optimize the following use case:
```
SMJ
  Shuffle
  Shuffle
```
```scala
}

def collectShuffleStages(plan: SparkPlan): Seq[ShuffleQueryStageExec] = plan match {
  case _: LocalShuffleReaderExec => Nil
```
We could lose an optimization opportunity here. You could have a situation where we converted an SMJ to a BHJ and there is a shuffled join on top of this. Anyway, this will require some more changes.
This might even be a correctness problem if the LocalShuffleReader produces a partitioning that is actually leveraged later in the stage.
Good idea. We can do further optimization later.
We can remove these "reader" cases here:
- this rule applies first, so we won't see these readers anyway.
- even if this rule is not the first to apply, we should not skip them, otherwise we could miscount. We are not matching them in transform anyway, so we are safe to ignore them here.
@JkSelf I am very excited about this work, this will improve the end-user experience for a lot of users. I have left some additional comments; I hope you don't mind that I am late to the party.
A follow-up commit referenced on this PR ("…in optimizations", from #27493):

### What changes were proposed in this pull request?

This is a followup of #26434. This PR uses one special shuffle reader for skew join, so that we only have one join after optimization. In order to do that, this PR

1. adds a very general `CustomShuffledRowRDD` which supports all kinds of partition arrangements.
2. moves the logic of coalescing shuffle partitions to a util function, and calls it during skew join optimization, to totally decouple it from the `ReduceNumShufflePartitions` rule. It's too complicated to interleave skew join with `ReduceNumShufflePartitions`, as you need to consider the size of split partitions which don't respect the target size already.

### Why are the changes needed?

The current skew join optimization has a serious performance issue: the size of the query plan depends on the number and size of skewed partitions.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests; the UI was tested manually. Explain output:

```
AdaptiveSparkPlan(isFinalPlan=true)
+- OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$403a2ed5, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap1f
   +- *(5) SortMergeJoin(skew=true) [key1#2L], [key2#6L], Inner
      :- *(3) Sort [key1#2L ASC NULLS FIRST], false, 0
      :  +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
      :     +- ShuffleQueryStage 0
      :        +- Exchange hashpartitioning(key1#2L, 200), true, [id=#53]
      :           +- *(1) Project [(id#0L % 2) AS key1#2L]
      :              +- *(1) Filter isnotnull((id#0L % 2))
      :                 +- *(1) Range (0, 100000, step=1, splits=6)
      +- *(4) Sort [key2#6L ASC NULLS FIRST], false, 0
         +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
            +- ShuffleQueryStage 1
               +- Exchange hashpartitioning(key2#6L, 200), true, [id=#64]
                  +- *(2) Project [((id#4L % 2) + 1) AS key2#6L]
                     +- *(2) Filter isnotnull(((id#4L % 2) + 1))
                        +- *(2) Range (0, 100000, step=1, splits=6)
```

Closes #27493 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: herman <herman@databricks.com>
What changes were proposed in this pull request?
Skew join is common and can severely degrade the performance of queries, especially those with joins. This PR aims to optimize skew joins based on runtime map output statistics by adding an "OptimizeSkewedPartitions" rule. The detailed design doc is here. Currently we support the "Inner, Cross, LeftSemi, LeftAnti, LeftOuter, RightOuter" join types.
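Roughly, for each skewed reduce partition the rule reads sub-ranges of the mappers on each side and unions the resulting sort-merge joins with the join over the remaining partitions. A simplified, illustrative plan shape (not actual explain output from this PR):

```
Union
:- SortMergeJoin                                   -- skewed partition P, one pair of mapper ranges
:  :- SkewedPartitionReader(P, left,  mappers [a, b))
:  +- SkewedPartitionReader(P, right, mappers [c, d))
:- SortMergeJoin                                   -- ...one join per pair of mapper ranges
+- SortMergeJoin                                   -- all non-skewed partitions, read as usual
```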
Why are the changes needed?
To optimize skewed partitions at runtime based on AQE.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
We have tested this PR with the following SQL in a 5-node cluster.
The main Spark configuration is:
This PR gains about a 6x performance improvement (27s vs. 162s). The UI with and without this PR is shown below.
With this PR: (UI screenshot)
Without this PR: (UI screenshot)