[SPARK-11517][SQL]Calc partitions in parallel for multiple partitions table #9483
Conversation
Test build #45083 has finished for PR 9483 at commit
cc @scwf @Sephiroth-Lin, not sure if you guys have time to benchmark this with real-world cases.
Hi @zhichao-li, thanks for doing this. I had a problem with partitions being scanned slowly, so I applied this patch to my Spark version. I am happy to see it takes effect in my case; it solves my problem. Would it be better to add a conf to control whether this feature is used?
rdds: Seq[RDD[T]]) extends UnionRDD[T](sc, rdds) {
  // TODO: We might need to guess a more reasonable thread pool size here
  @transient val executorService = ThreadUtils.newDaemonFixedThreadPool(
    Math.min(rdds.size, Runtime.getRuntime.availableProcessors()), "ParallelUnionRDD")
Should we share a single thread pool instead of creating a thread pool for every ParallelUnionRDD?
I don't have a strong opinion on this. How about creating a shared thread pool with the same size as the number of CPU cores?

object ParallelUnionRDD {
  val executorService = ThreadUtils.newDaemonFixedThreadPool(
    Runtime.getRuntime.availableProcessors(), "ParallelUnionRDD")
}
I don't think we have to pin the pool size to Runtime.getRuntime.availableProcessors(); we could probably just use a fixed number, say 16 or even bigger, since the bottleneck is network/IO, not CPU scheduling.
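A minimal sketch of that suggestion (the pool size of 16 is illustrative, not a value from the merged code): one daemon pool in the companion object, shared by every ParallelUnionRDD instance and sized for IO-bound work rather than for the core count.

import org.apache.spark.util.ThreadUtils

object ParallelUnionRDD {
  // Assumption: a fixed size works here because the submitted tasks mostly
  // wait on network/IO (e.g. listing partition locations), not on CPU.
  private val poolSize = 16
  val executorService =
    ThreadUtils.newDaemonFixedThreadPool(poolSize, "ParallelUnionRDD")
}

Since the pool lives in a companion object it is created once per JVM, and daemon threads will not keep the driver process alive on shutdown.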
    }))
  }.map { case (r, f) => (r, f.get()) }

  val array = new Array[Partition](rddPartitions.map(_._2.length).sum)
It seems this still runs on the main thread, so we probably don't even need the synchronized in getPartitions.
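For what it's worth, a tiny self-contained sketch (not from the PR) of why the results can be read without locking: Future.get() blocks the calling driver thread until the task finishes and establishes a happens-before edge, so everything the worker thread wrote is visible afterwards.

import java.util.concurrent.{Callable, Executors}

object BlockingGetDemo {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    val futures = Seq(1, 2, 3).map { i =>
      pool.submit(new Callable[Int] { override def call(): Int = i * i })
    }
    // get() blocks here; once it returns, each result is safely visible
    // to this thread, so no synchronized block is needed to consume it.
    val results = futures.map(_.get())
    println(results) // List(1, 4, 9)
    pool.shutdown()
  }
}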
Force-pushed from 63dc9c0 to 6456f12.
Test build #51856 has finished for PR 9483 at commit
Test build #51861 has finished for PR 9483 at commit
retest this please
Test build #51920 has finished for PR 9483 at commit
import org.apache.spark.rdd.{RDD, UnionPartition, UnionRDD}
import org.apache.spark.util.ThreadUtils

object ParallelUnionRDD {
Make this private[hive], or move it into the upper-level package? The same for the class ParallelUnionRDD.
LGTM except for some minor suggestions.
Test build #52152 has finished for PR 9483 at commit
retest this please
Test build #52156 has finished for PR 9483 at commit
retest this please
Test build #52210 has finished for PR 9483 at commit
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket.
Currently we compute getPartitions for each Hive partition sequentially; it would be faster if we parallelized this on the driver side.
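A minimal end-to-end sketch of the idea, reconstructed from the diff fragments quoted in the review above (class and field names match the PR; the body is an approximation, not the merged code): each child RDD's partitions are computed on a thread pool, then flattened into the usual UnionPartition array on the driver.

import java.util.concurrent.Callable

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext}
import org.apache.spark.rdd.{RDD, UnionPartition, UnionRDD}
import org.apache.spark.util.ThreadUtils

class ParallelUnionRDD[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]])
  extends UnionRDD[T](sc, rdds) {

  // TODO from the PR: guess a more reasonable thread pool size here.
  @transient private val executorService = ThreadUtils.newDaemonFixedThreadPool(
    math.min(rdds.size, Runtime.getRuntime.availableProcessors()), "ParallelUnionRDD")

  override def getPartitions: Array[Partition] = {
    // Kick off one partitions computation per child RDD on the pool, then
    // block on each future; get() runs on the driver thread.
    val rddPartitions = rdds.map { rdd =>
      (rdd, executorService.submit(new Callable[Array[Partition]] {
        override def call(): Array[Partition] = rdd.partitions
      }))
    }.map { case (r, f) => (r, f.get()) }

    // Flatten the per-child partitions into UnionPartitions, as UnionRDD does.
    val array = new Array[Partition](rddPartitions.map(_._2.length).sum)
    var pos = 0
    for (((rdd, parts), rddIndex) <- rddPartitions.zipWithIndex; split <- parts) {
      array(pos) = new UnionPartition(pos, rdd, rddIndex, split.index)
      pos += 1
    }
    array
  }
}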