[SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark #4629
Conversation
Test build #27573 has started for PR 4629 at commit
    } else {
      new UnionRDD(this, rdds)
    }
  }

  /** Build the union of a list of RDDs passed as variable-length arguments. */
  def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
Can we change this method to call the `union` method that you modified so the change will take effect here, too?
fixed
Test build #27582 has started for PR 4629 at commit
Test build #27583 has started for PR 4629 at commit
Test build #27587 has started for PR 4629 at commit
Test build #27573 has finished for PR 4629 at commit
Test PASSed.
Test build #27582 has finished for PR 4629 at commit
Test FAILed.
Test build #610 has started for PR 4629 at commit
Test build #27583 has finished for PR 4629 at commit
Test PASSed.
Test build #27587 has finished for PR 4629 at commit
Test PASSed.
Test build #610 has finished for PR 4629 at commit
@@ -961,11 +961,18 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  }

  /** Build the union of a list of RDDs. */
  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
    val partitioners = rdds.map(_.partitioner).toSet
If `_.partitioner` is an Option, then I think this can be simplified by using `flatMap` instead of `map`, since that would let you just check whether `partitioners.size == 1` on the next line without needing the `isDefined` check as well.
LGTM overall; this is tricky logic, though, so I'll take one more pass through when I get home.
Test build #27612 has started for PR 4629 at commit
Test build #27612 has finished for PR 4629 at commit
Test FAILed.
Test build #611 has started for PR 4629 at commit
Test build #611 has finished for PR 4629 at commit
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
        converted_rdd = RDD(data_python_rdd, self.sc)
        self.assertEqual(2, converted_rdd.count())

    def test_narrow_dependency_in_join(self):
        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
Do these tests actually check for a narrow dependency at all? I think they will pass even without it. I'm not sure of a better suggestion, though. I had to use `getNarrowDependencies` in another PR to check this:
https://github.com/apache/spark/pull/4449/files#diff-4bc3643ce90b54113cad7104f91a075bR582
but I don't think that is even exposed in pyspark ...
This test is only for correctness; I will add more checks for the narrow dependency based on the Python progress API (#3027).
I've merged #3027, so I think we can now test this by setting a job group, running a job, then querying the statusTracker to determine how many stages were actually run as part of that job.
done!
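For reference, here is a minimal standalone sketch of that stage-counting approach (the group name, partition counts, and local master are illustrative, not the exact test that was merged):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "narrow-dependency-check")
tracker = sc.statusTracker()

# Two pair RDDs partitioned by the same partitioner (same partition count).
a = sc.parallelize(range(10)).map(lambda x: (x, x)).partitionBy(2)
b = sc.parallelize(range(10)).map(lambda x: (x, 2 * x)).partitionBy(2)

# Tag the job with a group so it can be looked up in the status tracker.
sc.setJobGroup("narrow_join", "join of co-partitioned RDDs", True)
result = a.join(b, 2).collect()

# With a narrow dependency, the join job should not add an extra shuffle
# stage beyond the ones created by the two partitionBy() calls.
job_id = tracker.getJobIdsForGroup("narrow_join")[0]
print("stages in join job:", len(tracker.getJobInfo(job_id).stageIds))

sc.stop()
```

The idea is that the status tracker exposes how many stages a job actually ran, which is an indirect but observable signal of whether the extra shuffle was avoided.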
nice!
Test build #27657 has started for PR 4629 at commit
Test build #27657 has finished for PR 4629 at commit
Test PASSed.
Thanks for adding the test. LGTM, so I'm going to merge this into
Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs have the same partitioner, so an unnecessary extra shuffle stage is introduced. The Python implementation of cogroup/join is different from the Scala one: it depends on union() and partitionBy(). This patch tries to use PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `reservePartitioner` in all the map() or mapPartitions() calls, so partitionBy() can skip the unnecessary shuffle stage.

Author: Davies Liu <davies@databricks.com>

Closes #4629 from davies/narrow and squashes the following commits:

dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
4d29932 [Davies Liu] address comment
cc28d97 [Davies Liu] add unit tests
940245e [Davies Liu] address comments
ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
eb26c62 [Davies Liu] narrow dependency in PySpark

(cherry picked from commit c3d2b90)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
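To illustrate what the change enables, here is a minimal usage sketch (RDD contents, partition counts, and the local master are illustrative), assuming two PySpark RDDs that share a partitioner:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "copartitioned-join")

# Both RDDs are hash-partitioned into the same number of partitions,
# so they end up with equal partitioners.
users = sc.parallelize(range(1000)).map(lambda x: (x, "user-%d" % x)).partitionBy(8)
scores = sc.parallelize(range(1000)).map(lambda x: (x, x * 10)).partitionBy(8)

# PySpark's join/cogroup is built on union() + partitionBy(). With this patch,
# union() keeps the common partitioner and the subsequent partitionBy(8)
# recognizes it, so the join is computed with a narrow dependency instead of
# another shuffle.
joined = users.join(scores, 8)

print(joined.getNumPartitions())  # 8: the common partitioner is preserved
print(joined.take(3))

sc.stop()
```

Before this patch, the same join would have gone through an additional shuffle stage even though both inputs were already partitioned identically.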