[SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD #4145


Closed
kayousterhout wants to merge 3 commits into apache:master from kayousterhout:SPARK-5360

Conversation

kayousterhout
Contributor

CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object.
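
To see the mechanism concretely, here is a minimal, self-contained sketch using plain Java serialization rather than Spark (Handle, Rdd, and Part are made-up stand-ins for the ShuffleHandle, the RDD, and the partition, not Spark classes): shared references survive a single serialization pass, but are duplicated across two separate passes.

```scala
import java.io._

// Hypothetical stand-ins for a ShuffleHandle shared by an RDD and one of its partitions.
class Handle(val id: Int) extends Serializable
class Rdd(val handle: Handle) extends Serializable
class Part(val handle: Handle) extends Serializable

object SeparateSerializationDemo {
  // Serialize and immediately deserialize an object graph.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val handle = new Handle(42)
    val rdd = new Rdd(handle)
    val part = new Part(handle)

    // One stream: the shared handle is written once, so reference equality survives.
    val (rdd1, part1) = roundTrip((rdd, part))
    println(rdd1.handle eq part1.handle) // true

    // Two streams, as with the separately serialized RDD and Partition:
    // the handle is written twice and comes back as two distinct objects.
    val (rdd2, part2) = (roundTrip(rdd), roundTrip(part))
    println(rdd2.handle eq part2.handle) // false
  }
}
```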

This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the RDDs and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear because the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure whether this is enough of a potential future problem to justify changing this old and central part of the code, so I'm hoping to get input from others here.

I did some simple experiments to see how much this affects closure size. For this example:
$ val a = sc.parallelize(1 to 10).map((_, 1))
$ val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
$ a.cogroup(b).collect()
the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle.

For this example:
$ val sortedA = a.sortByKey()
$ val sortedB = b.sortByKey()
$ sortedA.cogroup(sortedB).collect()
the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.

The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). It would also get bigger for a big RDD -- although I can't think of any examples where the RDD object gets large. The difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.
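
For anyone who wants to reproduce these numbers approximately, a rough spark-shell sketch is below. It Java-serializes the cogrouped RDD together with one of its partitions, which is close to, but not exactly, what Spark ships in a task, so expect byte counts in the same ballpark rather than matching ones. serializedSize is a hypothetical helper, not a Spark API, and an active SparkContext `sc` is assumed.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Measure the Java-serialized size of an object graph.
def serializedSize(obj: AnyRef): Int = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)
  out.close()
  buf.size()
}

val a = sc.parallelize(1 to 10).map((_, 1))
val b = sc.parallelize(1 to 2).map(x => (x, 2 * x))
val cogrouped = a.cogroup(b)

// RDD and partition serialized together, mirroring roughly what a task carries;
// before this patch the partition dragged in duplicate copies of the shuffle handle.
println(serializedSize((cogrouped, cogrouped.partitions(0))))
```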

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25917 has finished for PR 4145 at commit 912d48d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@suyanNone
Contributor

@JoshRosen Can someone verify this patch?

@kayousterhout kayousterhout changed the title [SPARK-5360] Eliminate duplicate objects in serialized CoGroupedRDD [SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD Apr 6, 2015

/** The references to rdd and splitIndex are transient because redundant information is stored
* in the CoGroupedRDD object. Because CoGroupedRDD is serialized separately from
* CoGrpupPartition, if rdd and splitIndex aren't transient, they'll be included twice in the
* task closure. */
Contributor
nit pick: spelling, should be CoGroupPartition

* corresponding index.
*/
private[spark] class CoGroupPartition(
idx: Int, val narrowDeps: Array[Option[NarrowCoGroupSplitDep]])
Contributor

as discussed offline, let's make it explicit that the size of the array == number of parents.
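
For context on the pattern being reviewed here: marking a field @transient keeps it out of the serialized partition so it isn't shipped a second time, and the enclosing, separately serialized object is then responsible for supplying it on the worker. A generic sketch of that idea follows (made-up class names, not the actual patch):

```scala
import java.io._

// `data` lives in the enclosing object, so this stand-in parent is ordinary.
class Parent(val data: Vector[Int]) extends Serializable

// `parent` can be re-derived from the separately serialized enclosing object,
// so it is marked @transient and skipped during serialization.
class Child(@transient val parent: Parent, val index: Int) extends Serializable

object TransientDemo {
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val child = roundTrip(new Child(new Parent(Vector(1, 2, 3)), 0))
    println(child.parent) // null: not serialized, so not duplicated in the closure;
                          // the owning object must restore it on the remote side
    println(child.index)  // 0: non-transient fields round-trip normally
  }
}
```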

@rxin
Contributor

rxin commented Apr 7, 2015

LGTM otherwise!

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29812 has finished for PR 4145 at commit 229d263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30189 has finished for PR 4145 at commit 85156c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Assignment(id: Long, cluster: Int)
  • This patch does not change any dependencies.

@kayousterhout
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Apr 20, 2015

Test build #30604 has finished for PR 4145 at commit 85156c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@asfgit asfgit closed this in c035c0f Apr 21, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
[SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD


Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#4145 from kayousterhout/SPARK-5360 and squashes the following commits:

85156c3 [Kay Ousterhout] Better comment the narrowDeps parameter
cff0209 [Kay Ousterhout] Fixed spelling issue
658e1af [Kay Ousterhout] [SPARK-5360] Eliminate duplicate objects in serialized CoGroupedRDD
@kayousterhout kayousterhout deleted the SPARK-5360 branch April 12, 2017 00:42