[SPARK-6886] [PySpark] fix big closure with shuffle #5496

Closed
davies wants to merge 1 commit into master from davies:big_closure

Conversation

davies (Contributor) commented Apr 13, 2015

Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD will be created in the JVM while the RDD in Python may be GCed, so the broadcast can be destroyed in the JVM before the PythonRDD is done with it.

This PR changes this to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of a PythonRDD, which could be heavy.
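
For context, the getNumPartitions() refactor amounts to something like the following paraphrased sketch (based on this description, not a verbatim diff):

    # On a pipelined RDD, touching self._jrdd forces creation of a
    # PythonRDD in the JVM; counting partitions does not need that, so
    # ask the upstream JavaRDD handle directly.
    def getNumPartitions(self):
        return self._prev_jrdd.partitions().size()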

cc @JoshRosen

SparkQA commented Apr 13, 2015

Test build #30192 has finished for PR 5496 at commit 9a0ea4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Assignment(id: Long, cluster: Int)
  • This patch does not change any dependencies.

JoshRosen (Contributor)

This fix looks good to me. There are two contributors to the bug here:

  1. In certain Python-driver-side operations, like groupBy(), we create RDDs that implicitly reference the previous RDD via its JavaRDD rather than by holding an explicit reference to the parent Python RDD object. This can result in the Python driver's RDD object being garbage-collected even though the Java PythonRDD object sticks around, due to the reference from a child RDD. To see this more clearly, note that there are places in rdd.py where we call RDD(self._jrdd, ...) without actually storing a reference to self in the new derived RDD (see the sketch after this list).
  2. When an RDD is garbage-collected in the Python driver, its __del__ method in rdd.py manually unpersists the broadcast created for its closure even though there are still references to it.
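
To make contributor (1) concrete, here is a minimal sketch of that reference pattern (illustrative only, not a verbatim excerpt from rdd.py):

    # `child` is built from the parent's JVM handle alone; nothing on
    # `child` keeps the Python `parent` object itself alive.
    child = RDD(parent._jrdd, parent.ctx, parent._jrdd_deserializer)

    # Python's GC is now free to collect `parent` (running its __del__),
    # even though the JVM-side PythonRDD stays reachable through the
    # child's lineage.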

This problem only manifests itself if Python closures are very large (large enough to trip the 1MB threshold which causes us to broadcast them) and are referenced by intermediate Python RDDs that are garbage-collected.
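
To make the failure mode concrete, a minimal repro along these lines (illustrative; assumes an active SparkContext sc, with sizes chosen only to exceed the 1MB threshold) would be:

    # The lambda captures `payload`, so the pickled closure exceeds 1MB
    # and is shipped as a broadcast; groupByKey() introduces a shuffle,
    # making this a multi-stage job whose intermediate Python RDD can be
    # garbage-collected.
    payload = [float(i) for i in range(200000)]

    result = (sc.parallelize(range(100), 4)
                .map(lambda x: (x % 3, x + len(payload)))
                .groupByKey()
                .count())
    print(result)  # pre-fix, this could fail once the broadcast was destroyed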

The fix implemented in this patch is to remove that __del__ call and leave it up to ContextCleaner to manage the Broadcast cleanup (sketched after the list below). To verify that this won't introduce memory leaks, I took a look at how we track references to the Python broadcast in branch-1.2:

  • When we create a Broadcast object in Python, we capture a reference to the Java broadcast:
    self._jbroadcast = sc._jvm.PythonRDD.readBroadcastFromFile(sc._jsc, self._path)
    Aside from this reference, we don't keep any other references to the Java broadcast inside of the Python driver.
  • We do not maintain a Python-driver-side registry of these Python broadcast objects, so I don't think we have to worry about leaked references keeping the Python broadcast alive and preventing the Java broadcast from being garbage-collected.
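
A paraphrased before/after sketch of the change, based on the description above (illustrative, not the exact branch-1.2 source):

    # Before: PipelinedRDD.__del__ in rdd.py eagerly unpersisted the
    # broadcast holding its closure as soon as the Python RDD wrapper was
    # collected, even though the JVM-side PythonRDD could still need it.
    class PipelinedRDD(RDD):
        def __del__(self):
            if self._broadcast:
                self._broadcast.unpersist()  # removed by this patch
                self._broadcast = None

    # After: no __del__ here; once nothing references the broadcast,
    # Spark's ContextCleaner reclaims it at a safe time.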

An alternative fix would be to prevent Python-side RDDs from being garbage-collected until their corresponding Java RDDs are no longer referenced, but even if we did make this change I think it's a good idea to leave the cleaning of broadcasts to Spark's ContextCleaner rather than trying to manage it here.
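
That alternative would look roughly like this (a hypothetical sketch; _parent_rdd is an invented attribute, not actual PySpark code):

    # Hypothetical: pin the Python parent so its lifetime matches the
    # JVM-side lineage.
    child = RDD(parent._jrdd, parent.ctx, parent._jrdd_deserializer)
    child._parent_rdd = parent  # keeps `parent` alive as long as `child` is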

I tested this out manually in an IPython REPL and broadcasts seem to be cleaned up at the right times.
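
A manual check of the kind described might look like this in a REPL (illustrative only; assumes a live SparkContext sc):

    import gc

    big = [float(i) for i in range(200000)]  # >1MB pickled, so the closure is broadcast
    rdd = sc.parallelize(range(10)).map(lambda x: (x % 2, x + len(big)))
    grouped = rdd.groupByKey()

    del rdd       # drop the intermediate Python RDD object
    gc.collect()  # pre-fix: this ran __del__ and unpersisted the broadcast too early

    print(grouped.count())  # with the fix, the broadcast survives and this succeeds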

Since this looks good to me, I'm going to merge this into master (1.4.0) and branch-1.3 (1.3.2). I'll try to cherry-pick into branch-1.2 (1.2.3), but we may have to open a separate PR if I can't fix the conflicts easily. Thanks @davies for fixing this!

asfgit pushed a commit that referenced this pull request Apr 15, 2015
Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD will be created in the JVM while the RDD in Python may be GCed, so the broadcast can be destroyed in the JVM before the PythonRDD is done with it.

This PR changes this to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of a PythonRDD, which could be heavy.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #5496 from davies/big_closure and squashes the following commits:

9a0ea4c [Davies Liu] fix big closure with shuffle

(cherry picked from commit f11288d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit closed this in f11288d Apr 15, 2015
JoshRosen (Contributor)

Finished the 1.2 backport as well.

asfgit pushed a commit that referenced this pull request Apr 15, 2015
Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD will be created in the JVM while the RDD in Python may be GCed, so the broadcast can be destroyed in the JVM before the PythonRDD is done with it.

This PR changes this to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of a PythonRDD, which could be heavy.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #5496 from davies/big_closure and squashes the following commits:

9a0ea4c [Davies Liu] fix big closure with shuffle

Conflicts:
	python/pyspark/rdd.py
davies (Contributor, Author) commented Apr 15, 2015

@JoshRosen Thanks!
