
[SPARK-10714][SPARK-8632][SPARK-10685][SQL] Refactor Python UDF handling #8835


Closed
wants to merge 4 commits into master from rxin:python-iter-refactor

Conversation


@rxin rxin commented Sep 19, 2015

This patch refactors Python UDF handling:

  1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner takes an iterator as input and produces an iterator as output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call (see the sketch below this description), or in the future in an environment without RDDs.
  2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
  3. Update BatchPythonEvaluation to consume its input only once, rather than twice. This should fix the Python UDF performance regression in Spark 1.5.

There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small.

This basically implements the approach in #8833, but with some code moved around so that correctness doesn't depend on the inner workings of Spark serialization and task execution.
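
To make the decoupling concrete, here is a minimal Scala sketch of how a runner that only consumes and produces iterators could be driven from a mapPartitions call. The PythonRunner import path, the compute() signature, and the helper name evalWithPython are illustrative assumptions based on the description above, not the exact merged API.

    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.api.python.PythonRunner  // assumed package

    // Drive a PythonRunner from an ordinary per-partition transformation.
    // Because the runner only sees iterators, it carries no RDD dependency.
    def evalWithPython(
        input: RDD[Array[Byte]],
        runner: PythonRunner): RDD[Array[Byte]] = {
      input.mapPartitionsWithIndex { (partitionIndex, iter) =>
        runner.compute(iter, partitionIndex, TaskContext.get())
      }
    }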

@rxin force-pushed the python-iter-refactor branch from 62fdea2 to 8bfe6c4 on September 19, 2015 06:53
@rxin changed the title from "[WIP] Refactor PythonRDD to decouple iterator computation from PythonRDD." to "[SPARK-10714] Refactor PythonRDD to decouple iterator computation from PythonRDD." on Sep 19, 2015

rxin commented Sep 19, 2015

cc @davies


SparkQA commented Sep 19, 2015

Test build #42712 has finished for PR 8835 at commit 8bfe6c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriterThread(


davies commented Sep 19, 2015

Could you try to re-use the code in Python UDF?


rxin commented Sep 20, 2015

@davies I've updated BatchPythonEvaluation too.

@rxin changed the title from "[SPARK-10714] Refactor PythonRDD to decouple iterator computation from PythonRDD." to "[SPARK-10714][SPARK-8632][SPARK-10685] Refactor Python UDF execution" on Sep 20, 2015
@rxin changed the title from "[SPARK-10714][SPARK-8632][SPARK-10685] Refactor Python UDF execution" to "[SPARK-10714][SPARK-8632][SPARK-10685][SQL] Refactor Python UDF handling" on Sep 20, 2015
@@ -342,51 +348,57 @@ case class BatchPythonEvaluation(udf: PythonUDF, output: Seq[Attribute], child:
override def canProcessSafeRows: Boolean = true

protected override def doExecute(): RDD[InternalRow] = {
-    val childResults = child.execute().map(_.copy())
+    val inputRDD = child.execute()
I think we should keep the copy() here.

@rxin: good idea
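
For context on why copy() matters here: Spark SQL iterators may return the same mutable InternalRow instance from every next() call, mutating it in place, so an operator that buffers its input (as BatchPythonEvaluation does with its queue) must copy each row first. A minimal sketch with hypothetical names:

    import scala.collection.mutable
    import org.apache.spark.sql.catalyst.InternalRow

    val inputBuffer = mutable.Queue[InternalRow]()

    // Buffer a row for later matching with the corresponding Python output.
    // Without copy(), every queued entry would alias one reused row object
    // and silently reflect whatever row the iterator produced last.
    def bufferRow(row: InternalRow): Unit = {
      inputBuffer += row.copy()
    }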


SparkQA commented Sep 20, 2015

Test build #42719 has finished for PR 8835 at commit 8d3c495.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriterThread(


SparkQA commented Sep 20, 2015

Test build #42724 has finished for PR 8835 at commit 5e55bf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriterThread(

*
* For each row we send to Python, we also put it in a queue. For each output row from Python,
* we drain the queue to find the original input row. Note that if the Python process is way too
* slow, this could lead to the queue growing unbounded and eventually exhausting memory.

Could we mitigate this by using a LinkedBlockingDeque to have the producer-side block on inserts once the queue grows to a certain size?


Per discussion offline, the only scenario where the queue can grow really large is when the Python buffer size has been configured to be very large and the UDF result rows are very small. As a result, I think that this comment should be expanded / clarified, but this can take place in a followup PR.
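
A minimal sketch of the bounded-queue mitigation suggested above (not the code merged in this PR; the capacity value and helper names are assumptions). A bounded deque makes the writer thread block once the number of in-flight rows reaches the capacity, instead of letting the buffer grow without bound:

    import java.util.concurrent.LinkedBlockingDeque
    import org.apache.spark.sql.catalyst.InternalRow

    val capacity = 1024  // assumed tuning knob
    val inFlightRows = new LinkedBlockingDeque[InternalRow](capacity)

    // Producer side (writer thread): blocks once `capacity` rows are queued.
    def enqueue(row: InternalRow): Unit = inFlightRows.put(row.copy())

    // Consumer side: blocks until the matching input row is available.
    def dequeue(): InternalRow = inFlightRows.take()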

@JoshRosen

Based on some offline discussion / debate, we've decided to merge this patch into both master and branch-1.5. I'm going to merge this now.

@asfgit closed this in a96ba40 on Sep 22, 2015
asfgit pushed a commit that referenced this pull request Sep 22, 2015
[SPARK-10714][SPARK-8632][SPARK-10685][SQL] Refactor Python UDF handling

Author: Reynold Xin <rxin@databricks.com>

Closes #8835 from rxin/python-iter-refactor.

(cherry picked from commit a96ba40)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>