
Conversation

@rdblue (Contributor) commented Aug 23, 2018

What changes were proposed in this pull request?

The v2 API always adds a projection when converting to a physical plan, to ensure that all rows are UnsafeRow. This projection is added after any filters run by Spark, on the assumption that the filter and projection can handle InternalRow, but this fails if those nodes contain Python UDFs. This PR detects the Python UDFs and adds a projection above the filter to convert to UnsafeRow immediately, before passing data to Python.
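For context, the failure can be reproduced with a query as small as the one in the test added by this patch. A minimal sketch, assuming an active SparkSession named spark and Spark's test-only SimpleDataSourceV2 class on the classpath:

from pyspark.sql.functions import udf

# Read through the v2 API; the scan produces InternalRow, not UnsafeRow.
df = spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()

# Before this fix, evaluating a scalar Python UDF on top of the scan failed,
# because the Python evaluation path expects UnsafeRow input.
result = df.withColumn('x', udf(lambda x: x, 'int')(df['i']))
result.collect()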

How was this patch tested?

This adds a test for the case reported in SPARK-25213 to PySpark's SQL tests.


@SparkQA commented Aug 23, 2018

Test build #95173 has finished for PR 22206 at commit c49157e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ProjectExec(project, withFilter) :: Nil
if (project.exists(hasScalarPythonUDF)) {
  val references = project.map(_.references).reduce(_ ++ _).toSeq
  ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
Member

Why do we need to add an extra Project on top of the Filter here?

Contributor Author

The v2 data sources return InternalRow, not UnsafeRow. Python UDFs can't handle InternalRow, so this is intended to add a projection that converts to UnsafeRow before the projection that contains a Python UDF.

Member

Oh, I see. It is also used to make sure the PythonUDF in the top Project takes UnsafeRow input.

Member

Nit: if we already add a Project on top of the Filter, we don't need to add another Project here, right?

Contributor Author

That one was only added if there was a filter and that filter ran a UDF. This will add an unnecessary Project if both the filter and the project have Python UDFs, but I thought that was probably okay. I can add a boolean to signal whether the filter already caused one to be added, if you think it's worth it.

Member

OK. Let's leave it as it is for now.

Member

+1 for leaving as is.

from pyspark.sql.functions import udf

df = self.spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
result = df.withColumn('x', udf(lambda x: x, 'int')(df['i']))
Member

This only tests Project with a scalar PythonUDF? It might be better to also test the Filter case.

Contributor Author

Agreed. I was just verifying that the fix worked before spending more time on it.
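A rough sketch of what that Filter test could look like, assuming the same SimpleDataSourceV2 test source; the UDF and the final call are illustrative only and not part of this patch:

from pyspark.sql.functions import udf

df = self.spark.read.format(
    "org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
# Run a Python UDF inside a Filter over the v2 scan; before the fix this
# also hit the InternalRow/UnsafeRow mismatch.
keep = udf(lambda x: x is not None, 'boolean')
# The point is only that planning and execution succeed; the exact rows
# depend on the data SimpleDataSourceV2 produces.
df.filter(keep(df['i'])).collect()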

@SparkQA commented Aug 24, 2018

Test build #95187 has finished for PR 22206 at commit 550368e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def test_pyspark_udf_SPARK_25213(self):
    from pyspark.sql.functions import udf

    df = self.spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
Member

I think this test will fail if the test classes are not compiled. Can we check whether the test classes are compiled and skip the test if they are not?
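One possible guard, sketched below; the class-loading check via py4j and the skip message are assumptions for illustration, not part of this patch:

from py4j.protocol import Py4JError
from pyspark.sql.tests import ReusedSQLTestCase  # base class used by PySpark's SQL tests

class DataSourceV2Tests(ReusedSQLTestCase):
    def test_pyspark_udf_SPARK_25213(self):
        from pyspark.sql.functions import udf

        # Skip when the JVM test classes that provide SimpleDataSourceV2
        # are not compiled / not on the classpath.
        try:
            self.spark.sparkContext._jvm.java.lang.Class.forName(
                "org.apache.spark.sql.sources.v2.SimpleDataSourceV2")
        except Py4JError:
            self.skipTest("SQL test classes are not compiled")

        df = self.spark.read.format(
            "org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
        result = df.withColumn('x', udf(lambda x: x, 'int')(df['i']))
        result.collect()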



class DataSourceV2Tests(ReusedSQLTestCase):
    def test_pyspark_udf_SPARK_25213(self):
Member

Not a big deal, but I would avoid the SPARK_25213 suffix at the end, just for consistency.

Contributor Author

I like that the tests in Scala include this information somewhere. Is there a better place for it in PySpark? I'm not aware of another way to attach this kind of metadata, but I'm open to it if there's a better way.

@HyukjinKwon (Member)

Not a big deal, but the PR title should be [SPARK-25213][PYTHON] ... per the guide.

@rdblue changed the title from "SPARK-25213: Add project to v2 scans before python filters." to "[SPARK-25213][PYTHON] Add project to v2 scans before python filters." on Aug 27, 2018
@rdblue (Contributor Author) commented Aug 27, 2018

@HyukjinKwon and @viirya, thank you for looking at this commit, but I like @cloud-fan's approach to fixing this in #22244 better than this work-around. I'm going to close this in favor of that approach, although if we need a quick fix I can pick this back up.
