[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases #20142

icexelloss · 2018-01-03T22:00:41Z

What changes were proposed in this pull request?

Add tests for using non-deterministic UDFs in aggregate.

Update the pandas_udf docstring w.r.t. determinism.

How was this patch tested?

test_nondeterministic_udf_in_aggregate

icexelloss · 2018-01-03T22:01:47Z

cc @gatorsmile

SparkQA · 2018-01-03T22:31:41Z

Test build #85644 has finished for PR 20142 at commit 1f4183f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2018-01-04T19:15:51Z

Thanks for doing this, @icexelloss! Should we add a non-deterministic test for pandas_udf like the one here?

icexelloss · 2018-01-04T19:22:29Z

@BryanCutler Yeah I think we could. Let me add it.

icexelloss · 2018-01-05T21:15:11Z

I added the test. @gatorsmile, do you have time to take a look, or could you let me know whom I should ping for review?

BryanCutler

Not sure how strict we want the testing to be here, but we might want to verify that nonDeterministic is working correctly, not just that it's a valid pandas_udf.

BryanCutler · 2018-01-05T21:28:20Z

python/pyspark/sql/tests.py

+        for row in result1:
+            self.assertTrue(0.0 <= row.rand < 1.0)
+        for row in result2:
+            self.assertTrue(0.0 <= row.rand < 1.0)


Ideally, we should be checking that the optimizer doesn't cache any previous results. I think the non-pandas UDF test I linked above did that by taking the original non-deterministic data plus a constant and comparing it to the result of adding the same constant with a deterministic UDF.

Aha I see. Let me change the test.

I changed the test to be similar to the non-pandas one.
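
To illustrate the reasoning behind that check, here is a minimal sketch in plain Python. It is a toy model, not PySpark and not the test from this PR: `make_nondeterministic_udf`, `project_once`, `project_inlined`, and `invariant_holds` are hypothetical names introduced only for this example. The idea is that if a non-deterministic function is evaluated once per row and its value reused, then `plus_ten(rand) == rand + 10` holds; if a plan re-evaluates the function for each column that references it, the invariant breaks.

```python
import random


def make_nondeterministic_udf(seed=0):
    """Hypothetical stand-in for a non-deterministic UDF.

    Returns the function plus a call counter so callers can see how often it ran.
    """
    rng = random.Random(seed)
    calls = {"n": 0}

    def rand_udf(_row):
        calls["n"] += 1
        return rng.random()  # a fresh draw in [0.0, 1.0) on every call

    return rand_udf, calls


def plus_ten(value):
    """Deterministic function applied on top of the random column."""
    return value + 10


def project_once(rows, rand_udf):
    """Evaluate the UDF once per row and reuse the value for both columns."""
    result = []
    for row in rows:
        rand = rand_udf(row)
        result.append((rand, plus_ten(rand)))
    return result


def project_inlined(rows, rand_udf):
    """Toy model of a plan that inlines the UDF into each column,
    so it is evaluated twice per row."""
    return [(rand_udf(row), plus_ten(rand_udf(row))) for row in rows]


def invariant_holds(result):
    """The property the test asserts: plus_ten(rand) equals rand + 10 on every row."""
    return all(plus == rand + 10 for rand, plus in result)


if __name__ == "__main__":
    rows = range(5)
    udf_a, calls_a = make_nondeterministic_udf()
    print(invariant_holds(project_once(rows, udf_a)), calls_a["n"])
    udf_b, calls_b = make_nondeterministic_udf()
    print(invariant_holds(project_inlined(rows, udf_b)), calls_b["n"])
```

In this sketch the single-evaluation projection satisfies the invariant, while the inlined projection calls the function twice per row and is expected to violate it. A range check such as `0.0 <= row.rand < 1.0` would pass in both cases, which is why it cannot detect re-evaluation.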

SparkQA · 2018-01-05T21:46:51Z

Test build #85729 has finished for PR 20142 at commit 46c6ad7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-05T22:49:47Z

Test build #85731 has finished for PR 20142 at commit 0d8d943.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-05T22:56:09Z

Test build #85732 has finished for PR 20142 at commit b249bac.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-05T23:41:18Z

Test build #85735 has finished for PR 20142 at commit 2de3a37.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-01-06T04:10:36Z

python/pyspark/sql/tests.py

@@ -3567,6 +3580,18 @@ def tearDownClass(cls):
        time.tzset()
        ReusedSQLTestCase.tearDownClass()

+    @property
+    def random_udf(self):


Could we add "nondeterministic" in its name somehow?

Maybe nondeterministic_udf. That way we also avoid duplicating the name random_udf.

HyukjinKwon · 2018-01-06T04:16:48Z

LGTM except for the one minor comment

viirya · 2018-01-06T04:19:39Z

python/pyspark/sql/tests.py

@@ -3950,6 +3975,33 @@ def test_vectorized_udf_timestamps_respect_session_timezone(self):
        finally:
            self.spark.conf.set("spark.sql.session.timeZone", orig_tz)

+    def test_nondeterministic_udf(self):


test_vectorized_nondeterministic_udf

test_nondeterministic_vectorized_udf

viirya · 2018-01-06T04:19:52Z

python/pyspark/sql/tests.py

+        self.assertEqual(random_udf.deterministic, False)
+        self.assertTrue(result1['plus_ten(rand)'].equals(result1['rand'] + 10))
+
+    def test_nondeterministic_udf_in_aggregate(self):


test_vectorized_nondeterministic_udf_in_aggregate

test_nondeterministic_vectorized_udf_in_aggregate

viirya · 2018-01-06T04:22:09Z

python/pyspark/sql/tests.py

@@ -3567,6 +3580,18 @@ def tearDownClass(cls):
        time.tzset()
        ReusedSQLTestCase.tearDownClass()

+    @property
+    def random_udf(self):


Maybe nondeterministic_udf. That way we also avoid duplicating the name random_udf.

viirya · 2018-01-06T04:25:02Z

LGTM with minor comments regarding naming.

gatorsmile · 2018-01-06T08:05:48Z

Thanks! Merged to master/2.3

Will address the comments in my PR.

… for non-deterministic cases

## What changes were proposed in this pull request?

Add tests for using non deterministic UDFs in aggregate.

Update pandas_udf docstring w.r.t to determinism.

## How was this patch tested?

test_nondeterministic_udf_in_aggregate

Author: Li Jin <ice.xelloss@gmail.com>

Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.

(cherry picked from commit f2dd8b9)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Add test for using non deterministic udf in aggregate; Fix docstring …

46c6ad7

…of pandas_udf w.r.t determinism

icexelloss force-pushed the SPARK-22930-pandas-udf-deterministic branch from 1f4183f to 46c6ad7 Compare January 5, 2018 21:13

BryanCutler reviewed Jan 5, 2018

icexelloss added 2 commits January 5, 2018 17:26

Fix test_nondeterministic_udf

0d8d943

Small comment fix

b249bac

Remove pandas.testing

2de3a37

HyukjinKwon reviewed Jan 6, 2018

viirya reviewed Jan 6, 2018

asfgit closed this in f2dd8b9 Jan 6, 2018
