Skip to content

[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases #20142

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

icexelloss
Copy link
Contributor

What changes were proposed in this pull request?

Add tests for using non deterministic UDFs in aggregate.

Update pandas_udf docstring w.r.t to determinism.

How was this patch tested?

test_nondeterministic_udf_in_aggregate

@icexelloss
Copy link
Contributor Author

cc @gatorsmile

@SparkQA
Copy link

SparkQA commented Jan 3, 2018

Test build #85644 has finished for PR 20142 at commit 1f4183f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

Thanks for doing this @icexelloss ! Should we add a non-deterministic test for pandas_udf like here ?

@icexelloss
Copy link
Contributor Author

@BryanCutler Yeah I think we could. Let me add it.

@icexelloss icexelloss force-pushed the SPARK-22930-pandas-udf-deterministic branch from 1f4183f to 46c6ad7 Compare January 5, 2018 21:13
@icexelloss
Copy link
Contributor Author

I added the test. @gatorsmile do you have to take a look or let me know who should I ping for review?

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how strict we want the testing to be here, but we might want to verify that nonDeterministic is working correctly, not just that it's a valid pandas_udf

for row in result1:
self.assertTrue(0.0 <= row.rand < 1.0)
for row in result2:
self.assertTrue(0.0 <= row.rand < 1.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we should be checking that the optimizer doesn't cache any previous results. I think the non-pandas udf test I linked above did that by comparing the original non-deterministic data plus a constant to that of adding the same constant as a deterministic udf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha I see. Let me change the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the test to be similar to the non-pandas one.

@SparkQA
Copy link

SparkQA commented Jan 5, 2018

Test build #85729 has finished for PR 20142 at commit 46c6ad7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2018

Test build #85731 has finished for PR 20142 at commit 0d8d943.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2018

Test build #85732 has finished for PR 20142 at commit b249bac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2018

Test build #85735 has finished for PR 20142 at commit 2de3a37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -3567,6 +3580,18 @@ def tearDownClass(cls):
time.tzset()
ReusedSQLTestCase.tearDownClass()

@property
def random_udf(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add "nondeterministic" in its name somehow?

Copy link
Member

@viirya viirya Jan 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe nondeterministic_udf. So we don't have duplicate name to random_udf too.

@HyukjinKwon
Copy link
Member

LGTM except for the one minor comment

@@ -3950,6 +3975,33 @@ def test_vectorized_udf_timestamps_respect_session_timezone(self):
finally:
self.spark.conf.set("spark.sql.session.timeZone", orig_tz)

def test_nondeterministic_udf(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_vectorized_nondeterministic_udf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_nondeterministic_vectorized_udf

self.assertEqual(random_udf.deterministic, False)
self.assertTrue(result1['plus_ten(rand)'].equals(result1['rand'] + 10))

def test_nondeterministic_udf_in_aggregate(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_vectorized_nondeterministic_udf_in_aggregate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_nondeterministic_vectorized_udf_in_aggregate

@@ -3567,6 +3580,18 @@ def tearDownClass(cls):
time.tzset()
ReusedSQLTestCase.tearDownClass()

@property
def random_udf(self):
Copy link
Member

@viirya viirya Jan 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe nondeterministic_udf. So we don't have duplicate name to random_udf too.

@viirya
Copy link
Member

viirya commented Jan 6, 2018

LGTM with minor comments regarding naming.

@gatorsmile
Copy link
Member

gatorsmile commented Jan 6, 2018

Thanks! Merged to master/2.3

Will address the comments in my PR.

asfgit pushed a commit that referenced this pull request Jan 6, 2018
… for non-deterministic cases

## What changes were proposed in this pull request?

Add tests for using non deterministic UDFs in aggregate.

Update pandas_udf docstring w.r.t to determinism.

## How was this patch tested?
test_nondeterministic_udf_in_aggregate

Author: Li Jin <ice.xelloss@gmail.com>

Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.

(cherry picked from commit f2dd8b9)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in f2dd8b9 Jan 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants