-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases #20142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases #20142
Conversation
cc @gatorsmile |
Test build #85644 has finished for PR 20142 at commit
|
Thanks for doing this @icexelloss ! Should we add a non-deterministic test for |
@BryanCutler Yeah I think we could. Let me add it. |
…of pandas_udf w.r.t determinism
1f4183f
to
46c6ad7
Compare
I added the test. @gatorsmile do you have to take a look or let me know who should I ping for review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how strict we want the testing to be here, but we might want to verify that nonDeterministic is working correctly, not just that it's a valid pandas_udf
python/pyspark/sql/tests.py
Outdated
for row in result1: | ||
self.assertTrue(0.0 <= row.rand < 1.0) | ||
for row in result2: | ||
self.assertTrue(0.0 <= row.rand < 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should be checking that the optimizer doesn't cache any previous results. I think the non-pandas udf test I linked above did that by comparing the original non-deterministic data plus a constant to that of adding the same constant as a deterministic udf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha I see. Let me change the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the test to be similar to the non-pandas one.
Test build #85729 has finished for PR 20142 at commit
|
Test build #85731 has finished for PR 20142 at commit
|
Test build #85732 has finished for PR 20142 at commit
|
Test build #85735 has finished for PR 20142 at commit
|
@@ -3567,6 +3580,18 @@ def tearDownClass(cls): | |||
time.tzset() | |||
ReusedSQLTestCase.tearDownClass() | |||
|
|||
@property | |||
def random_udf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add "nondeterministic" in its name somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe nondeterministic_udf
. So we don't have duplicate name to random_udf
too.
LGTM except for the one minor comment |
@@ -3950,6 +3975,33 @@ def test_vectorized_udf_timestamps_respect_session_timezone(self): | |||
finally: | |||
self.spark.conf.set("spark.sql.session.timeZone", orig_tz) | |||
|
|||
def test_nondeterministic_udf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_vectorized_nondeterministic_udf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_nondeterministic_vectorized_udf
self.assertEqual(random_udf.deterministic, False) | ||
self.assertTrue(result1['plus_ten(rand)'].equals(result1['rand'] + 10)) | ||
|
||
def test_nondeterministic_udf_in_aggregate(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_vectorized_nondeterministic_udf_in_aggregate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_nondeterministic_vectorized_udf_in_aggregate
@@ -3567,6 +3580,18 @@ def tearDownClass(cls): | |||
time.tzset() | |||
ReusedSQLTestCase.tearDownClass() | |||
|
|||
@property | |||
def random_udf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe nondeterministic_udf
. So we don't have duplicate name to random_udf
too.
LGTM with minor comments regarding naming. |
Thanks! Merged to master/2.3 Will address the comments in my PR. |
… for non-deterministic cases ## What changes were proposed in this pull request? Add tests for using non deterministic UDFs in aggregate. Update pandas_udf docstring w.r.t to determinism. ## How was this patch tested? test_nondeterministic_udf_in_aggregate Author: Li Jin <ice.xelloss@gmail.com> Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic. (cherry picked from commit f2dd8b9) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
Add tests for using non deterministic UDFs in aggregate.
Update pandas_udf docstring w.r.t to determinism.
How was this patch tested?
test_nondeterministic_udf_in_aggregate