-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test base #25101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #107474 has finished for PR 25101 at commit
|
retest this please |
Test build #107487 has finished for PR 25101 at commit
|
|
||
|
||
-- Except operation that will be replaced by left anti join | ||
SELECT * FROM t1 EXCEPT SELECT * FROM t2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explicitly select fields with udf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will change to
SELECT udf(k), udf(v) FROM t1 EXCEPT SELECT udf(k), udf(v) FROM t2;
FROM t1 | ||
WHERE t1.v <= (SELECT max(udf(t2.v)) | ||
FROM t2 | ||
WHERE t2.k = t1.k) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a udf here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change to (udf)t2.k = (udf)t1.k
-- Except operation that will be replaced by left anti join | ||
SELECT t1.k | ||
FROM t1 | ||
WHERE t1.v <= (SELECT max(udf(t2.v)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do udf(max(udf(...)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried udf(max(udf(t2.v)))
query 8 schema and output changed to
-- !query 8 schema
struct<>
-- !query 8 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)
The expected results are
-- !query 8 schema
struct<k:string>
-- !query 8 output
two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that, and comment the test with a new JIRA as guided in the parent JIRA (I think I wrote some words in the guide for this case as well). Actually, finding such cases and fixing it is one of the key points of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more tests on this max
. Here are the results:
-- !query 10
SELECT max(t2.v) FROM t2
-- !query 10 schema
struct<max(v):int>
-- !query 10 output
22
-- !query 11
SELECT udf(t2.v) FROM t2
-- !query 11 schema
struct<udf(v):string>
-- !query 11 output
1
22
5.0
5.0
nan
-- !query 12
SELECT max(udf(t2.v)) FROM t2
-- !query 12 schema
struct<max(udf(v)):string>
-- !query 12 output
nan
-- !query 13
SELECT udf(max(t2.v)) FROM t2
-- !query 13 schema
struct<udf(max(v)):string>
-- !query 13 output
22
-- !query 14
SELECT udf(max(udf(t2.v))) FROM t2
-- !query 14 schema
struct<udf(max(udf(v))):string>
-- !query 14 output
nan
-- !query 15
SELECT *
FROM t1
WHERE t1.v <= (SELECT max(t2.v)
FROM t2
WHERE t2.k = t1.k)
-- !query 15 schema
struct<k:string,v:int>
-- !query 15 output
one 1
two 2
-- !query 16
SELECT *
FROM t1
WHERE t1.v <= (SELECT udf(max(t2.v))
FROM t2
WHERE t2.k = t1.k)
-- !query 16 schema
struct<>
-- !query 16 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)
-- !query 17
SELECT *
FROM t1
WHERE t1.v <= (SELECT max(udf(t2.v))
FROM t2
WHERE t2.k = t1.k)
-- !query 17 schema
struct<k:string,v:int>
-- !query 17 output
two 2
-- !query 18
SELECT *
FROM t1
WHERE t1.v <= (SELECT udf(max(udf(t2.v)))
FROM t2
WHERE t2.k = t1.k)
-- !query 18 schema
struct<>
-- !query 18 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)
I initially thought max(t2.v) returns 22
is the right behavior, but after I looked the implementation of functions.max(e: Column)
, I am not sure any more. This functions.max(e: Column)
eventually calls this
nanSafeCompareDoubles
, and it treats NaN greater than any non-NaN double. So functions.max(t2.v)
returns NaN
. Is this right?
Should we let max(t2.v) returns 22
instead of NaN
?
/**
* NaN-safe version of `java.lang.Double.compare()` which allows NaN values to be compared
* according to semantics where NaN == NaN and NaN is greater than any non-NaN double.
*/
def nanSafeCompareDoubles(x: Double, y: Double): Int = {
val xIsNan: Boolean = java.lang.Double.isNaN(x)
val yIsNan: Boolean = java.lang.Double.isNaN(y)
if ((xIsNan && yIsNan) || (x == y)) 0
else if (xIsNan) 1
else if (yIsNan) -1
else if (x > y) 1
else -1
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huaxingao, can you file a JIRA with a reproducer, and skip the test here for now? Let's comment the tests out:
--- [SPARK-XXXX] JIRA title...
--- SELECT ..
---
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted #25204 to fix this issue.
@huaxingao, #25130 is merged. Can you sync this PR to master and rebase please? For #25101 (comment), you can file a JIRA and skip the test for now. |
retest this please |
Test build #107817 has finished for PR 25101 at commit
|
cc @viirya for a quick double check if you're available. |
Test build #107868 has finished for PR 25101 at commit
|
Test build #107869 has finished for PR 25101 at commit
|
Merged to master. |
Thanks for your help! @HyukjinKwon @viirya |
What changes were proposed in this pull request?
This PR adds some tests converted from
except.sql
to test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.Diff comparing to 'except.sql'
How was this patch tested?
Tested as guided in SPARK-27921.