Skip to content

[SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test base #25101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Jul 10, 2019

What changes were proposed in this pull request?

This PR adds some tests converted from except.sql to test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.

Diff comparing to 'except.sql'

diff --git a/sql/core/src/test/resources/sql-tests/results/except.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
index c9b712d4d2..27ca7ea226 100644
--- a/sql/core/src/test/resources/sql-tests/results/except.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
@@ -30,16 +30,16 @@ struct<>
 
 
 -- !query 2
-SELECT * FROM t1 EXCEPT SELECT * FROM t2
+SELECT udf(k), udf(v) FROM t1 EXCEPT SELECT udf(k), udf(v) FROM t2
 -- !query 2 schema
-struct<k:string,v:int>
+struct<CAST(udf(cast(k as string)) AS STRING):string,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 2 output
 three  3
 two    2
 
 
 -- !query 3
-SELECT * FROM t1 EXCEPT SELECT * FROM t1 where v <> 1 and v <> 2
+SELECT * FROM t1 EXCEPT SELECT * FROM t1 where udf(v) <> 1 and v <> udf(2)
 -- !query 3 schema
 struct<k:string,v:int>
 -- !query 3 output
@@ -49,7 +49,7 @@ two   2
 
 
 -- !query 4
-SELECT * FROM t1 where v <> 1 and v <> 22 EXCEPT SELECT * FROM t1 where v <> 2 and v >= 3
+SELECT * FROM t1 where udf(v) <> 1 and v <> udf(22) EXCEPT SELECT * FROM t1 where udf(v) <> 2 and v >= udf(3)
 -- !query 4 schema
 struct<k:string,v:int>
 -- !query 4 output
@@ -59,7 +59,7 @@ two   2
 -- !query 5
 SELECT t1.* FROM t1, t2 where t1.k = t2.k
 EXCEPT
-SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != 'one'
+SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != udf('one')
 -- !query 5 schema
 struct<k:string,v:int>
 -- !query 5 output
@@ -68,7 +68,7 @@ one   NULL
 
 
 -- !query 6
-SELECT * FROM t2 where v >= 1 and v <> 22 EXCEPT SELECT * FROM t1
+SELECT * FROM t2 where v >= udf(1) and udf(v) <> 22 EXCEPT SELECT * FROM t1
 -- !query 6 schema
 struct<k:string,v:int>
 -- !query 6 output
@@ -77,9 +77,9 @@ one   5
 
 
 -- !query 7
-SELECT (SELECT min(k) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
+SELECT (SELECT min(udf(k)) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
 MINUS
-SELECT (SELECT min(k) FROM t2) abs_min_t2 FROM t1 WHERE  t1.k = 'one'
+SELECT (SELECT udf(min(k)) FROM t2) abs_min_t2 FROM t1 WHERE  t1.k = udf('one')
 -- !query 7 schema
 struct<min_t2:string>
 -- !query 7 output
@@ -90,16 +90,17 @@ two
 -- !query 8
 SELECT t1.k
 FROM   t1
-WHERE  t1.v <= (SELECT   max(t2.v)
+WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
                 FROM     t2
-                WHERE    t2.k = t1.k)
+                WHERE    udf(t2.k) = udf(t1.k))
 MINUS
 SELECT t1.k
 FROM   t1
-WHERE  t1.v >= (SELECT   min(t2.v)
+WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
                 FROM     t2
                 WHERE    t2.k = t1.k)
 -- !query 8 schema
-struct<k:string>
+struct<>
 -- !query 8 output
-two
+java.lang.UnsupportedOperationException
+Cannot evaluate expression: udf(cast(null as string))

How was this patch tested?

Tested as guided in SPARK-27921.

@SparkQA
Copy link

SparkQA commented Jul 10, 2019

Test build #107474 has finished for PR 25101 at commit fca03a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

retest this please

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28277][SQL][PYTHON][TESTS]Convert and port 'except.sql' into UDF test base [SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test base Jul 10, 2019
@SparkQA
Copy link

SparkQA commented Jul 11, 2019

Test build #107487 has finished for PR 25101 at commit fca03a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



-- Except operation that will be replaced by left anti join
SELECT * FROM t1 EXCEPT SELECT * FROM t2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly select fields with udf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will change to

SELECT udf(k), udf(v) FROM t1 EXCEPT SELECT udf(k), udf(v) FROM t2;

FROM t1
WHERE t1.v <= (SELECT max(udf(t2.v))
FROM t2
WHERE t2.k = t1.k)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a udf here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change to (udf)t2.k = (udf)t1.k

-- Except operation that will be replaced by left anti join
SELECT t1.k
FROM t1
WHERE t1.v <= (SELECT max(udf(t2.v))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do udf(max(udf(...)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried udf(max(udf(t2.v)))
query 8 schema and output changed to

-- !query 8 schema
struct<>
-- !query 8 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)

The expected results are

-- !query 8 schema
struct<k:string>
-- !query 8 output
two

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that, and comment the test with a new JIRA as guided in the parent JIRA (I think I wrote some words in the guide for this case as well). Actually, finding such cases and fixing it is one of the key points of doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more tests on this max. Here are the results:

-- !query 10
SELECT max(t2.v) FROM t2
-- !query 10 schema
struct<max(v):int>
-- !query 10 output
22


-- !query 11
SELECT udf(t2.v) FROM t2
-- !query 11 schema
struct<udf(v):string>
-- !query 11 output
1
22
5.0
5.0
nan


-- !query 12
SELECT max(udf(t2.v)) FROM t2
-- !query 12 schema
struct<max(udf(v)):string>
-- !query 12 output
nan


-- !query 13
SELECT udf(max(t2.v)) FROM t2
-- !query 13 schema
struct<udf(max(v)):string>
-- !query 13 output
22


-- !query 14
SELECT udf(max(udf(t2.v))) FROM t2
-- !query 14 schema
struct<udf(max(udf(v))):string>
-- !query 14 output
nan


-- !query 15
SELECT *
FROM   t1
WHERE  t1.v <= (SELECT   max(t2.v)
                FROM     t2
                WHERE    t2.k = t1.k)
-- !query 15 schema
struct<k:string,v:int>
-- !query 15 output
one	1
two	2


-- !query 16
SELECT *
FROM   t1
WHERE  t1.v <= (SELECT   udf(max(t2.v))
                FROM     t2
                WHERE    t2.k = t1.k)
-- !query 16 schema
struct<>
-- !query 16 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)


-- !query 17
SELECT *
FROM   t1
WHERE  t1.v <= (SELECT   max(udf(t2.v))
                FROM     t2
                WHERE    t2.k = t1.k)
-- !query 17 schema
struct<k:string,v:int>
-- !query 17 output
two	2


-- !query 18
SELECT *
FROM   t1
WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
                FROM     t2
                WHERE    t2.k = t1.k)
-- !query 18 schema
struct<>
-- !query 18 output
java.lang.UnsupportedOperationException
Cannot evaluate expression: udf(null)

I initially thought max(t2.v) returns 22 is the right behavior, but after I looked the implementation of functions.max(e: Column), I am not sure any more. This functions.max(e: Column) eventually calls this
nanSafeCompareDoubles, and it treats NaN greater than any non-NaN double. So functions.max(t2.v) returns NaN. Is this right?
Should we let max(t2.v) returns 22 instead of NaN?

  /**
   * NaN-safe version of `java.lang.Double.compare()` which allows NaN values to be compared
   * according to semantics where NaN == NaN and NaN is greater than any non-NaN double.
   */
def nanSafeCompareDoubles(x: Double, y: Double): Int = {
    val xIsNan: Boolean = java.lang.Double.isNaN(x)
    val yIsNan: Boolean = java.lang.Double.isNaN(y)
    if ((xIsNan && yIsNan) || (x == y)) 0
    else if (xIsNan) 1
    else if (yIsNan) -1
    else if (x > y) 1
    else -1
  }

Copy link
Member

@HyukjinKwon HyukjinKwon Jul 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huaxingao, can you file a JIRA with a reproducer, and skip the test here for now? Let's comment the tests out:

--- [SPARK-XXXX] JIRA title...
--- SELECT ..
---

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted #25204 to fix this issue.

@HyukjinKwon
Copy link
Member

@huaxingao, #25130 is merged. Can you sync this PR to master and rebase please?

For #25101 (comment), you can file a JIRA and skip the test for now.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107817 has finished for PR 25101 at commit fca03a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

cc @viirya for a quick double check if you're available.

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107868 has finished for PR 25101 at commit 6e01a86.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107869 has finished for PR 25101 at commit d956d37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@huaxingao
Copy link
Contributor Author

Thanks for your help! @HyukjinKwon @viirya

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants