
[SPARK-23776][python][test] Check for needed components/files before running pyspark-sql tests #20909


Closed · wants to merge 7 commits

Conversation

bersprockets (Contributor)

What changes were proposed in this pull request?

Change pyspark-sql tests to check the following:

  • Spark was built with the Hive profile
  • Spark scala tests were compiled

If either condition is not met, throw an exception with a message explaining how to appropriately build Spark.

These checks are similar to the ones found in the pyspark-streaming tests.

These required files will be missing if you follow the sbt build instructions. They are less likely to be missing if you follow the mvn build instructions (mvn compiles the test scala files, and there are mvn build instructions for running the pyspark tests).
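For illustration, here is a minimal sketch of the kind of prerequisite check proposed (the helper name, glob pattern, and message text are assumptions for this sketch, not the PR's exact code):

import glob
import os

def check_hive_assembly_jars(spark_home):
    # Probe for the Hive jars that a -Phive build produces; the exact
    # path pattern here is assumed for illustration.
    pattern = os.path.join(
        spark_home, "assembly", "target", "scala-*", "jars", "*hive*.jar")
    if not glob.glob(pattern):
        raise RuntimeError(
            "Cannot find Hive jars. Build Spark with the Hive profile, "
            "e.g. 'build/sbt -Phive package' or "
            "'build/mvn -Phive -DskipTests package'.")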

How was this patch tested?

For sbt build:

  • run ./build/sbt package
  • run python/run-tests --modules "pyspark-sql" --python-executables python2.7
  • see failure, follow sbt instructions in exception message
  • run test again
  • see second failure (sbt only), follow sbt instructions in exception message
  • run test again, verify success
  • repeat for python3.4

For mvn build:

  • run ./build/mvn -DskipTests clean package
  • run python/run-tests --modules "pyspark-sql" --python-executables python2.7
  • see failure, follow mvn instructions in exception message
  • run test again, verify success
  • repeat for python3.4

SparkQA commented Mar 26, 2018

Test build #88606 has finished for PR 20909 at commit 8a965a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

Only a very small number of tests require Hive?
https://github.com/bersprockets/spark/blob/8a965a51be6190f0db864ca7b1ba37269b3a55bc/python/pyspark/sql/tests.py#L3004

and for these it skips automatically (does not fail) if the jar is not built with Hive, so I'm not sure we should raise an exception here
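A minimal sketch of that skip-instead-of-fail pattern (the _have_hive flag is hardcoded here for illustration; the real tests determine it by probing the JVM):

import unittest

# Assumed flag: in practice this would be set by probing for Hive support.
_have_hive = False

@unittest.skipIf(not _have_hive, "Spark was not built with Hive support")
class SomeHiveDependentTests(unittest.TestCase):
    def test_needs_hive(self):
        pass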

@bersprockets (Contributor, Author)

Thanks @felixcheung. It turns out HiveSparkSubmitTests will fail if Spark is not built with the hive profile (AssertionError: 0 != 1).

In addition, at least one pyspark.sql.readwriter docstring test fails.

This PR uses pyspark.sql.tests as the "leader of the pack" (run-tests.py gives it priority 0 amongst the pyspark.sql tests) to check for prerequisites for its own tests as well as for the sql docstring tests. The docstring tests can't make these checks themselves.

I modeled this after pyspark/streaming/tests.py, which checks for prereqs and raises exceptions with a useful message so one can get past the error (although pyspark/streaming/tests.py only checks for its own prereqs, not those required by streaming docstring tests).

@HyukjinKwon (Member)

@bersprockets, do you have the error messages? I could (will) check it myself in the following week, but I want to take a quick look if you already have them.

return len(files) > 0


def search_hive_assembly_jars():
Member:

Quick note: I think check_hive_assembly_jars would make more sense.

@felixcheung (Member)

Maybe the approach in HiveContextSQLTests is better for HiveSparkSubmitTests?
https://github.com/bersprockets/spark/blob/8a965a51be6190f0db864ca7b1ba37269b3a55bc/python/pyspark/sql/tests.py#L3112
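That approach boils down to probing for Hive in setUpClass and skipping the whole class rather than failing, roughly like this (a sketch of the pattern; the class name and probe details are assumed rather than quoted from tests.py):

import unittest

from py4j.protocol import Py4JError

from pyspark.tests import ReusedPySparkTestCase

class SomeHiveTests(ReusedPySparkTestCase):

    @classmethod
    def setUpClass(cls):
        ReusedPySparkTestCase.setUpClass()
        try:
            # Raises if Spark was built without the Hive profile.
            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
        except Py4JError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")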

@bersprockets (Contributor, Author)

@HyukjinKwon The HiveSparkSubmitTests error message is here

I propose the following:

  • Fix HiveSparkSubmitTests according to @felixcheung's suggestion. After that fix, tests.py won't need the checks.
  • Move the Hive assembly check to pyspark.sql.readwriter's _test() function.
  • Move the test UDF check to pyspark.sql.udf's _test() function.

SparkQA commented Mar 30, 2018

Test build #88736 has finished for PR 20909 at commit 0f830e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

> I modeled this after pyspark/streaming/tests.py, which checks for prereqs and raises exceptions with a useful message so one can get past the error (although pyspark/streaming/tests.py only checks for its own prereqs, not those required by streaming docstring tests).

I actually have been thinking about skipping and proceeding with the tests in pyspark/streaming/tests.py with an explicit message as well. Can we skip and continue the tests? I think we should basically just skip the tests.

@@ -2977,6 +2977,20 @@ def test_create_dateframe_from_pandas_with_dst(self):

class HiveSparkSubmitTests(SparkSubmitTests):

@classmethod
def setUpClass(cls):
Member:
I think this way is more correct, as @felixcheung pointed out.

@bersprockets (Contributor, Author)

> I actually have been thinking about skipping and proceeding with the tests in pyspark/streaming/tests.py with an explicit message as well. Can we skip and continue the tests?

Hi @HyukjinKwon, I just want to verify your comment: if the hive assembly is missing, readwriter.py should not fail, but should instead skip running its doctests. Also, in that case, there should be a message indicating that the tests were skipped.

@HyukjinKwon (Member)

Yes, I feel sure that's more consistent and correct.

@HyukjinKwon (Member)

@holdenk, I just saw https://issues.apache.org/jira/browse/SPARK-23853. I think this PR could fix that too, if I understood correctly :-).

@bersprockets (Contributor, Author)

@HyukjinKwon That makes sense.

Note that when the tests are run using python/run-tests, run-tests.py steals stdout and stderr. I would need to make a small change to run-tests.py to detect when a test is skipped (maybe through the return code) and print the message (from the test's stdout or stderr).

One other thing. I checked readwriter.py more closely, and there is only a single docstring test that requires Hive:

>>> spark.read.table('tmpTable').dtypes

I added # doctest: +SKIP to that one line and all the tests passed. Rather than sometimes skipping all readwriter tests, maybe we should just always skip that single test.
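For reference, the directive is attached inline to the example line it suppresses, along these lines (an illustrative docstring, not the actual readwriter.py source):

def table(self, tableName):
    """Returns the specified table as a DataFrame.

    >>> spark.read.table('tmpTable').dtypes  # doctest: +SKIP
    """
    pass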

udf.py, on the other hand, has lots of docstring tests that require the test udf files.

@HyukjinKwon (Member)

Yea, I know about the hidden output in the console, and I believe that's a known issue. In my case, I made such a change before - #20487. Also see the discussion in #20465.

The thing is, it needs duplicated changes to print out the warnings, and that's why I have been hesitant to fix the related code paths.

Actually, I was thinking we should resemble what we do in streaming.py to skip the doctests, although I haven't taken a close look yet to check whether we can control this at the function level.

I know we use # doctest: +SKIP here and there, in particular with Pandas / Arrow. I think we should basically remove those and test them the same way when possible.

I am sure about this too (and told a few committers before that I am thinking this way). Let me cc @cloud-fan, @ueshin and @BryanCutler FYI.

For the best, can you investigate and try to explicitly skip some doctests conditionally? As for the console output from our test script, I think we can handle that separately (but please leave a comment as a TODO or file a JIRA).
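One concrete way to skip a doctest conditionally, which the diff commented on further below ends up trying, is to blank out the relevant docstring in the module's _test() before doctest collects it (a sketch; hive_available is an assumed flag, and the __dict__ lookup reaches the raw function object so its __doc__ is writable on Python 2):

import doctest
import sys

import pyspark.sql.readwriter

hive_available = False  # assumed flag; the real check would probe the build

def _test():
    if not hive_available:
        # With an empty docstring, doctest collects nothing from table().
        pyspark.sql.readwriter.DataFrameReader.__dict__["table"].__doc__ = ""
    # The real _test() also supplies globs with a SparkSession; omitted here.
    (failure_count, test_count) = doctest.testmod(pyspark.sql.readwriter)
    if failure_count:
        sys.exit(-1)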

@bersprockets (Contributor, Author)

> can you investigate and try to explicitly skip some doctests conditionally?

@HyukjinKwon I will take a look to see how that can be done.

SparkQA commented Apr 4, 2018

Test build #88863 has finished for PR 20909 at commit db14acb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# have been skipped.
m = pyspark.sql.udf
m.__dict__["UDFRegistration"].__dict__["registerJavaFunction"].__doc__ = ""
m.__dict__["UDFRegistration"].__dict__["registerJavaUDAF"].__doc__ = ""
Member:

Ah, hmm... yea, this one was the last resort I was thinking of... let me investigate other possible ways for a few more days.

@HyukjinKwon (Member)

cc @viirya too.

# has been skipped.
m = pyspark.sql.readwriter
m.__dict__["DataFrameReader"].__dict__["table"].__doc__ = ""

@dongjoon-hyun (Member) commented Apr 24, 2018:

Thank you for the pointer, @HyukjinKwon.

For readwriter.py, we had better test without Hive. What do you think, @HyukjinKwon and @bersprockets?

- spark = SparkSession.builder.enableHiveSupport().getOrCreate()
+ spark = SparkSession.builder.getOrCreate()

Member:

Yup, it looks better.

@bersprockets (Contributor, Author):

@dongjoon-hyun Sounds good. That change will be done in PR #21141, correct?

Member:

@bersprockets I was thinking like that, but wanted to ask your thoughts given this PR. I am okay with either way.

@bersprockets (Contributor, Author):

@HyukjinKwon @dongjoon-hyun I agree, it should go in PR #21141.

@HyukjinKwon (Member)

ok to test

SparkQA commented Jun 9, 2018

Test build #91612 has finished for PR 20909 at commit db14acb.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@bersprockets (Contributor, Author)

@HyukjinKwon This PR is mostly obsolete. I will close it and re-open something smaller... maybe a one-line documentation change to handle the missing UDF case for those who build with sbt.

@bersprockets bersprockets deleted the SPARK-23776 branch January 31, 2019 18:43