
[SPARK-23776][python][test] Check for needed components/files before running pyspark-sql tests #20909


Closed · wants to merge 7 commits

Conversation

bersprockets (Contributor)

What changes were proposed in this pull request?

Change pyspark-sql tests to check the following:

  • Spark was built with the Hive profile
  • Spark scala tests were compiled

If either condition is not met, throw an exception with a message explaining how to appropriately build Spark.

These checks are similar to the ones found in the pyspark-streaming tests.

These required files will be missing if you follow the sbt build instructions. They are less likely to be missing if you follow the mvn build instructions (mvn compiles the test scala files, and there are mvn build instructions for running the pyspark tests).
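For illustration, here is a minimal sketch of the kind of prerequisite check proposed (the helper name, glob pattern, and message text are assumptions for this sketch, not the PR's exact code):

import glob
import os

def check_hive_assembly_jars(spark_home):
    # Probe for the Hive jars that a -Phive build produces; the exact
    # path pattern here is assumed for illustration.
    pattern = os.path.join(
        spark_home, "assembly", "target", "scala-*", "jars", "*hive*.jar")
    if not glob.glob(pattern):
        raise RuntimeError(
            "Cannot find Hive jars. Build Spark with the Hive profile, "
            "e.g. 'build/sbt -Phive package' or "
            "'build/mvn -Phive -DskipTests package'.")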

How was this patch tested?

For sbt build:

  • run ./build/sbt package
  • run python/run-tests --modules "pyspark-sql" --python-executables python2.7
  • see failure, follow sbt instructions in exception message
  • run test again
  • see second failure (sbt only), follow sbt instructions in exception message
  • run test again, verify success
  • repeat for python3.4

For mvn build:

  • run ./build/mvn -DskipTests clean package
  • run python/run-tests --modules "pyspark-sql" --python-executables python2.7
  • see failure, follow mvn instructions in exception message
  • run test again, verify success
  • repeat for python3.4

SparkQA commented Mar 26, 2018

Test build #88606 has finished for PR 20909 at commit 8a965a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

Only a very small number of tests require Hive?
https://github.com/bersprockets/spark/blob/8a965a51be6190f0db864ca7b1ba37269b3a55bc/python/pyspark/sql/tests.py#L3004

and for these it skips automatically (does not fail) if the jar is not built with Hive, so I'm not sure we should raise an exception here
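A minimal sketch of that skip-instead-of-fail pattern (the _have_hive flag is hardcoded here for illustration; the real tests determine it by probing the JVM):

import unittest

# Assumed flag: in practice this would be set by probing for Hive support.
_have_hive = False

@unittest.skipIf(not _have_hive, "Spark was not built with Hive support")
class SomeHiveDependentTests(unittest.TestCase):
    def test_needs_hive(self):
        pass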

@bersprockets (Contributor, Author)

Thanks @felixcheung. It turns out HiveSparkSubmitTests will fail if Spark is not built with the hive profile (AssertionError: 0 != 1).

In addition, at least one pyspark.sql.readwriter docstring test fails.

This PR uses pyspark.sql.tests as the "leader of the pack" (run-tests.py gives it priority 0 amongst the pyspark.sql tests) to check for prerequisites for its own tests as well as for the sql docstring tests. The docstring tests can't make these checks themselves.

I modeled this after pyspark/streaming/tests.py, which checks for prereqs and raises exceptions with a useful message so one can get past the error (although pyspark/streaming/tests.py only checks for its own prereqs, not those required by streaming docstring tests).

@HyukjinKwon (Member)

@bersprockets, do you have the error messages? I could (will) check it myself in the following week, but I want to take a quick look if you already have them.

return len(files) > 0


def search_hive_assembly_jars():
Member:

Quick note: I think check_hive_assembly_jars would make more sense.

@felixcheung (Member)

Maybe the approach in HiveContextSQLTests is better for HiveSparkSubmitTests?
https://github.com/bersprockets/spark/blob/8a965a51be6190f0db864ca7b1ba37269b3a55bc/python/pyspark/sql/tests.py#L3112
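That approach boils down to probing for Hive in setUpClass and skipping the whole class rather than failing, roughly like this (a sketch of the pattern; the class name and probe details are assumed rather than quoted from tests.py):

import unittest

from py4j.protocol import Py4JError

from pyspark.tests import ReusedPySparkTestCase

class SomeHiveTests(ReusedPySparkTestCase):

    @classmethod
    def setUpClass(cls):
        ReusedPySparkTestCase.setUpClass()
        try:
            # Raises if Spark was built without the Hive profile.
            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
        except Py4JError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")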

@bersprockets (Contributor, Author)

@HyukjinKwon The HiveSparkSubmitTests error message is here

I propose the following:

  • Fix HiveSparkSubmitTests according to @felixcheung's suggestion. After that fix, tests.py won't need the checks.
  • Move the Hive assembly check to pyspark.sql.readwriter's _test() function.
  • Move the test UDF check to pyspark.sql.udf's _test() function.

SparkQA commented Mar 30, 2018

Test build #88736 has finished for PR 20909 at commit 0f830e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

> I modeled this after pyspark/streaming/tests.py, which checks for prereqs and raises exceptions with a useful message so one can get past the error (although pyspark/streaming/tests.py only checks for its own prereqs, not those required by streaming docstring tests).

I actually have been thinking about skipping and proceeding with the tests in pyspark/streaming/tests.py with an explicit message as well. Can we skip and continue the tests? I think we should basically just skip the tests.

@@ -2977,6 +2977,20 @@ def test_create_dateframe_from_pandas_with_dst(self):

class HiveSparkSubmitTests(SparkSubmitTests):

@classmethod
def setUpClass(cls):
Member:
I think this way is more correct, as @felixcheung pointed out.

@bersprockets (Contributor, Author)

> I actually have been thinking about skipping and proceeding with the tests in pyspark/streaming/tests.py with an explicit message as well. Can we skip and continue the tests?

Hi @HyukjinKwon, I just want to verify your comment: if the hive assembly is missing, readwriter.py should not fail, but should instead skip running its doctests. Also, in that case, there should be a message indicating that the tests were skipped.

@HyukjinKwon (Member)

Yes, I feel sure that's more consistent and correct.

@HyukjinKwon (Member)

@holdenk, I just saw https://issues.apache.org/jira/browse/SPARK-23853. I think this PR could fix that too, if I understood correctly :-).

@bersprockets (Contributor, Author)

@HyukjinKwon That makes sense.

Note that when the tests are run using python/run-tests, run-tests.py steals stdout and stderr. I would need to make a small change to run-tests.py to detect when a test is skipped (maybe through the return code) and print the message (from the test's stdout or stderr).

One other thing. I checked readwriter.py more closely, and there is only a single docstring test that requires Hive:

>>> spark.read.table('tmpTable').dtypes

I added # doctest: +SKIP to that one line and all the tests passed. Rather than sometimes skipping all readwriter tests, maybe we should just always skip that single test.
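For reference, the directive is attached inline to the example line it suppresses, along these lines (an illustrative docstring, not the actual readwriter.py source):

def table(self, tableName):
    """Returns the specified table as a DataFrame.

    >>> spark.read.table('tmpTable').dtypes  # doctest: +SKIP
    """
    pass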

udf.py, on the other hand, has lots of docstring tests that require the test udf files.

@HyukjinKwon (Member)

Yea, I know about the hidden output in the console, and I believe that's a known issue. In my case, I made such a change before - #20487. Also see the discussion in #20465.

The thing is, it needs duplicated changes to print out the warnings, and that's why I have been hesitant to fix the related code paths.

Actually, I was thinking we should resemble what we do in streaming.py to skip the doctests, although I haven't taken a close look yet to check whether we can control this at the function level.

I know we use # doctest: +SKIP here and there, in particular with Pandas / Arrow. I think we should basically remove those and test them the same way when possible.

I am sure about this too (and told a few committers before that I am thinking this way). Let me cc @cloud-fan, @ueshin and @BryanCutler FYI.

For the best, can you investigate and try to explicitly skip some doctests conditionally? As for the console output from our test script, I think we can handle that separately (but please leave a comment as a TODO or file a JIRA).
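One concrete way to skip a doctest conditionally, which the diff commented on further below ends up trying, is to blank out the relevant docstring in the module's _test() before doctest collects it (a sketch; hive_available is an assumed flag, and the __dict__ lookup reaches the raw function object so its __doc__ is writable on Python 2):

import doctest
import sys

import pyspark.sql.readwriter

hive_available = False  # assumed flag; the real check would probe the build

def _test():
    if not hive_available:
        # With an empty docstring, doctest collects nothing from table().
        pyspark.sql.readwriter.DataFrameReader.__dict__["table"].__doc__ = ""
    # The real _test() also supplies globs with a SparkSession; omitted here.
    (failure_count, test_count) = doctest.testmod(pyspark.sql.readwriter)
    if failure_count:
        sys.exit(-1)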

@bersprockets (Contributor, Author)

> can you investigate and try to explicitly skip some doctests conditionally?

@HyukjinKwon I will take a look to see how that can be done.

SparkQA commented Apr 4, 2018

Test build #88863 has finished for PR 20909 at commit db14acb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# have been skipped.
m = pyspark.sql.udf
m.__dict__["UDFRegistration"].__dict__["registerJavaFunction"].__doc__ = ""
m.__dict__["UDFRegistration"].__dict__["registerJavaUDAF"].__doc__ = ""
Member:

Ah, hmm... yea, this one was the last resort I was thinking of... let me investigate other possible ways for a few more days.

@HyukjinKwon (Member)

cc @viirya too.

# has been skipped.
m = pyspark.sql.readwriter
m.__dict__["DataFrameReader"].__dict__["table"].__doc__ = ""

@dongjoon-hyun (Member) commented Apr 24, 2018:

Thank you for the pointer, @HyukjinKwon.

For readwriter.py, we had better test without Hive. What do you think, @HyukjinKwon and @bersprockets?

- spark = SparkSession.builder.enableHiveSupport().getOrCreate()
+ spark = SparkSession.builder.getOrCreate()

Member:

Yup, it looks better.

@bersprockets (Contributor, Author):

@dongjoon-hyun Sounds good. That change will be done in PR #21141, correct?

Member:

@bersprockets I was thinking like that, but wanted to ask your thoughts given this PR. I am okay with either way.

@bersprockets (Contributor, Author):

@HyukjinKwon @dongjoon-hyun I agree, it should go in PR #21141.

@HyukjinKwon (Member)

ok to test

SparkQA commented Jun 9, 2018

Test build #91612 has finished for PR 20909 at commit db14acb.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@bersprockets (Contributor, Author)

@HyukjinKwon This PR is mostly obsolete. I will close it and re-open something smaller... maybe a one-line documentation change to handle the missing UDF case for those who build with sbt.

@bersprockets bersprockets deleted the SPARK-23776 branch January 31, 2019 18:43