[SPARK-23776][python][test] Check for needed components/files before running pyspark-sql tests #20909
Conversation
Test build #88606 has finished for PR 20909 at commit
only a very small number of tests require hive?
https://github.com/bersprockets/spark/blob/8a965a51be6190f0db864ca7b1ba37269b3a55bc/python/pyspark/sql/tests.py#L3004
and for these the tests skip automatically (not fail) if the jar is not built with Hive, so I'm not sure we should raise an exception here
Thanks @felixcheung. Turns out HiveSparkSubmitTests will fail if Spark is not built with the hive profile (AssertionError: 0 != 1). In addition, at least one pyspark.sql.readwriter docstring test fails. This PR uses pyspark.sql.tests as the "leader of the pack" (run-tests.py gives it priority 0 amongst the pyspark.sql tests) to check for prerequisites for its own tests as well as the sql docstring tests. The docstring tests can't make these checks. I modeled this after pyspark/streaming/tests.py, which checks for prereqs and raises exceptions with a useful message so one can get past the error (although pyspark/streaming/tests.py only checks for its own prereqs, not those required by streaming docstring tests).
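For illustration, a minimal sketch of that kind of prerequisite check follows; the jar location, the SPARK_HOME fallback, and the message wording are assumptions for this sketch, not the PR's actual code:

import glob
import os

# Hypothetical prerequisite check run once before the pyspark-sql tests start,
# modeled loosely on the checks in pyspark/streaming/tests.py.
spark_home = os.environ.get("SPARK_HOME", os.getcwd())
# Assumed location of the jars produced by a -Phive build.
hive_jars = glob.glob(os.path.join(
    spark_home, "assembly", "target", "scala-*", "jars", "*hive*.jar"))
if not hive_jars:
    raise Exception(
        "Spark does not appear to be built with the Hive profile, which some "
        "pyspark-sql tests require; rebuild with -Phive and re-run the tests.")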
@bersprockets, do you have the error messages? I could (will) check it by myself in the following week but want to take a quick look if you already have them.
python/pyspark/sql/tests.py
return len(files) > 0

def search_hive_assembly_jars():
Quick note: I think check_hive_assembly_jars would make more sense.
maybe the approach in hiveContextSQLTests is better for HiveSparkSubmitTests?
@HyukjinKwon The HiveSparkSubmitTests error message is here. I propose the following:
Test build #88736 has finished for PR 20909 at commit
I actually have been thinking about skipping and proceeding with the tests in
@@ -2977,6 +2977,20 @@ def test_create_dateframe_from_pandas_with_dst(self):

class HiveSparkSubmitTests(SparkSubmitTests):

    @classmethod
    def setUpClass(cls):
I think this way is more correct, as @felixcheung pointed out.
Hi @HyukjinKwon, I just want to verify your comment: if the hive assembly is missing, readwriter.py should not fail, but instead skip running its doctests. Also, in that case, there should be a message indicating that the tests were skipped.
Yes, I feel sure that's more consistent and correct.
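For illustration, the skip-instead-of-fail approach could look roughly like this inside python/pyspark/sql/tests.py (SparkSubmitTests already lives in that file; check_hive_assembly_jars is an assumed helper that returns True only for a -Phive build):

import unittest


class HiveSparkSubmitTests(SparkSubmitTests):

    @classmethod
    def setUpClass(cls):
        # Skip the whole class instead of failing with "AssertionError: 0 != 1"
        # when Spark was not built with the Hive profile.
        if not check_hive_assembly_jars():  # assumed helper defined earlier in the file
            raise unittest.SkipTest("Spark was not built with -Phive; skipping these tests.")
        super(HiveSparkSubmitTests, cls).setUpClass()

Raising unittest.SkipTest from setUpClass marks every test in the class as skipped, which shows up in the test report instead of an opaque failure.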
@holdenk, I just saw https://issues.apache.org/jira/browse/SPARK-23853. I think this PR could fix it together if I understood correctly :-).
@HyukjinKwon That makes sense. Note that when the tests are run using python/run-tests, run-tests.py steals stdout and stderr. I would need to make a small change to run-tests.py to detect when a test is skipped (maybe through retcode) and print the message (from the test's stdout or stderr). One other thing: I checked readwriter.py more closely, and there is only a single docstring test that requires Hive:
>>> spark.read.table('tmpTable').dtypes
udf.py, on the other hand, has lots of docstring tests that require the test udf files.
Yea, I know about the hidden output in the console and I believe that's a known issue. In my case, I made such a change before - #20487. Also see the discussion in #20465. The thing is, it needs duplicated changes to print out the warnings, and that's why I have been hesitant to fix the related code paths. Actually, I was thinking we should resemble what we do in ... I know we use ... I am sure on this too (and told a few committers before that I am thinking in this way). Let me cc @cloud-fan, @ueshin and @BryanCutler FYI. For the best, can you investigate and try to explicitly skip some doctests conditionally? For console output from our test script, I think we can do this separately (but please leave a comment as a todo or a JIRA).
@HyukjinKwon I will take a look to see how that can be done.
Test build #88863 has finished for PR 20909 at commit
# have been skipped.
m = pyspark.sql.udf
m.__dict__["UDFRegistration"].__dict__["registerJavaFunction"].__doc__ = ""
m.__dict__["UDFRegistration"].__dict__["registerJavaUDAF"].__doc__ = ""
ah.. hmm... yea this one was the last resort I was thinking ... let me investigate other possible ways for some more days.
cc @viirya too.
# has been skipped.
m = pyspark.sql.readwriter
m.__dict__["DataFrameReader"].__dict__["table"].__doc__ = ""
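For context, a sketch of how that blanking might be guarded inside readwriter.py's _test() before doctest.testmod runs; the function name, the flag parameter, and the printed message are illustrative, not the PR's exact code:

def _skip_hive_doctests_if_needed(hive_available):
    # Sketch: when Hive is unavailable, blank the Hive-dependent docstring so
    # doctest.testmod finds no examples to run for it, and say so on stdout.
    if hive_available:
        return
    print("Spark was not built with -Phive; skipping Hive-dependent doctests.")
    import pyspark.sql.readwriter as rw
    # Go through the class __dict__ so we get the raw function, whose __doc__ is writable.
    rw.__dict__["DataFrameReader"].__dict__["table"].__doc__ = ""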
Thank you for the pointer, @HyukjinKwon. For readwriter.py, we had better test without Hive. What do you think, @HyukjinKwon and @bersprockets?
- spark = SparkSession.builder.enableHiveSupport().getOrCreate()
+ spark = SparkSession.builder.getOrCreate()
Yup, it looks better.
@dongjoon-hyun Sounds good. That change will be done in PR #21141, correct?
@bersprockets I was thinking like that but wanted to ask your thoughts for this PR. I am okay with either way.
@HyukjinKwon @dongjoon-hyun I agree, it should go in PR #21141.
ok to test
Test build #91612 has finished for PR 20909 at commit
@HyukjinKwon This PR is mostly obsolete. I will close it and re-open something smaller... maybe a one-line documentation change to handle the missing UDF case for those who build with sbt. |
What changes were proposed in this pull request?
Change pyspark-sql tests to check the following:
If either condition is not met, throw an exception with a message explaining how to appropriately build Spark.
These checks are similar to the ones found in the pyspark-streaming tests.
These required files will be missing if you follow the sbt build instructions. They are less likely to be missing if you follow the mvn build instructions (mvn compiles the test Scala files, and there are mvn build instructions for running the pyspark tests).
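As an illustration of these checks, here is a sketch of a test-UDF-file check; the path pattern, helper name, and message are assumptions, and the real PR may look for different files:

import glob
import os


def check_test_udf_files():
    # Sketch: return True if the Java/Scala test UDF artifacts needed by the
    # udf.py docstring tests appear to have been built. The location below is
    # an assumption, not necessarily what this PR checks.
    spark_home = os.environ.get("SPARK_HOME", os.getcwd())
    pattern = os.path.join(spark_home, "sql", "core", "target", "*", "*TestUDF*.jar")
    return len(glob.glob(pattern)) > 0


if not check_test_udf_files():
    raise Exception(
        "Missing the test UDF files needed by some pyspark-sql tests. "
        "Build the SQL test classes first (see the PySpark testing instructions "
        "for the exact sbt or mvn command) and re-run the tests.")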
How was this patch tested?
For sbt build:
For mvn build: