[SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) #20534

HyukjinKwon · 2018-02-07T14:35:44Z

This PR backports #20487 to branch-2.3.

…n PySpark tests (to skip or test)

This PR proposes to explicitly specify the Pandas and PyArrow versions in PySpark tests, so that each test is either skipped or run depending on what is installed.

We declared the extra dependencies:

https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204

In case of PyArrow:

Currently we only check whether pyarrow is installed, without checking its version, so the tests already fail to run with an older PyArrow. For example, if PyArrow 0.7.0 is installed:

```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```

In case of Pandas:

There are a few tests for old Pandas that were run only when the installed Pandas version was lower than required. I rewrote them so they run both when the Pandas version is lower and when Pandas is missing.

Manually tested by modifying the condition:

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20487 from HyukjinKwon/pyarrow-pandas-skip.

(cherry picked from commit 71cfba0)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

SparkQA · 2018-02-07T18:12:38Z

Test build #87165 has finished for PR 20534 at commit ff9ba5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-02-08T00:31:22Z

retest this please

ueshin · 2018-02-08T01:59:21Z

LGTM.

ueshin · 2018-02-08T02:01:35Z

@HyukjinKwon You can include the fix #20538 or backport it after this is merged. It's up to you.

HyukjinKwon · 2018-02-08T02:15:08Z

Yea, will deal with it. Thanks for the reminder!

SparkQA · 2018-02-08T03:42:26Z

Test build #87180 has finished for PR 20534 at commit ff9ba5e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-02-08T03:45:16Z

Let me just pick up the followup here.

## What changes were proposed in this pull request?

This is a followup PR of apache#20487.

When importing a module that doesn't exist, the error message differs slightly between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So one test that checks for an import error fails in Python 3 without pandas. This PR fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#20538 from ueshin/issues/SPARK-23319/fup1.
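As an illustration of the Python 2 vs 3 difference described above, one way to make a check tolerant of both message forms is to treat the quotes as optional. This is only a sketch and not necessarily how #20538 fixes the test; `NO_MODULE_PATTERN`, `missing_module_name`, and the module name used in the demo are hypothetical.

```python
import importlib
import re

# Quotes around the module name are optional, so both the Python 2 form
# ("No module named pandas") and the Python 3 form ("No module named 'pandas'") match.
NO_MODULE_PATTERN = re.compile(r"No module named '?(?P<name>[\w.]+)'?")


def missing_module_name(error):
    """Hypothetical helper: extract the module name from an ImportError message."""
    match = NO_MODULE_PATTERN.search(str(error))
    return match.group("name") if match else None


# Trigger a real ImportError with a module name that is assumed not to exist.
try:
    importlib.import_module("hypothetical_missing_module_for_demo")
except ImportError as e:
    caught = missing_module_name(e)
```

In a `unittest` test the same pattern could be passed to a regex-based assertion, so that the expected message does not depend on the interpreter version.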

SparkQA · 2018-02-08T07:25:52Z

Test build #87185 has finished for PR 20534 at commit c110e34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…w versions in PySpark tests (to skip or test)

This PR backports #20487 to branch-2.3.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20534 from HyukjinKwon/PR_TOOL_PICK_PR_20487_BRANCH-2.3.

HyukjinKwon · 2018-02-08T07:48:06Z

Merged to branch-2.3.

HyukjinKwon closed this Feb 8, 2018

HyukjinKwon deleted the PR_TOOL_PICK_PR_20487_BRANCH-2.3 branch October 16, 2018 12:44
