
Commit 044b33b

[SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+
## What changes were proposed in this pull request?

This PR proposes to make PySpark's tests compatible with NumPy 1.14+. NumPy 1.14.x introduced rather radical changes to its string representation. For example, the tests below fail:

```
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 895, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  2.],
                 [ 1.,  3.]])
Got:
    DenseMatrix([[0., 2.],
                 [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 899, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  1.],
                 [ 2.,  3.]])
Got:
    DenseMatrix([[0., 1.],
                 [2., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 939, in __main__.DenseMatrix.toArray
Failed example:
    m.toArray()
Expected:
    array([[ 0.,  2.],
           [ 1.,  3.]])
Got:
    array([[0., 2.],
           [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 324, in __main__.DenseVector.dot
Failed example:
    dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
Expected:
    array([  5.,  11.])
Got:
    array([ 5., 11.])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 567, in __main__.SparseVector.dot
Failed example:
    a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
Expected:
    array([ 22.,  22.])
Got:
    array([22., 22.])
```

See the [release notes](https://docs.scipy.org/doc/numpy-1.14.0/release.html#compatibility-notes).

## How was this patch tested?

Manually tested:

```
$ ./run-tests --python-executables=python3.6,python2.7 --modules=pyspark-ml,pyspark-mllib
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'python2.7']
Will test the following Python modules: ['pyspark-ml', 'pyspark-mllib']
Starting test(python2.7): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.classification
Starting test(python3.6): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.clustering
Finished test(python2.7): pyspark.ml.clustering (54s)
Starting test(python2.7): pyspark.ml.evaluation
Finished test(python2.7): pyspark.ml.classification (74s)
Starting test(python2.7): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.evaluation (27s)
Starting test(python2.7): pyspark.ml.fpm
Finished test(python2.7): pyspark.ml.fpm (0s)
Starting test(python2.7): pyspark.ml.image
Finished test(python2.7): pyspark.ml.image (17s)
Starting test(python2.7): pyspark.ml.linalg.__init__
Finished test(python2.7): pyspark.ml.linalg.__init__ (1s)
Starting test(python2.7): pyspark.ml.recommendation
Finished test(python2.7): pyspark.ml.feature (76s)
Starting test(python2.7): pyspark.ml.regression
Finished test(python2.7): pyspark.ml.recommendation (69s)
Starting test(python2.7): pyspark.ml.stat
Finished test(python2.7): pyspark.ml.regression (45s)
Starting test(python2.7): pyspark.ml.tests
Finished test(python2.7): pyspark.ml.stat (28s)
Starting test(python2.7): pyspark.ml.tuning
Finished test(python2.7): pyspark.ml.tuning (20s)
Starting test(python2.7): pyspark.mllib.classification
Finished test(python2.7): pyspark.mllib.classification (31s)
Starting test(python2.7): pyspark.mllib.clustering
Finished test(python2.7): pyspark.mllib.tests (260s)
Starting test(python2.7): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.tests (266s)
Starting test(python2.7): pyspark.mllib.feature
Finished test(python2.7): pyspark.mllib.evaluation (21s)
Starting test(python2.7): pyspark.mllib.fpm
Finished test(python2.7): pyspark.mllib.feature (38s)
Starting test(python2.7): pyspark.mllib.linalg.__init__
Finished test(python2.7): pyspark.mllib.linalg.__init__ (1s)
Starting test(python2.7): pyspark.mllib.linalg.distributed
Finished test(python2.7): pyspark.mllib.fpm (34s)
Starting test(python2.7): pyspark.mllib.random
Finished test(python2.7): pyspark.mllib.clustering (64s)
Starting test(python2.7): pyspark.mllib.recommendation
Finished test(python2.7): pyspark.mllib.random (15s)
Starting test(python2.7): pyspark.mllib.regression
Finished test(python2.7): pyspark.mllib.linalg.distributed (47s)
Starting test(python2.7): pyspark.mllib.stat.KernelDensity
Finished test(python2.7): pyspark.mllib.stat.KernelDensity (0s)
Starting test(python2.7): pyspark.mllib.stat._statistics
Finished test(python2.7): pyspark.mllib.recommendation (40s)
Starting test(python2.7): pyspark.mllib.tree
Finished test(python2.7): pyspark.mllib.regression (38s)
Starting test(python2.7): pyspark.mllib.util
Finished test(python2.7): pyspark.mllib.stat._statistics (19s)
Starting test(python3.6): pyspark.ml.classification
Finished test(python2.7): pyspark.mllib.tree (26s)
Starting test(python3.6): pyspark.ml.clustering
Finished test(python2.7): pyspark.mllib.util (27s)
Starting test(python3.6): pyspark.ml.evaluation
Finished test(python3.6): pyspark.ml.evaluation (30s)
Starting test(python3.6): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.tests (234s)
Starting test(python3.6): pyspark.ml.fpm
Finished test(python3.6): pyspark.ml.fpm (1s)
Starting test(python3.6): pyspark.ml.image
Finished test(python3.6): pyspark.ml.clustering (55s)
Starting test(python3.6): pyspark.ml.linalg.__init__
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.6): pyspark.ml.recommendation
Finished test(python3.6): pyspark.ml.classification (71s)
Starting test(python3.6): pyspark.ml.regression
Finished test(python3.6): pyspark.ml.image (18s)
Starting test(python3.6): pyspark.ml.stat
Finished test(python3.6): pyspark.ml.stat (37s)
Starting test(python3.6): pyspark.ml.tests
Finished test(python3.6): pyspark.ml.regression (59s)
Starting test(python3.6): pyspark.ml.tuning
Finished test(python3.6): pyspark.ml.feature (93s)
Starting test(python3.6): pyspark.mllib.classification
Finished test(python3.6): pyspark.ml.recommendation (83s)
Starting test(python3.6): pyspark.mllib.clustering
Finished test(python3.6): pyspark.ml.tuning (29s)
Starting test(python3.6): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.evaluation (26s)
Starting test(python3.6): pyspark.mllib.feature
Finished test(python3.6): pyspark.mllib.classification (43s)
Starting test(python3.6): pyspark.mllib.fpm
Finished test(python3.6): pyspark.mllib.clustering (81s)
Starting test(python3.6): pyspark.mllib.linalg.__init__
Finished test(python3.6): pyspark.mllib.linalg.__init__ (2s)
Starting test(python3.6): pyspark.mllib.linalg.distributed
Finished test(python3.6): pyspark.mllib.fpm (48s)
Starting test(python3.6): pyspark.mllib.random
Finished test(python3.6): pyspark.mllib.feature (54s)
Starting test(python3.6): pyspark.mllib.recommendation
Finished test(python3.6): pyspark.mllib.random (18s)
Starting test(python3.6): pyspark.mllib.regression
Finished test(python3.6): pyspark.mllib.linalg.distributed (55s)
Starting test(python3.6): pyspark.mllib.stat.KernelDensity
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.6): pyspark.mllib.stat._statistics
Finished test(python3.6): pyspark.mllib.recommendation (51s)
Starting test(python3.6): pyspark.mllib.tree
Finished test(python3.6): pyspark.mllib.regression (45s)
Starting test(python3.6): pyspark.mllib.util
Finished test(python3.6): pyspark.mllib.stat._statistics (21s)
Finished test(python3.6): pyspark.mllib.tree (27s)
Finished test(python3.6): pyspark.mllib.util (27s)
Finished test(python3.6): pyspark.ml.tests (264s)
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21715 from HyukjinKwon/SPARK-24740.
1 parent 74f6a92 · commit 044b33b
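The repr change this commit works around is easy to reproduce directly. A minimal sketch, assuming NumPy 1.14 or newer is installed (on older releases `set_printoptions` has no `legacy` keyword and the call raises `TypeError`):

```python
import numpy as np

arr = np.array([0.0, 2.0])

# NumPy 1.14+ default: no sign-padding space before non-negative floats.
new_style = repr(arr)  # 'array([0., 2.])'

# Opt back into the pre-1.14 formatting that PySpark's doctests expect.
np.set_printoptions(legacy='1.13')
old_style = repr(arr)  # 'array([ 0.,  2.])'

# Restore the default so the option does not leak into later code.
np.set_printoptions(legacy=False)
```

The option is process-global, which is why the patch sets it once at the top of each doctest runner rather than per test.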

File tree

8 files changed, +47 −0 lines changed


python/pyspark/ml/clustering.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1345,8 +1345,14 @@ def assignClusters(self, dataset):
 
 if __name__ == "__main__":
     import doctest
+    import numpy
     import pyspark.ml.clustering
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.ml.clustering.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
```
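The same guarded call is pasted into each `_test()` block in this commit. As a sketch only (not part of the commit; `_use_legacy_repr` is a hypothetical helper name), the pattern could be factored out like this:

```python
import numpy

def _use_legacy_repr():
    """Pin NumPy's array printing to the pre-1.14 style.

    Returns True if the option was applied, or False on NumPy < 1.14,
    which does not accept the `legacy` keyword but already prints in
    the old style anyway.
    """
    try:
        numpy.set_printoptions(legacy='1.13')
        return True
    except TypeError:
        return False

applied = _use_legacy_repr()
```

Catching `TypeError` rather than checking `numpy.__version__` keeps the guard robust: the call either takes effect or is a no-op on versions where it is unnecessary.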

python/pyspark/ml/linalg/__init__.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -1156,6 +1156,11 @@ def sparse(numRows, numCols, colPtrs, rowIndices, values):
 
 def _test():
     import doctest
+    try:
+        # Numpy 1.14+ changed it's string format.
+        np.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
     if failure_count:
         sys.exit(-1)
```

python/pyspark/ml/stat.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -388,8 +388,14 @@ def summary(self, featuresCol, weightCol=None):
 
 if __name__ == "__main__":
     import doctest
+    import numpy
     import pyspark.ml.stat
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
 
     globs = pyspark.ml.stat.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
```

python/pyspark/mllib/clustering.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1042,7 +1042,13 @@ def train(cls, rdd, k=10, maxIterations=20, docConcentration=-1.0,
 
 def _test():
     import doctest
+    import numpy
     import pyspark.mllib.clustering
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.clustering.__dict__.copy()
     globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
     (failure_count, test_count) = doctest.testmod(globs=globs, optionflags=doctest.ELLIPSIS)
```

python/pyspark/mllib/evaluation.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -532,8 +532,14 @@ def accuracy(self):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
     import pyspark.mllib.evaluation
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.evaluation.__dict__.copy()
     spark = SparkSession.builder\
         .master("local[4]")\
```

python/pyspark/mllib/linalg/__init__.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1368,6 +1368,12 @@ def R(self):
 
 def _test():
     import doctest
+    import numpy
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
     if failure_count:
         sys.exit(-1)
```

python/pyspark/mllib/linalg/distributed.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1364,9 +1364,15 @@ def toCoordinateMatrix(self):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
     from pyspark.mllib.linalg import Matrices
     import pyspark.mllib.linalg.distributed
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.linalg.distributed.__dict__.copy()
     spark = SparkSession.builder\
         .master("local[2]")\
```

python/pyspark/mllib/stat/_statistics.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -303,7 +303,13 @@ def kolmogorovSmirnovTest(data, distName="norm", *params):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = globals().copy()
     spark = SparkSession.builder\
         .master("local[4]")\
```
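Why the print option matters for doctests can be seen in a self-contained sketch; the `double` function and its docstring below are illustrative, not taken from PySpark:

```python
import doctest
import numpy as np

# Write the doctest against the pre-1.14 repr, as PySpark's doctests were.
np.set_printoptions(legacy='1.13')

def double(v):
    """Scale a vector by two.

    >>> double(np.array([0., 2.]))
    array([ 0.,  4.])
    """
    return 2 * v

# Run just this function's doctest. Without the legacy option above,
# NumPy 1.14+ would print 'array([0., 4.])' and the example would fail.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
test = finder.find(double, module=False, globs={'np': np, 'double': double})[0]
result = runner.run(test)
```

This is the same failure mode shown in the commit message: the computed values are identical, and only the whitespace in the rendered output differs.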
