
Commit 044b33b

[SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+
## What changes were proposed in this pull request?

This PR proposes to make PySpark's tests compatible with NumPy 1.14+. NumPy 1.14.x introduced rather radical changes to its string representation. For example, the tests below fail:

```
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 895, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  2.],
                 [ 1.,  3.]])
Got:
    DenseMatrix([[0., 2.],
                 [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 899, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  1.],
                 [ 2.,  3.]])
Got:
    DenseMatrix([[0., 1.],
                 [2., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 939, in __main__.DenseMatrix.toArray
Failed example:
    m.toArray()
Expected:
    array([[ 0.,  2.],
           [ 1.,  3.]])
Got:
    array([[0., 2.],
           [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 324, in __main__.DenseVector.dot
Failed example:
    dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
Expected:
    array([  5.,  11.])
Got:
    array([ 5., 11.])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 567, in __main__.SparseVector.dot
Failed example:
    a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
Expected:
    array([ 22.,  22.])
Got:
    array([22., 22.])
```

See the [release notes](https://docs.scipy.org/doc/numpy-1.14.0/release.html#compatibility-notes).

## How was this patch tested?

Manually tested:

```
$ ./run-tests --python-executables=python3.6,python2.7 --modules=pyspark-ml,pyspark-mllib
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'python2.7']
Will test the following Python modules: ['pyspark-ml', 'pyspark-mllib']
Starting test(python2.7): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.classification
Starting test(python3.6): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.clustering
Finished test(python2.7): pyspark.ml.clustering (54s)
Starting test(python2.7): pyspark.ml.evaluation
Finished test(python2.7): pyspark.ml.classification (74s)
Starting test(python2.7): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.evaluation (27s)
Starting test(python2.7): pyspark.ml.fpm
Finished test(python2.7): pyspark.ml.fpm (0s)
Starting test(python2.7): pyspark.ml.image
Finished test(python2.7): pyspark.ml.image (17s)
Starting test(python2.7): pyspark.ml.linalg.__init__
Finished test(python2.7): pyspark.ml.linalg.__init__ (1s)
Starting test(python2.7): pyspark.ml.recommendation
Finished test(python2.7): pyspark.ml.feature (76s)
Starting test(python2.7): pyspark.ml.regression
Finished test(python2.7): pyspark.ml.recommendation (69s)
Starting test(python2.7): pyspark.ml.stat
Finished test(python2.7): pyspark.ml.regression (45s)
Starting test(python2.7): pyspark.ml.tests
Finished test(python2.7): pyspark.ml.stat (28s)
Starting test(python2.7): pyspark.ml.tuning
Finished test(python2.7): pyspark.ml.tuning (20s)
Starting test(python2.7): pyspark.mllib.classification
Finished test(python2.7): pyspark.mllib.classification (31s)
Starting test(python2.7): pyspark.mllib.clustering
Finished test(python2.7): pyspark.mllib.tests (260s)
Starting test(python2.7): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.tests (266s)
Starting test(python2.7): pyspark.mllib.feature
Finished test(python2.7): pyspark.mllib.evaluation (21s)
Starting test(python2.7): pyspark.mllib.fpm
Finished test(python2.7): pyspark.mllib.feature (38s)
Starting test(python2.7): pyspark.mllib.linalg.__init__
Finished test(python2.7): pyspark.mllib.linalg.__init__ (1s)
Starting test(python2.7): pyspark.mllib.linalg.distributed
Finished test(python2.7): pyspark.mllib.fpm (34s)
Starting test(python2.7): pyspark.mllib.random
Finished test(python2.7): pyspark.mllib.clustering (64s)
Starting test(python2.7): pyspark.mllib.recommendation
Finished test(python2.7): pyspark.mllib.random (15s)
Starting test(python2.7): pyspark.mllib.regression
Finished test(python2.7): pyspark.mllib.linalg.distributed (47s)
Starting test(python2.7): pyspark.mllib.stat.KernelDensity
Finished test(python2.7): pyspark.mllib.stat.KernelDensity (0s)
Starting test(python2.7): pyspark.mllib.stat._statistics
Finished test(python2.7): pyspark.mllib.recommendation (40s)
Starting test(python2.7): pyspark.mllib.tree
Finished test(python2.7): pyspark.mllib.regression (38s)
Starting test(python2.7): pyspark.mllib.util
Finished test(python2.7): pyspark.mllib.stat._statistics (19s)
Starting test(python3.6): pyspark.ml.classification
Finished test(python2.7): pyspark.mllib.tree (26s)
Starting test(python3.6): pyspark.ml.clustering
Finished test(python2.7): pyspark.mllib.util (27s)
Starting test(python3.6): pyspark.ml.evaluation
Finished test(python3.6): pyspark.ml.evaluation (30s)
Starting test(python3.6): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.tests (234s)
Starting test(python3.6): pyspark.ml.fpm
Finished test(python3.6): pyspark.ml.fpm (1s)
Starting test(python3.6): pyspark.ml.image
Finished test(python3.6): pyspark.ml.clustering (55s)
Starting test(python3.6): pyspark.ml.linalg.__init__
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.6): pyspark.ml.recommendation
Finished test(python3.6): pyspark.ml.classification (71s)
Starting test(python3.6): pyspark.ml.regression
Finished test(python3.6): pyspark.ml.image (18s)
Starting test(python3.6): pyspark.ml.stat
Finished test(python3.6): pyspark.ml.stat (37s)
Starting test(python3.6): pyspark.ml.tests
Finished test(python3.6): pyspark.ml.regression (59s)
Starting test(python3.6): pyspark.ml.tuning
Finished test(python3.6): pyspark.ml.feature (93s)
Starting test(python3.6): pyspark.mllib.classification
Finished test(python3.6): pyspark.ml.recommendation (83s)
Starting test(python3.6): pyspark.mllib.clustering
Finished test(python3.6): pyspark.ml.tuning (29s)
Starting test(python3.6): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.evaluation (26s)
Starting test(python3.6): pyspark.mllib.feature
Finished test(python3.6): pyspark.mllib.classification (43s)
Starting test(python3.6): pyspark.mllib.fpm
Finished test(python3.6): pyspark.mllib.clustering (81s)
Starting test(python3.6): pyspark.mllib.linalg.__init__
Finished test(python3.6): pyspark.mllib.linalg.__init__ (2s)
Starting test(python3.6): pyspark.mllib.linalg.distributed
Finished test(python3.6): pyspark.mllib.fpm (48s)
Starting test(python3.6): pyspark.mllib.random
Finished test(python3.6): pyspark.mllib.feature (54s)
Starting test(python3.6): pyspark.mllib.recommendation
Finished test(python3.6): pyspark.mllib.random (18s)
Starting test(python3.6): pyspark.mllib.regression
Finished test(python3.6): pyspark.mllib.linalg.distributed (55s)
Starting test(python3.6): pyspark.mllib.stat.KernelDensity
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.6): pyspark.mllib.stat._statistics
Finished test(python3.6): pyspark.mllib.recommendation (51s)
Starting test(python3.6): pyspark.mllib.tree
Finished test(python3.6): pyspark.mllib.regression (45s)
Starting test(python3.6): pyspark.mllib.util
Finished test(python3.6): pyspark.mllib.stat._statistics (21s)
Finished test(python3.6): pyspark.mllib.tree (27s)
Finished test(python3.6): pyspark.mllib.util (27s)
Finished test(python3.6): pyspark.ml.tests (264s)
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21715 from HyukjinKwon/SPARK-24740.
1 parent 74f6a92 · commit 044b33b
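The repr change this commit works around is easy to reproduce directly. A minimal sketch, assuming NumPy 1.14 or newer is installed (on older releases `set_printoptions` has no `legacy` keyword and the call raises `TypeError`):

```python
import numpy as np

arr = np.array([0.0, 2.0])

# NumPy 1.14+ default: no sign-padding space before non-negative floats.
new_style = repr(arr)  # 'array([0., 2.])'

# Opt back into the pre-1.14 formatting that PySpark's doctests expect.
np.set_printoptions(legacy='1.13')
old_style = repr(arr)  # 'array([ 0.,  2.])'

# Restore the default so the option does not leak into later code.
np.set_printoptions(legacy=False)
```

The option is process-global, which is why the patch sets it once at the top of each doctest runner rather than per test.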

File tree

8 files changed, +47 −0 lines changed


python/pyspark/ml/clustering.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1345,8 +1345,14 @@ def assignClusters(self, dataset):
 
 if __name__ == "__main__":
     import doctest
+    import numpy
     import pyspark.ml.clustering
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.ml.clustering.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
```
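The same guarded call is pasted into each `_test()` block in this commit. As a sketch only (not part of the commit; `_use_legacy_repr` is a hypothetical helper name), the pattern could be factored out like this:

```python
import numpy

def _use_legacy_repr():
    """Pin NumPy's array printing to the pre-1.14 style.

    Returns True if the option was applied, or False on NumPy < 1.14,
    which does not accept the `legacy` keyword but already prints in
    the old style anyway.
    """
    try:
        numpy.set_printoptions(legacy='1.13')
        return True
    except TypeError:
        return False

applied = _use_legacy_repr()
```

Catching `TypeError` rather than checking `numpy.__version__` keeps the guard robust: the call either takes effect or is a no-op on versions where it is unnecessary.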

python/pyspark/ml/linalg/__init__.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -1156,6 +1156,11 @@ def sparse(numRows, numCols, colPtrs, rowIndices, values):
 
 def _test():
     import doctest
+    try:
+        # Numpy 1.14+ changed it's string format.
+        np.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
     if failure_count:
         sys.exit(-1)
```

python/pyspark/ml/stat.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -388,8 +388,14 @@ def summary(self, featuresCol, weightCol=None):
 
 if __name__ == "__main__":
     import doctest
+    import numpy
     import pyspark.ml.stat
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
 
     globs = pyspark.ml.stat.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
```

python/pyspark/mllib/clustering.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1042,7 +1042,13 @@ def train(cls, rdd, k=10, maxIterations=20, docConcentration=-1.0,
 
 def _test():
     import doctest
+    import numpy
     import pyspark.mllib.clustering
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.clustering.__dict__.copy()
     globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
     (failure_count, test_count) = doctest.testmod(globs=globs, optionflags=doctest.ELLIPSIS)
```

python/pyspark/mllib/evaluation.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -532,8 +532,14 @@ def accuracy(self):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
     import pyspark.mllib.evaluation
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.evaluation.__dict__.copy()
     spark = SparkSession.builder\
         .master("local[4]")\
```

python/pyspark/mllib/linalg/__init__.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1368,6 +1368,12 @@ def R(self):
 
 def _test():
     import doctest
+    import numpy
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
     if failure_count:
         sys.exit(-1)
```

python/pyspark/mllib/linalg/distributed.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -1364,9 +1364,15 @@ def toCoordinateMatrix(self):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
     from pyspark.mllib.linalg import Matrices
     import pyspark.mllib.linalg.distributed
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = pyspark.mllib.linalg.distributed.__dict__.copy()
     spark = SparkSession.builder\
         .master("local[2]")\
```

python/pyspark/mllib/stat/_statistics.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -303,7 +303,13 @@ def kolmogorovSmirnovTest(data, distName="norm", *params):
 
 def _test():
     import doctest
+    import numpy
     from pyspark.sql import SparkSession
+    try:
+        # Numpy 1.14+ changed it's string format.
+        numpy.set_printoptions(legacy='1.13')
+    except TypeError:
+        pass
     globs = globals().copy()
     spark = SparkSession.builder\
         .master("local[4]")\
```
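Why the print option matters for doctests can be seen in a self-contained sketch; the `double` function and its docstring below are illustrative, not taken from PySpark:

```python
import doctest
import numpy as np

# Write the doctest against the pre-1.14 repr, as PySpark's doctests were.
np.set_printoptions(legacy='1.13')

def double(v):
    """Scale a vector by two.

    >>> double(np.array([0., 2.]))
    array([ 0.,  4.])
    """
    return 2 * v

# Run just this function's doctest. Without the legacy option above,
# NumPy 1.14+ would print 'array([0., 4.])' and the example would fail.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
test = finder.find(double, module=False, globs={'np': np, 'double': double})[0]
result = runner.run(test)
```

This is the same failure mode shown in the commit message: the computed values are identical, and only the whitespace in the rendered output differs.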
