[SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes #9441

dusenberrymw · 2015-11-03T18:58:49Z

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

RowMatrix ^[1]
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR ^[2]
IndexedRowMatrix ^[3]
1. computeGramianMatrix
CoordinateMatrix
1. transpose
BlockMatrix
1. validate
2. cache
3. persist
4. transpose

[1]: Note: multiply, computeSVD, and computePrincipalComponents are already part of PR #7963 for SPARK-6227.
[2]: Implementing `tallSkinnyQR` uncovered a bug in our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be a type erasure issue with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in `PythonMLlibAPI`, the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue, likely because their helper functions in `PythonMLlibAPI` create the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR and may be better suited to a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
[3]: Note: multiply and computeSVD are already part of PR #7963 for SPARK-6227.
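
For readers unfamiliar with what `computeGramianMatrix` returns, the following is a minimal plain-Python sketch of the quantity involved, the Gramian `A^T A` of a row-oriented matrix. It is not the PySpark or Scala implementation; the function name `gramian_sketch` and the sample rows are assumptions made for illustration. It accumulates one outer product per row, which is roughly why the computation suits a matrix whose rows are distributed.

```python
def gramian_sketch(rows):
    """Return A^T A for a matrix given as an iterable of equal-length rows.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    gram = None
    n = 0
    for row in rows:
        if gram is None:
            # Size the n x n accumulator from the first row.
            n = len(row)
            gram = [[0.0] * n for _ in range(n)]
        elif len(row) != n:
            raise ValueError("all rows must have the same length")
        # Add the outer product row^T * row to the accumulator.
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram if gram is not None else []


# Sample data chosen for this sketch (a 2 x 3 matrix).
sample_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sample_gram = gramian_sketch(sample_rows)
print(sample_gram)
```

The result is always `n x n` and symmetric, where `n` is the number of columns, regardless of how many rows the matrix has.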

dusenberrymw · 2015-11-03T18:59:30Z

@holdenk Could you review this and provide any thoughts you may have?

SparkQA · 2015-11-03T19:45:24Z

Test build #44943 has finished for PR 9441 at commit cbddf10.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
  * class QRDecomposition(object):

SparkQA · 2015-11-03T21:21:03Z

Test build #44954 has finished for PR 9441 at commit 9b5b7ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
  * class QRDecomposition(object):

holdenk · 2015-11-04T00:49:44Z

python/pyspark/mllib/linalg/distributed.py

+        ...                           MatrixEntry(2, 1, 3.7)])
+        >>> mat = CoordinateMatrix(entries)
+        >>> mat_transposed = mat.transpose()
+


Is this blank line intentional?

Yeah, I like the visual clarity when viewing these tests on the Python docs, as it helps indicate that the following two tests rely on the data structures formed above. This is generally the pattern I've followed with these classes for cases with >1 test.
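
As a rough illustration of what `transpose` means for a coordinate-format matrix, here is a plain-Python sketch that swaps the row and column index of each entry. The `Entry` tuple, the helper names, and the sample entries are hypothetical stand-ins, not the PySpark `MatrixEntry` or `CoordinateMatrix` classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a (row, column, value) matrix entry.
Entry = namedtuple("Entry", ["i", "j", "value"])


def transpose_entries(entries):
    """Swap the row and column index of every entry."""
    return [Entry(e.j, e.i, e.value) for e in entries]


def infer_dims(entries):
    """Infer (numRows, numCols) as the largest index plus one in each position."""
    if not entries:
        return (0, 0)
    return (max(e.i for e in entries) + 1, max(e.j for e in entries) + 1)


# Sample entries chosen for this sketch.
entries = [Entry(0, 0, 1.2), Entry(1, 0, 2.0), Entry(2, 1, 3.7)]
transposed = transpose_entries(entries)

print(infer_dims(entries))
print(infer_dims(transposed))
```

Transposing swaps the inferred dimensions, and transposing twice returns the original entries.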

dusenberrymw · 2015-11-04T01:15:38Z

@holdenk Great, thanks for the feedback!

dusenberrymw · 2015-11-04T02:42:08Z

Talking with @holdenk, I've decided to pull the retag fix out into a separate JIRA/PR that blocks this. I've opened #9458 to address that issue, so once that is merged, I'll remove that fix from this PR and then rebase.

…rasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.

`IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

SparkQA · 2016-01-12T00:40:30Z

Test build #49192 has finished for PR 9441 at commit 9c530f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dusenberrymw · 2016-01-12T00:43:08Z

@jkbradley Now that #9458 has been merged, this is ready for review.

dusenberrymw · 2016-01-21T18:57:18Z

ping @jkbradley

dusenberrymw · 2016-04-20T17:32:37Z

@MLnick Thoughts on merging this? It's been sitting for quite some time now, and is just a followup to a few previous commits.

MLnick · 2016-04-21T14:56:30Z

python/pyspark/mllib/linalg/distributed.py

@@ -151,6 +153,151 @@ def numCols(self):
        """
        return self._java_matrix_wrapper.call("numCols")

+    def computeColumnSummaryStatistics(self):


Do these need @since annotations?

Yeah probably, although they would have been a little outdated if I had originally added them. :D
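
For context on what a since-style annotation does, here is a simplified sketch of a decorator that appends a `versionadded` note to a method's docstring. The name `since_sketch`, the `Demo` class, and the version string are assumptions made for illustration; this is not claimed to match the decorator PySpark actually ships.

```python
def since_sketch(version):
    """Return a decorator that appends a Sphinx versionadded note to a docstring.

    Hypothetical, simplified stand-in for a since-style annotation.
    """
    def decorator(func):
        doc = (func.__doc__ or "").rstrip()
        func.__doc__ = doc + "\n\n.. versionadded:: " + str(version)
        return func
    return decorator


class Demo(object):
    """Hypothetical class used only to show the decorator in action."""

    @since_sketch("2.0.0")  # illustrative version string
    def numCols(self):
        """Get the number of columns."""
        return 0


print(Demo.numCols.__doc__)
```

The decorated method behaves exactly as before; only its docstring changes, so the generated docs can show when each method was introduced.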

MLnick · 2016-04-21T15:12:25Z

@dusenberrymw made a high-level pass and generally looks good. I'll go through it again in more detail soon, in particular checking the test cases.

MLnick · 2016-04-21T15:12:45Z

jenkins retest this please

SparkQA · 2016-04-21T15:29:21Z

Test build #56549 has finished for PR 9441 at commit 9c530f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-21T17:49:25Z

Test build #56560 has finished for PR 9441 at commit 9e05eba.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-21T18:44:59Z

Test build #56565 has finished for PR 9441 at commit 0f82902.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dusenberrymw · 2016-04-21T22:25:38Z

@MLnick I've addressed the comments and added the subtract(...) method. Thanks!

SparkQA · 2016-04-21T22:36:14Z

Test build #56597 has finished for PR 9441 at commit c98f6eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-21T22:42:04Z

Test build #56598 has finished for PR 9441 at commit c0c9565.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dusenberrymw · 2016-04-25T17:17:39Z

@MLnick Any additional thoughts on this, or is it ready to merge?

MLnick · 2016-04-26T19:27:41Z

jenkins retest this please

SparkQA · 2016-04-26T20:10:55Z

Test build #57019 has finished for PR 9441 at commit c0c9565.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-27T17:48:55Z

LGTM, thanks! Merged to master.

dusenberrymw · 2016-04-27T17:52:30Z

Awesome, thanks!

holdenk reviewed Nov 4, 2015

dusenberrymw mentioned this pull request Nov 4, 2015

[SPARK-11497] [MLlib] [Python] PySpark RowMatrix Constructor Has Type Erasure Issue #9458

Closed

dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from 9b5b7ae to 9c530f6 Compare January 12, 2016 00:13

dusenberrymw changed the title ~~[WIP] [SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes~~ [SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes Jan 12, 2016

MLnick reviewed Apr 21, 2016

dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from 9e05eba to 0f82902 Compare April 21, 2016 18:02

dusenberrymw added 3 commits April 21, 2016 14:55

Adding remaining methods to PySpark BlockMatrix: cache, persist, vali…

f503890

…date, transpose.

Adding remaining method to PySpark CoordinateMatrix: transpose.

edf3e45

Adding remaining method to PySpark IndexedRowMatrix: computeGramianMa…

e153279

…trix. Note that 'multiply' and 'computeSVD' are part of the SPARK-6227 PR.

dusenberrymw added 5 commits April 21, 2016 14:55

Adding remaining methods to PySpark RowMatrix: computeGramianMatrix, …

587bea5

…computeCovariance, computeColumnSummaryStatistics, columnSimilarities, tallSkinnyQR.

Improving robustness of PySpark test for Python 2.6.

12baa78

Adding experimental tag to QRDecomposition.

fca41ca

Adding @since annotations, and adding a comment to Scala.

f1410bf

Adding the subtract method.

c0c9565

dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from c98f6eb to c0c9565 Compare April 21, 2016 21:58

asfgit closed this in 607f503 Apr 27, 2016
