-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes #9441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@holdenk Could you review this and provide any thoughts you may have? |
Test build #44943 has finished for PR 9441 at commit
|
Test build #44954 has finished for PR 9441 at commit
|
... MatrixEntry(2, 1, 3.7)]) | ||
>>> mat = CoordinateMatrix(entries) | ||
>>> mat_transposed = mat.transpose() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this blank line intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I like the visual clarity when viewing these tests on the Python docs, as it helps indicate that the following two tests rely on the data structures formed above. This is generally the pattern I've followed with these classes for cases with >1 test.
@holdenk Great, thanks for the feedback! |
…rasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b82203) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
…rasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b82203) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
…rasure Issue As noted in PR apache#9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks apache#9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes apache#9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
9b5b7ae
to
9c530f6
Compare
Test build #49192 has finished for PR 9441 at commit
|
@jkbradley Now that #9458 has been merged, this is ready for review. |
ping @jkbradley |
@MLnick Thoughts on merging this? It's been sitting for quite some time now, and is just a followup to a few previous commits. |
@@ -151,6 +153,151 @@ def numCols(self): | |||
""" | |||
return self._java_matrix_wrapper.call("numCols") | |||
|
|||
def computeColumnSummaryStatistics(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need @since
annotations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah probably, although they would have been a little outdated if I had originally added them. :D
@dusenberrymw made a high-level pass and generally looks good. I'll go through it again in more detail soon, in particular checking the test cases. |
jenkins retest this please |
Test build #56549 has finished for PR 9441 at commit
|
Test build #56560 has finished for PR 9441 at commit
|
9e05eba
to
0f82902
Compare
Test build #56565 has finished for PR 9441 at commit
|
…trix. Note that 'multiply' and 'computeSVD' are part of the SPARK-6227 PR.
…computeCovariance, computeColumnSummaryStatistics, columnSimilarities, tallSkinnyQR.
c98f6eb
to
c0c9565
Compare
@MLnick I've addressed the comments and added the |
Test build #56597 has finished for PR 9441 at commit
|
Test build #56598 has finished for PR 9441 at commit
|
@MLnick Any additional thoughts on this, or is it ready to merge? |
jenkins retest this please |
Test build #57019 has finished for PR 9441 at commit
|
LGTM, thanks! Merged to master. |
Awesome, thanks! |
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
RowMatrix
[1]computeGramianMatrix
computeCovariance
computeColumnSummaryStatistics
columnSimilarities
tallSkinnyQR
[2]IndexedRowMatrix
[3]computeGramianMatrix
CoordinateMatrix
transpose
BlockMatrix
validate
cache
persist
transpose
[1]: Note:
multiply
,computeSVD
, andcomputePrincipalComponents
are already part of PR #7963 for SPARK-6227.[2]: Implementing
tallSkinnyQR
uncovered a bug with our PySparkRowMatrix
constructor. As discussed on the dev list here, there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct aRowMatrix
from anRDD[Vector]
in PythonMLlibAPI, theVector
type is erased, resulting in anRDD[Object]
. Thus, when calling Scala'stallSkinnyQR
from PySpark, we get a JavaClassCastException
in which anObject
cannot be cast to a SparkVector
. As noted in the aforementioned dev list thread, this issue was also encountered withDecisionTrees
, and the fix involved an explicitretag
of the RDD with aVector
type. Thus, this PR currently contains that fix applied to thecreateRowMatrix
helper function inPythonMLlibAPI
.IndexedRowMatrix
andCoordinateMatrix
do not appear to have this issue likely due to their related helper functions inPythonMLlibAPI
creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.[3]: Note:
multiply
andcomputeSVD
are already part of PR #7963 for SPARK-6227.