Skip to content

[SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes #9441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

dusenberrymw
Copy link
Contributor

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

  • RowMatrix [1]
    1. computeGramianMatrix
    2. computeCovariance
    3. computeColumnSummaryStatistics
    4. columnSimilarities
    5. tallSkinnyQR [2]
  • IndexedRowMatrix [3]
    1. computeGramianMatrix
  • CoordinateMatrix
    1. transpose
  • BlockMatrix
    1. validate
    2. cache
    3. persist
    4. transpose

[1]: Note: multiply, computeSVD, and computePrincipalComponents are already part of PR #7963 for SPARK-6227.
[2]: Implementing tallSkinnyQR uncovered a bug with our PySpark RowMatrix constructor. As discussed on the dev list here, there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a RowMatrix from an RDD[Vector] in PythonMLlibAPI, the Vector type is erased, resulting in an RDD[Object]. Thus, when calling Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an Object cannot be cast to a Spark Vector. As noted in the aforementioned dev list thread, this issue was also encountered with DecisionTrees, and the fix involved an explicit retag of the RDD with a Vector type. Thus, this PR currently contains that fix applied to the createRowMatrix helper function in PythonMLlibAPI. IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely due to their related helper functions in PythonMLlibAPI creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
[3]: Note: multiply and computeSVD are already part of PR #7963 for SPARK-6227.

@dusenberrymw
Copy link
Contributor Author

@holdenk Could you review this and provide any thoughts you may have?

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44943 has finished for PR 9441 at commit cbddf10.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class QRDecomposition(object):\n

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44954 has finished for PR 9441 at commit 9b5b7ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class QRDecomposition(object):\n

... MatrixEntry(2, 1, 3.7)])
>>> mat = CoordinateMatrix(entries)
>>> mat_transposed = mat.transpose()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this blank line intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like the visual clarity when viewing these tests on the Python docs, as it helps indicate that the following two tests rely on the data structures formed above. This is generally the pattern I've followed with these classes for cases with >1 test.

@dusenberrymw
Copy link
Contributor Author

@holdenk Great, thanks for the feedback!

@dusenberrymw
Copy link
Contributor Author

Talking with @holdenk, I've decided to pull the retag fix out into a separate JIRA/PR that blocks this. I've opened #9458 to address that issue, so once that is merged, I'll remove that fix from this PR and then rebase.

asfgit pushed a commit that referenced this pull request Dec 11, 2015
…rasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit pushed a commit that referenced this pull request Dec 11, 2015
…rasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 11, 2015
…rasure Issue

As noted in PR apache#9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks apache#9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes apache#9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
@dusenberrymw dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from 9b5b7ae to 9c530f6 Compare January 12, 2016 00:13
@SparkQA
Copy link

SparkQA commented Jan 12, 2016

Test build #49192 has finished for PR 9441 at commit 9c530f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dusenberrymw dusenberrymw changed the title [WIP] [SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes [SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes Jan 12, 2016
@dusenberrymw
Copy link
Contributor Author

@jkbradley Now that #9458 has been merged, this is ready for review.

@dusenberrymw
Copy link
Contributor Author

ping @jkbradley

@dusenberrymw
Copy link
Contributor Author

@MLnick Thoughts on merging this? It's been sitting for quite some time now, and is just a followup to a few previous commits.

@@ -151,6 +153,151 @@ def numCols(self):
"""
return self._java_matrix_wrapper.call("numCols")

def computeColumnSummaryStatistics(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need @since annotations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah probably, although they would have been a little outdated if I had originally added them. :D

@MLnick
Copy link
Contributor

MLnick commented Apr 21, 2016

@dusenberrymw made a high-level pass and generally looks good. I'll go through it again in more detail soon, in particular checking the test cases.

@MLnick
Copy link
Contributor

MLnick commented Apr 21, 2016

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56549 has finished for PR 9441 at commit 9c530f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56560 has finished for PR 9441 at commit 9e05eba.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dusenberrymw dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from 9e05eba to 0f82902 Compare April 21, 2016 18:02
@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56565 has finished for PR 9441 at commit 0f82902.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dusenberrymw dusenberrymw force-pushed the SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra branch from c98f6eb to c0c9565 Compare April 21, 2016 21:58
@dusenberrymw
Copy link
Contributor Author

@MLnick I've addressed the comments and added the subtract(...) method. Thanks!

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56597 has finished for PR 9441 at commit c98f6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56598 has finished for PR 9441 at commit c0c9565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dusenberrymw
Copy link
Contributor Author

@MLnick Any additional thoughts on this, or is it ready to merge?

@MLnick
Copy link
Contributor

MLnick commented Apr 26, 2016

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Apr 26, 2016

Test build #57019 has finished for PR 9441 at commit c0c9565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 27, 2016

LGTM, thanks! Merged to master.

@asfgit asfgit closed this in 607f503 Apr 27, 2016
@dusenberrymw
Copy link
Contributor Author

Awesome, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants