[SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means Clusterer #2634
Conversation
This commit introduces a general distance function trait, PointOps, for the Spark K-Means clusterer. There are no public API changes*.

Issue - Data Capture Test Fails - NEED HELP

The org.apache.spark.mllib.clustering.KMeansClusterSuite test "task size should be small in both training and prediction" fails, suggesting that the RDD data is being captured in a closure. This is quite puzzling. My efforts to solve this problem have failed. I need help to solve this problem.

Distance Function Trait

The PointOps trait defines the distance function. The PointOps trait is more than a simple distance function: it also defines the types of Points and Centers for the clusterer. Standard MLlib Vectors are converted into Points and Centers. In the FastEuclideanOps implementation of PointOps, the Point and Center types include vector norm members. In other distance functions, such as the Kullback-Leibler distance function, the Point and Center types include different values that speed up the distance calculation, in a similar way that caching vector norms speeds up the Euclidean distance function. This addresses SPARK-3219.

Refactoring

To understand the original code, I found it useful to refactor the original implementation into components. You may find it helpful to understand this pull request by looking at the new components and comparing them to their original implementations. Unfortunately, GitHub diff does not help very much with this.

This commit splits the clusterer into a number of components which behave (largely) like their predecessors. KMeansParallel implements the K-Means || initialization algorithm. KMeansRandom implements the K-Means Random initialization algorithm. MultiKMeans implements the K-Means algorithm on multiple sets of cluster centers using a given distance function. Traits for the initializer, KMeansInitializer, and the general K-Means clusterer, MultiKMeansClusterer, are provided to highlight the salient interfaces, with the intent that alternative implementations of these interfaces may be provided in the future.

Performance

This pull request is not focused on performance. Nevertheless, the performance of the KMeans++ implementation was dramatically improved by NOT recomputing distances to cluster centers that were present in previous steps. This turns a quadratic implementation into a linear one. Second, the KMeans++ implementation uses the general K-Means clusterer in the final step, which parallelizes a step that was sequential. Together, these changes address SPARK-3424.

Next Steps

This pull request does not introduce new user-visible changes. The next step is to make different distance functions available through a user-visible API. I will provide other distance functions after this pull request has been accepted. Then, we can decide on an appropriate user-level API to access those functions.

Compatibility

While there are no user-level API changes, the behavior of the clusterer is different on some tests. Specifically, the handling of empty clusters has changed: empty clusters are not filled with random points in this implementation. The former behavior is undesirable for a number of reasons, not least of which is that there is no reasonable use for duplicate cluster centers. The test cases were changed to accommodate the change in behavior. This addresses SPARK-3261.

The private K-Means constructor that was used by some Java test code and one example was replaced with a Scala constructor that is not Java-friendly. Since the constructor was not user-visible, I simply changed the Java test code and the example to use the higher-level interface.

Testing

This code has been tested (albeit while packaged outside of Spark) and its performance measured on data sets of millions of points, each with hundreds of dimensions, and on tens of thousands of clusters.
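For illustration, here is a rough sketch of the PointOps design described above, with hypothetical names and simplified types (the actual trait in this PR differs): each implementation fixes the Point and Center types and caches per-vector data, such as norms, to speed up repeated distance evaluations.

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical sketch of the PointOps idea, not the exact trait in this PR.
// P and C carry precomputed data that accelerates distance calculations.
trait PointOps[P, C] {
  def toPoint(v: Vector): P
  def toCenter(v: Vector): C
  def distance(p: P, c: C): Double
}

// Euclidean example: cache the squared norm alongside each vector so that
// ||p - c||^2 = ||p||^2 + ||c||^2 - 2<p, c> never recomputes the norms.
case class NormedVec(values: Array[Double], norm2: Double)

object FastEuclideanOpsSketch extends PointOps[NormedVec, NormedVec] {
  private def normed(a: Array[Double]) = NormedVec(a, a.map(x => x * x).sum)

  def toPoint(v: Vector): NormedVec = normed(v.toArray)
  def toCenter(v: Vector): NormedVec = normed(v.toArray)

  def distance(p: NormedVec, c: NormedVec): Double = {
    var dot = 0.0
    var i = 0
    while (i < p.values.length) { dot += p.values(i) * c.values(i); i += 1 }
    math.max(p.norm2 + c.norm2 - 2.0 * dot, 0.0) // clamp rounding noise
  }
}
```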
@mengxr I created a new clean pull request. I still need help to understand/fix a closure that is capturing too much data.
QA tests have started for PR 2634 at commit
QA tests have finished for PR 2634 at commit
This is as expected. I need help in solving the closure data capture problem.
@derrickburns This is something we can try. Avoiding serializing unnecessary objects is good practice, but I'm not sure whether it is worth the effort. Btw, could you update your PR to follow the Spark Code Style Guide (https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)? Thanks!
I ran the style tests. They pass. Is there something else in the style guide that is not captured in the tests? I have expended much effort to avoid serializing unnecessary objects. I'm still perplexed why so much data is being captured in the closure that the test fails. Anyway, what are the next steps? Omit the test and approve the PR? Ask someone to help fix the code to avoid the unit test failure? Thx!
@derrickburns The style test doesn't capture everything, unfortunately. The Spark Code Style Guide is the first place to check. I will mark a few examples inline. I think the best way to track down the problem is to split this PR into small ones and see whether we can find it. Is there a good way to split this? Thanks!
 * A clustering model for K-means. Each point belongs to the cluster with the closest center.
 */
private[mllib] class GeneralizedKMeansModel[P<:FP, C<:FP](
  val pointOps: PointOps[P, C],
4-space indentation
@derrickburns I marked a few style problems (not all of them). There are breaking changes in your PR, which we should avoid as much as possible. Even if we want to remove some methods, we should deprecate them first and remove them in a later release. For the serialization problem, I'm not sure whether I have time to look into it this week. It would be nice to split this into small PRs, so we can track down the problem faster.
I know exactly which closure is exceeding the size limitation. The problem, …
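For context, the usual culprit in a failing task-size test is a closure that references a member field: Scala compiles such a reference to this.field, so the entire enclosing object is serialized into every task. The standard remedy, sketched below with hypothetical names (this is not the actual PR code), is to copy the field into a local val before the closure.

```scala
import org.apache.spark.rdd.RDD

object NearestCenter {
  // Index of the closest center under squared Euclidean distance.
  def apply(p: Array[Double], cs: Array[Array[Double]]): Int =
    cs.indices.minBy { j =>
      var d = 0.0
      var i = 0
      while (i < p.length) { val t = p(i) - cs(j)(i); d += t * t; i += 1 }
      d
    }
}

class Assigner(val centers: Array[Array[Double]],
               val auditLog: Array[Byte]) extends Serializable {

  // BAD: referencing the field `centers` inside the closure compiles to
  // `this.centers`, so the whole Assigner -- including the unrelated
  // auditLog -- is serialized into every task, inflating task size.
  def assignBad(points: RDD[Array[Double]]): RDD[Int] =
    points.map(p => NearestCenter(p, centers))

  // GOOD: copy the field into a local val first; only the centers array
  // travels with the closure.
  def assignGood(points: RDD[Array[Double]]): RDD[Int] = {
    val localCenters = centers
    points.map(p => NearestCenter(p, localCenters))
  }
}
```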
Is there an IntelliJ or Eclipse configuration that I can use to reformat … The breaking change in the PR is for a private constructor that is, …
That would be great!
@derrickburns I'm going to check the implementation during the break. Since it is hard to locate the serialization issue, do you mind me making changes to your implementation and sending you a pull request afterward?
No problem. Thx!
Thx for taking this on.
@derrickburns I tried to merge this PR with the current master, but there are many conflicts, mainly because we removed breeze from the implementation. To merge the changes, I think we should make separate PRs for the JIRAs mentioned here, from fixes to new features. Do you mind me splitting this PR? Thanks!
I've said this before, so please forgive me for being repetitive. The new implementation is a rewrite, not a patch, so it is not possible to …
The pull request that you integrated on December 3 is redundant to this one. Are you willing to integrate this pull request as is?
(@derrickburns this PR can't be merged as-is. It contains merge conflicts with master now, as GitHub notes here. You would need to rebase it.)
@derrickburns I like the improvements implemented in this PR. But as @srowen mentioned, we have to resolve conflicts with the master branch before we can merge any PR. I compared the performance of this PR with master on mnist-digits (60000x784, sparse, 10 clusters) locally and found that master runs 2-3x faster. I guess this is mainly caused by two changes.
I think it is still feasible to include features through separate PRs:
Putting all of them together would certainly delay the review process and require resolving conflicts. I may have some time to prepare PRs for some of the features here, if you don't mind. For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors:

d_f(x, y) = f(x) - f(y) - <∇f(y), x - y>

where f is a strictly convex function (a code sketch follows below). Besides those comments, I'm going to make some minor comments inline.
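A minimal sketch of that formulation in code, assuming the standard Bregman definition above (names are hypothetical; this is not the implementation in either branch):

```scala
// Bregman divergence d_f(x, y) = f(x) - f(y) - <∇f(y), x - y> for a
// strictly convex f. A sketch with hypothetical names, assuming dense
// arrays with strictly positive coordinates where logs are taken.
trait BregmanDivergence {
  def f(x: Array[Double]): Double
  def gradF(x: Array[Double]): Array[Double]

  def divergence(x: Array[Double], y: Array[Double]): Double = {
    val g = gradF(y)
    var inner = 0.0
    var i = 0
    while (i < x.length) { inner += g(i) * (x(i) - y(i)); i += 1 }
    f(x) - f(y) - inner
  }
}

// f(x) = sum x_i log x_i yields the generalized KL divergence.
object KLDivergenceSketch extends BregmanDivergence {
  def f(x: Array[Double]): Double = x.map(v => v * math.log(v)).sum
  def gradF(x: Array[Double]): Array[Double] = x.map(v => math.log(v) + 1.0)
}

// f(x) = ||x||^2 recovers squared Euclidean distance.
object SquaredEuclideanSketch extends BregmanDivergence {
  def f(x: Array[Double]): Double = x.map(v => v * v).sum
  def gradF(x: Array[Double]): Array[Double] = x.map(_ * 2.0)
}
```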
 * limitations under the License.
 */

package org.apache.spark.mllib
Should it be mllib.clustering, as the file is under clustering/?
Thanks for the information on the speedup that you obtained by eliminating … One can certainly split out 1-4 AFTER doing 5. My point was that doing 5 …
As it turns out, I now have a need to perform clustering on sparse data, so … I do not understand your conclusion that for KL-divergence and generalized … For example, in the problem that I am currently working on, I am creating …
I see the problem with sparse vectors and the KL divergence. I implemented a smoothing operation to approximate KL divergence.
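One common form of such smoothing (a sketch of the general idea, not necessarily the operation implemented here) is additive smoothing: mix a small constant mass into every coordinate and renormalize, so that zero entries in sparse vectors never produce log(0).

```scala
// Additive smoothing before computing KL divergence, so that zero
// coordinates in sparse data never produce log(0). A sketch of the
// general idea, not necessarily this PR's smoothing operation.
def smoothedKL(x: Array[Double], y: Array[Double], eps: Double = 1e-9): Double = {
  require(x.length == y.length)
  def smooth(v: Array[Double]): Array[Double] = {
    val total = v.sum + eps * v.length
    v.map(vi => (vi + eps) / total) // renormalize to a distribution
  }
  val p = smooth(x)
  val q = smooth(y)
  var d = 0.0
  var i = 0
  while (i < p.length) { d += p(i) * math.log(p(i) / q(i)); i += 1 }
  d
}
```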
QA tests have started for PR 2634 at commit
QA tests have finished for PR 2634 at commit
I have implemented several variants of Kullback-Leibler divergence in …
One more thing regarding sparse vectors. Sparse vectors can become dense … To address this problem, one can project clusters onto a sparse vector …
@derrickburns We use dense vectors to store cluster centers, because the centers are very likely to become dense during aggregation. If there are zeros, they can be efficiently compressed before being sent back to the driver. For performance, we never add two sparse vectors together. I'm not sure whether this answers your question.
In my application (n-gram contexts), the sparse vectors can be of extremely high dimension. To make the problem manageable, I select the k most important dimensions per point. For a cluster of m points, I can have a sparse vector of m*k non-zero values. Since some clusters can become quite large (O(n) in size), I can get sparse vectors of O(nk) non-zero values. Still, the vector is sparse, since the dimension of the vector is potentially 2^31-1, so I cannot treat the vector as dense. To deal with the growth problem, I have implemented a centroid that only retains the k most important features.
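A sketch of the truncation idea described above (hypothetical names, not the actual implementation): after updating a centroid, keep only its k largest-magnitude coordinates so the representation stays sparse as points accumulate.

```scala
// Retain only the k largest-magnitude coordinates of a sparse centroid,
// bounding its growth as cluster members are added. A sketch of the idea
// described above, not the actual implementation.
def truncateTopK(
    indices: Array[Int],
    values: Array[Double],
    k: Int): (Array[Int], Array[Double]) = {
  if (indices.length <= k) {
    (indices, values)
  } else {
    val kept = indices.zip(values)
      .sortBy { case (_, v) => -math.abs(v) } // most important first
      .take(k)
      .sortBy(_._1) // restore ascending index order
    (kept.map(_._1), kept.map(_._2))
  }
}
```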
This PR ports the following feature implemented in #2634 by derrickburns:
* During k-means|| initialization, we should cache costs (squared distances) previously computed.

It also contains the following optimizations:
* aggregate sumCosts directly
* ran multiple (#runs) k-means++ in parallel

I compared the performance locally on mnist-digit. [timing screenshots: before this patch vs. with this patch] It is clear that each k-means|| iteration takes about the same amount of time with this patch.

Authors: Derrick Burns <derrickburns@gmail.com>, Xiangrui Meng <meng@databricks.com>

Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits:
0a875ec [Xiangrui Meng] address comments
4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means||
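The cost-caching optimization in that port, roughly: each point keeps its cost (squared distance) to the nearest center chosen so far, and each k-means|| round updates it against only the newly sampled centers instead of rescanning all centers. A sketch under hypothetical names, assuming points and costs are identically partitioned:

```scala
import org.apache.spark.rdd.RDD

// Update cached per-point costs against only the newly added centers,
// rather than recomputing distances to every center each round.
// A sketch of the idea; assumes points and costs zip one-to-one.
def updateCosts(
    points: RDD[Array[Double]],
    costs: RDD[Double],
    newCenters: Array[Array[Double]]): RDD[Double] = {
  points.zip(costs).map { case (p, oldCost) =>
    var best = oldCost
    for (c <- newCenters) {
      var d = 0.0
      var i = 0
      while (i < p.length) { val t = p(i) - c(i); d += t * t; i += 1 }
      best = math.min(best, d)
    }
    best
  }
}
```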
Do you mind closing this PR?
Sure, but I'm traveling now. Would you mind closing it for me?
We can't close it directly, but there's an automated process that will eventually. Don't worry about it and/or get to it later.
Ok thx!