
[SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means Clusterer #2634

Closed
wants to merge 7 commits

Conversation

derrickburns

This commit introduces a general distance function trait, PointOps, for the Spark K-Means clusterer. There are no public API changes*.

Issue - Data Capture Test Fails - NEED HELP

The org.apache.spark.mllib.clustering.KMeansClusterSuite test "task size should be small in both training and prediction" fails, suggesting that the RDD data is being captured in a closure. This is quite puzzling: my efforts to solve the problem have failed, and I need help.

Distance Function Trait

The PointOps trait defines the distance function, but it is more than a simple distance function: it also defines the Point and Center types used by the clusterer. Standard MLLIB Vectors are converted into Points and Centers. In the FastEuclideanOps implementation of PointOps, the Point and Center types include vector norm members. For other distance functions, such as the Kullback-Leibler divergence, the Point and Center types include different cached values that speed up the distance calculation, in much the same way that caching vector norms speeds up the Euclidean distance function. This addresses SPARK-3219.
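To make the idea concrete, here is a minimal, self-contained sketch of what such a trait could look like. The member names and the Array[Double]-based vectors are illustrative assumptions, not the PR's actual signatures:

// Illustrative sketch only; the trait in this PR may differ in detail.
trait PointOps[P, C] {
  def toPoint(v: Array[Double]): P          // convert a raw vector into a Point
  def toCenter(v: Array[Double]): C         // convert a raw vector into a Center
  def distance(p: P, c: C): Double          // divergence from a point to a center
}

// Euclidean case: each Point/Center carries its squared norm so that
// d(x, y)^2 = ||x||^2 + ||y||^2 - 2<x, y> needs only one dot product.
final case class NormedVector(values: Array[Double], norm2: Double)

object FastEuclideanOpsSketch extends PointOps[NormedVector, NormedVector] {
  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }
  private def normed(v: Array[Double]): NormedVector = NormedVector(v, dot(v, v))
  def toPoint(v: Array[Double]): NormedVector = normed(v)
  def toCenter(v: Array[Double]): NormedVector = normed(v)
  // squared Euclidean distance computed from the cached norms
  def distance(p: NormedVector, c: NormedVector): Double =
    math.max(p.norm2 + c.norm2 - 2.0 * dot(p.values, c.values), 0.0)
}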

Refactoring

To understand the original code, I found it useful to refactor the original implementation into components. You may find it easier to understand this pull request by looking at the new components and comparing them to their original implementations. Unfortunately, the GitHub diff does not help very much with this.

This commit splits up the clusterer into a number of components that behave (largely) like their predecessors. KMeansParallel implements the K-Means || initialization algorithm. KMeansRandom implements the K-Means Random initialization algorithm. MultiKMeans implements the K-Means algorithm on multiple sets of cluster centers using a given distance function. Traits for the initializer, KMeansInitializer, and for the general K-Means clusterer, MultiKMeansClusterer, are provided to highlight the salient interfaces, with the intent that alternative implementations may be provided in the future; a rough sketch of these interfaces follows.
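The shape of those two traits might look roughly like the sketch below, building on the PointOps sketch above. The parameter lists here are assumptions for illustration, not the PR's exact signatures:

import org.apache.spark.rdd.RDD

// Shape of the interfaces only; parameter lists are illustrative.
trait KMeansInitializer[P, C] {
  // produce `runs` independent sets of initial centers for the given points
  def init(ops: PointOps[P, C], data: RDD[P], k: Int, runs: Int): Array[Array[C]]
}

trait MultiKMeansClusterer[P, C] {
  // run Lloyd's iterations on all center sets at once and return the best (cost, centers)
  def cluster(ops: PointOps[P, C], data: RDD[P],
              centers: Array[Array[C]], maxIterations: Int): (Double, Array[C])
}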

Performance

This pull request is not focused on performance. Nevertheless, the performance of the KMeans++ implementation was dramatically improved by NOT recomputing distances to cluster centers that were present in previous steps. This turns a quadratic implementation into a linear one.

Second, the KMeans++ implementation uses the general K-Means clusterer in the final step. This parallelizes a step that was sequential.

Together, these changes address SPARK-3424.
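The linear behavior comes from carrying each point's best-known cost forward between initialization rounds, so each round only measures distances to the centers added in that round. A minimal local sketch of that update follows (illustrative names, plain arrays rather than RDDs):

// Sketch of the cost-caching idea over local collections; the PR applies the same idea to RDDs.
def updateCosts(points: Array[Array[Double]],
                oldCosts: Array[Double],
                newCenters: Array[Array[Double]]): Array[Double] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }
  // Each point keeps the smallest cost seen so far and is compared only against
  // the centers added in this round, not against all centers chosen so far.
  points.zip(oldCosts).map { case (p, best) =>
    newCenters.foldLeft(best)((acc, c) => math.min(acc, sqDist(p, c)))
  }
}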

Next Steps

This pull request does not introduce new user-visible changes. The next step is to make different distance functions available through a user-visible API. I will provide other distance functions after this pull request has been accepted. Then, we can decide on an appropriate user-level API to access those functions.

Compatibility

While there are no user-level API changes, the behavior of the clusterer is different on some tests. Specifically, the handling of empty clusters has changed. Empty clusters are not filled with random points in this implementation. The former behavior is undesirable for a number of reasons, not the least of which is that there is no reasonable use for duplicate cluster centers. To accommodate the change in behavior, the test cases were changed accordingly. This addresses SPARK-3261.

The private K-Means constructor, which was used by some test Java code and one example, was replaced with a Scala constructor that is not Java-friendly. Since the constructor was not user-visible, I simply changed the Java test code and the example to use the higher-level interface.

Testing

This code has been tested (albeit while packaged outside of Spark), and performance has been measured on data sets of millions of features, each with hundreds of dimensions, and on tens of thousands of clusters.

@derrickburns
Author

@mengxr I created a new, clean pull request. I still need help to understand and fix a closure that is capturing too much data.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2634 at commit fbfdcd8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2634 at commit fbfdcd8.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@derrickburns
Author

@mengxr

This is as expected. I need help in solving the closure data capture problem.

@mengxr
Contributor

mengxr commented Oct 3, 2014

@derrickburns The *ClusterSuite was created to prevent unnecessary objects from being referenced in the task closure. You can try removing Serializable from the algorithms. While the models are serializable, the algorithm instances should stay on the driver node. If you want to use a member method in a task closure, either make it static or define it as a local method. If you want to use a member variable, assign it to a local val first.
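As an illustration of that advice (hypothetical class and field names, not code from this PR), copying the needed field into a local val keeps the closure from capturing the whole algorithm instance:

import org.apache.spark.rdd.RDD

// Hypothetical example of the pattern described above.
class Clusterer(val k: Int) {                // algorithm object; it should stay on the driver
  def countMultiplesOfK(data: RDD[Int]): Long = {
    val localK = k                           // copy the member into a local val first
    data.filter(_ % localK == 0).count()     // the closure captures only localK, not `this`
  }
}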

This is something we can try. Avoiding serializing unnecessary objects is a good practice, but I'm not sure whether it is worth the effort.

Btw, could you update your PR following the https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide ? Thanks!

@derrickburns
Author

I ran the style tests. They pass. Is there something else in the style guide that is not captured in the tests?

I have expended much effort to avoid serializing unnecessary objects. I'm still perplexed why so much data is being captured in the closure that the test fails.

Anyway, what are the next steps? Omit the test and approve the PR? Ask someone to help fix the code to avoid the unit test failure?

Thx!

@mengxr
Contributor

mengxr commented Oct 6, 2014

@derrickburns The style test doesn't catch everything, unfortunately. The Spark Code Style Guide is the first place to check. I will mark a few examples inline.

I think the best way to track down the problem is to split this PR into small ones and see whether we can find it. Is there a good way to split this? Thanks!

* A clustering model for K-means. Each point belongs to the cluster with the closest center.
*/
private[mllib] class GeneralizedKMeansModel[P<:FP, C<:FP](
val pointOps: PointOps[P, C],
mengxr (inline review comment): 4-space indentation

@mengxr
Contributor

mengxr commented Oct 6, 2014

@derrickburns I marked a few style problems (not all of them). There are breaking changes in your PR, which we should avoid as much as possible. Even if we want to remove some methods, we should deprecate them first and remove them in a later release.

For the serialization problem, I'm not sure whether I'll have time to look into it this week. It would be nice to split this into small PRs, so we can track down the problem faster.

@derrickburns
Author

I know exactly which closure is exceeding the size limitation. The problem is that I cannot see how to make the closure capture less data!

@derrickburns
Author

@mengxr

Is there an IntelliJ or Eclipse configuration that I can use to reformat the code according to the guidelines?

The breaking change in the PR is for a private constructor that is, therefore, not used outside of Spark. Does it need to be deprecated first as well?

@derrickburns
Author

That would be great!

On Sat, Dec 27, 2014, Nicholas Chammas wrote:

@mengxr Now that 1.2.0 is out, can we schedule a rough timeframe for reviewing this patch?

@mengxr
Contributor

mengxr commented Dec 30, 2014

@derrickburns I'm going to check the implementation during the break. Since it is hard to locate the serialization issue, do you mind me making changes to your implementation and sending you a pull request after?

@derrickburns
Author

No problem. Thx!

@derrickburns
Author

Thx for taking this on.

@mengxr
Contributor

mengxr commented Jan 5, 2015

@derrickburns I tried to merge this PR with the current master, but there are many conflicts, mainly because we removed Breeze from the implementation. To merge the changes, I think we should make separate PRs for the JIRAs mentioned here, from fixes to new features. Do you mind me splitting this PR? Thanks!

@derrickburns
Author

I've said this before, so please forgive me for being repetitive.

The new implementation is a rewrite, not a patch, so it is not possible to parcel out the changes into distinct PRs.

@derrickburns
Author

The pull request that you integrated on December 3 is redundant to this one. Therefore you need not worry about merging in those changes. Simply select this version.

Are you willing to integrate this pull request as-is?

@srowen
Member

srowen commented Jan 7, 2015

(@derrickburns this PR can't be merged as-is. It contains merge conflicts with master now, as GitHub notes here. You would need to rebase it.)

@mengxr
Contributor

mengxr commented Jan 8, 2015

@derrickburns I like the improvements implemented in this PR. But as @srowen mentioned, we have to resolve conflicts with the master branch before we can merge any PR. I compared the performance of this PR with master on mnist-digits (60000x784, sparse, 10 clusters) locally and found that master runs 2-3x faster. I guess this is mainly caused by two changes.

  1. We replaced Breeze operations with our own implementation. The latter is about 2-3x faster.
  2. Running k-means++ distributively has noticeable overhead with small k and feature dimension.

I think it is still feasible to include features through separate PRs:

  1. remember previously computed best distances in k-means++ initialization
  2. allow fixing the random seed (addressed in [SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed #3610)
  3. variable number of clusters. We should discuss whether we want to have fewer than k clusters or split the biggest one if there are more than k points.
  4. parallelize k-means++. I think whether we should replace local k-means++ or make it configurable requires some discussion and performance comparison.
  5. support Bregman divergences

Putting all of them together would certainly delay the review process and require resolving conflicts. I may have some time to prepare PRs for some of the features here, if you don't mind.

For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors:

d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>

where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached, and <x, g(y)> can take advantage of sparse x. But I'm not sure whether this formulation is really useful for any Bregman divergence other than the squared distance and the Mahalanobis distance. For KL-divergence and generalized I-divergence, the domain is R^d_+ and hence the points cannot be sparse.
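To make the factorization concrete, here is a small self-contained sketch using the squared Euclidean distance (f(x) = ||x||^2, so g(y) = 2y) as the Bregman generator; the names and the sparse (indices, values) representation are illustrative assumptions:

// The per-center terms g(y) and f(y) - <y, g(y)> are cached once, so each
// point-to-center evaluation needs only one sparse dot product <x, g(y)>.
final case class CachedCenter(gy: Array[Double],   // g(y), dense
                              offset: Double)      // f(y) - <y, g(y)>

def cacheCenter(y: Array[Double]): CachedCenter = {
  val gy = y.map(2.0 * _)
  val fy = y.map(v => v * v).sum
  val yDotGy = y.zip(gy).map { case (a, b) => a * b }.sum
  CachedCenter(gy, fy - yDotGy)
}

// x is given in sparse form as (indices, values); fx = f(x) can be precomputed per point.
def divergence(fx: Double, indices: Array[Int], values: Array[Double],
               c: CachedCenter): Double = {
  var xDotGy = 0.0
  var i = 0
  while (i < indices.length) { xDotGy += values(i) * c.gy(indices(i)); i += 1 }
  fx - c.offset - xDotGy    // equals ||x - y||^2 for this choice of f
}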

Besides those comments, I'm going to make some minor comments inline.

* limitations under the License.
*/

package org.apache.spark.mllib
mengxr (inline review comment): Should it be mllib.clustering as the file is under clustering/?

@derrickburns
Author

Thanks for the information on the speedup that you obtained by eliminating Breeze. I was unaware that the performance was so poor. To what do you attribute the poor performance of Breeze?

One can certainly split out 1-4 AFTER doing 5. My point was that doing 5 (supporting Bregman divergences) requires touching practically every method in the clustering package.

  1. Supporting a variable number of clusters requires a significant rewrite as well, since the assumption that there are K clusters is made repeatedly in the code. It seems like adding a way to split clusters introduces a whole other dimension of flexibility -- which I am not against -- that would complicate the interface. For example, what heuristic would you use to identify a cluster to split, and how would you split it? It seems to me that once you go down that path, you are headed toward a divisive clustering algorithm.
  2. As for the overhead of running k-means++ distributively, will someone use Spark for small k and feature dimension?
  3. Bregman divergences

As it turns out, I now have a need to perform clustering on sparse data, so I am making a pass over my private implementation to 1) eliminate the use of Breeze and 2) use the Spark Vector classes. When I am done, I can include these modifications in an updated pull request.

I do not understand your conclusion that for KL-divergence and generalized I-divergence the points cannot be sparse. The domain need not be R^d_+. Rather, the domain can be R^infinity_+.

For example, in the problem that I am currently working on, I am creating sparse vectors where the vectors represent the frequency of occurrence of the contexts for given n-grams. The space of the contexts is extremely large; however, I am only interested in the M most frequent contexts per n-gram. I use a sketch to identify the M most frequent contexts per n-gram. This gives me an M-length SparseVector of non-negative frequency values. Thus, for each n-gram, I will have a SparseVector that represents the distribution of the more frequently occurring contexts. I do not see a problem using this SparseVector with the KL-divergence metric. Am I missing something?

@derrickburns
Author

I see the problem with sparse vectors and the KL divergence.

I implemented a smoothing operation to approximate KL divergence.
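For context, one common smoothing of this kind (a sketch of the general idea, not necessarily the exact operation used here) adds a small epsilon to every coordinate and renormalizes, so KL(p || q) stays finite when q has zero entries:

// Illustrative smoothing so that KL(p || q) is finite even for sparse q.
def klWithSmoothing(p: Array[Double], q: Array[Double], eps: Double = 1e-9): Double = {
  def smoothAndNormalize(v: Array[Double]): Array[Double] = {
    val shifted = v.map(_ + eps)             // no coordinate is exactly zero after smoothing
    val total = shifted.sum
    shifted.map(_ / total)
  }
  val ps = smoothAndNormalize(p)
  val qs = smoothAndNormalize(q)
  ps.zip(qs).map { case (pi, qi) => pi * math.log(pi / qi) }.sum
}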

@SparkQA

SparkQA commented Jan 18, 2015

QA tests have started for PR 2634 at commit 35da8e9.

  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Jan 18, 2015

QA tests have finished for PR 2634 at commit 35da8e9.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@derrickburns
Author

@mengxr

I have implemented several variants of Kullback-Leibler divergence in my separate GitHub repository, https://github.com/derrickburns/generalized-kmeans-clustering. These variants are more efficient than the standard KL-divergence, which is defined on R^n_+, because they take advantage of extra knowledge of the domain. I have used these variants with much success (i.e., much faster running time) in my large-scale clustering runs.

@derrickburns
Author

@mengxr

One more thing regarding sparse vectors. Sparse vectors can become dense under cluster creation, which, in turn, can cause the running time of the K-means clustering to skyrocket.

To address this problem, one can project clusters onto a sparse vector before performing the distance calculation. My current version of the clusterer does this when the appropriate distance object is selected.

@mengxr
Contributor

mengxr commented Jan 20, 2015

@derrickburns We use dense vectors to store cluster centers, because the centers are very likely to become dense during aggregation. If there are zeros, they can be efficiently compressed before sending back to the driver. For performance, we never add two sparse vectors together. I'm not sure whether this answers your question.

@derrickburns
Author

In my application (n-gram contexts), the sparse vectors can be of extremely high dimension. To make the problem manageable, I select the k most important dimensions per point. For a cluster of m points, I can have a sparse vector of m*k non-zero values. Since some clusters can become quite large (O(n) in size), I can get sparse vectors of O(nk) non-zero values. Still, the vector is sparse, since the dimension of the vector is potentially 2^31-1.

So, I cannot treat the vector as dense.

To deal with the growth problem, I have implemented a centroid that only retains the k most important features.
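As a rough illustration of that idea (the real implementation lives in the external generalized-kmeans-clustering repository and is surely more careful), a sparse centroid kept as an index-to-weight map can be truncated to its k heaviest coordinates after each update:

// Illustrative only: retain the k coordinates with the largest absolute weight so the
// accumulated centroid cannot densify without bound as points are added to the cluster.
def truncateCentroid(centroid: Map[Int, Double], k: Int): Map[Int, Double] =
  centroid.toSeq.sortBy { case (_, w) => -math.abs(w) }.take(k).toMap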

asfgit pushed a commit that referenced this pull request Jan 22, 2015
This PR ports the following feature implemented in #2634 by derrickburns:

* During k-means|| initialization, we should cache costs (squared distances) previously computed.

It also contains the following optimization:

* aggregate sumCosts directly
* ran multiple (#runs) k-means++ in parallel

I compared the performance locally on mnist-digit. Before this patch:

![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png)

with this patch:

![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png)

It is clear that each k-means|| iteration takes about the same amount of time with this patch.

Authors:
  Derrick Burns <derrickburns@gmail.com>
  Xiangrui Meng <meng@databricks.com>

Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits:

0a875ec [Xiangrui Meng] address comments
4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means||
@srowen
Member

srowen commented May 18, 2015

Do you mind closing this PR?

@derrickburns
Author

Sure, but I'm traveling now. Would you mind closing it for me?

@srowen
Member

srowen commented May 18, 2015

We can't close it directly, but there's an automated process that will eventually. Don't worry about it, and/or get to it later.

@derrickburns
Author

OK, thx!

asfgit closed this in 6916533 on Jul 9, 2015