[SPARK-3424][MLLIB] cache point distances during k-means|| init #4144


Closed · wants to merge 2 commits

Conversation

@mengxr (Contributor) commented Jan 21, 2015

This PR ports the following feature implemented in #2634 by @derrickburns:

  • During k-means|| initialization, cache the costs (squared distances) computed in previous rounds instead of recomputing them (a hedged sketch follows the performance comparison below).

It also contains the following optimizations:

  • aggregate sumCosts directly
  • run multiple (#runs) k-means++ instances in parallel

I compared the performance locally on mnist-digit. Before this patch:

[image: per-iteration timings before this patch]

With this patch:

[image: per-iteration timings with this patch]

It is clear that each k-means|| iteration takes about the same amount of time with this patch.
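To make the caching idea concrete, here is a minimal sketch under assumed names (`initParallelSketch`, `pointCost`); it is not the code in this PR. Each round compares points only against the centers added in the previous round, takes the minimum with the cached cost, and aggregates the sum of costs directly from the cached RDD.

```scala
// Hedged sketch of cost caching during k-means|| initialization (single run).
import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

def initParallelSketch(
    data: RDD[Vector],
    k: Int,
    rounds: Int,
    pointCost: (Seq[Vector], Vector) => Double): Seq[Vector] = {
  var centers = data.takeSample(withReplacement = false, num = 1).toSeq
  var newCenters = centers
  // Cache the cost (squared distance to the closest center found so far).
  var costs = data.map(p => pointCost(centers, p)).cache()

  for (_ <- 0 until rounds) {
    val preCosts = costs
    // Only compare against the centers added in the previous round and take
    // the min with the cached cost, instead of recomputing over all centers.
    costs = data.zip(preCosts).map { case (p, c) =>
      math.min(pointCost(newCenters, p), c)
    }.cache()
    val sumCosts = costs.sum()   // aggregate sumCosts directly
    preCosts.unpersist(blocking = false)

    // Sample candidate centers with probability proportional to their cost.
    newCenters = data.zip(costs)
      .filter { case (_, c) => Random.nextDouble() < 2.0 * k * c / sumCosts }
      .keys.collect().toSeq
    centers = centers ++ newCenters
  }
  costs.unpersist(blocking = false)
  centers
}
```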

@SparkQA commented Jan 21, 2015

Test build #25909 has started for PR 4144 at commit 4341bb8.

  • This patch merges cleanly.

@SparkQA commented Jan 21, 2015

Test build #25909 has finished for PR 4144 at commit 4341bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25909/
Test FAILed.

@mengxr (Contributor, Author) commented Jan 21, 2015

test this please

@SparkQA commented Jan 21, 2015

Test build #25919 has started for PR 4144 at commit 4341bb8.

  • This patch merges cleanly.

@derrickburns

The conversion of vectors to dense form will only work if the dimension of the space is small, in which case there was little need to provide the vectors in sparse form in the first place. Therefore, one should assume that sparse vectors come from a high-dimensional space, and they should never be converted to a dense representation.
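To illustrate the concern, the squared Euclidean distance between two sparse vectors can be computed from their non-zero entries alone, never allocating a dense array. The helper below is only an illustration, not code from this PR or from MLlib.

```scala
// Illustrative only: squared Euclidean distance between two sparse vectors,
// computed by merging their sorted index arrays without densifying.
import org.apache.spark.mllib.linalg.SparseVector

def sparseSqDist(a: SparseVector, b: SparseVector): Double = {
  var i = 0; var j = 0; var sum = 0.0
  while (i < a.indices.length && j < b.indices.length) {
    if (a.indices(i) == b.indices(j)) {
      val d = a.values(i) - b.values(j); sum += d * d; i += 1; j += 1
    } else if (a.indices(i) < b.indices(j)) {
      sum += a.values(i) * a.values(i); i += 1
    } else {
      sum += b.values(j) * b.values(j); j += 1
    }
  }
  while (i < a.indices.length) { sum += a.values(i) * a.values(i); i += 1 }
  while (j < b.indices.length) { sum += b.values(j) * b.values(j); j += 1 }
  sum
}
```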

@SparkQA commented Jan 21, 2015

Test build #25919 has finished for PR 4144 at commit 4341bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25919/
Test FAILed.

@derrickburns

@mengxr I back-ported your port to my com.massivedatascience.clusterer GitHub project (modulo the conversion of data to dense form). :)

@mengxr (Contributor, Author) commented Jan 21, 2015

@derrickburns This PR doesn't handle sparse centers. The dense representation should work with feature dimensions up to 10M, which may already cover many cases. We can solve that issue in a separate PR. Do the changes in this PR look good to you? (It seems that there is something wrong with Jenkins.)

Feel free to port the features, and it would be great if you could help test the performance. :)

@derrickburns

@mengxr

It looks like the final costs RDD is still persisted on exit from the initialization method.
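A tiny sketch of the fix being suggested, with assumed names (`finishInit`, `costs`); the sketch earlier in this thread already ends with the same unpersist-before-return pattern.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: release the last cached cost RDD before the
// initialization method returns its chosen centers.
def finishInit(costs: RDD[Double], centers: Seq[Vector]): Seq[Vector] = {
  costs.unpersist(blocking = false)
  centers
}
```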


@derrickburns

@mengxr

A nit: I'd pull the range creation on line 331 out of the inner loop.
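A toy illustration of the hoisting being suggested; the surrounding loop structure and names are assumed, not taken from the patch.

```scala
// Illustrative only: build the invariant range once, outside the loop over
// points, instead of re-creating it for every point.
val runs = 4                        // assumed number of parallel runs
val points = Seq(1.0, 2.5, 3.75)    // stand-in for the data points
val runRange = 0 until runs         // created once, hoisted out of the loop
val costs = Array.fill(runs)(0.0)
points.foreach { p =>
  runRange.foreach { r =>
    costs(r) += p * p               // stand-in for a per-run cost update
  }
}
```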


@derrickburns

@mengxr

FYI, I'm about to work on the performance of clustering millions of sparse vectors of very high dimension, particularly when using KL divergence, where smoothing is needed to deal with sparsity.
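As a hedged illustration of the smoothing point (dense arrays are used for brevity, and nothing here is from the PR): additive smoothing keeps every coordinate strictly positive, so the log-ratio in the KL divergence stays defined even when the sparse inputs have zero entries.

```scala
// Illustrative only: KL divergence between two smoothed discrete
// distributions. The additive term `alpha` guards against zero entries,
// which are common with sparse, high-dimensional data.
def smoothedKL(p: Array[Double], q: Array[Double], alpha: Double = 1e-9): Double = {
  require(p.length == q.length, "distributions must have the same dimension")
  val pSum = p.sum + alpha * p.length
  val qSum = q.sum + alpha * q.length
  var kl = 0.0
  var i = 0
  while (i < p.length) {
    val pi = (p(i) + alpha) / pSum
    val qi = (q(i) + alpha) / qSum
    kl += pi * math.log(pi / qi)
    i += 1
  }
  kl
}
```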


@mengxr (Contributor, Author) commented Jan 22, 2015

@derrickburns Thanks! I've addressed your comments!

@SparkQA commented Jan 22, 2015

Test build #25938 has started for PR 4144 at commit 0a875ec.

  • This patch merges cleanly.

@SparkQA commented Jan 22, 2015

Test build #25938 has finished for PR 4144 at commit 0a875ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25938/
Test PASSed.

@derrickburns

@mengxr You may want to refer to the newer code on GitHub, https://github.com/derrickburns/generalized-kmeans-clustering, instead of the old PR going forward.

I've refactored the PointOps interface since the old PR, following your comments.

I also rewrote KMeansPlusPlus in a similar fashion to your changes to KMeansParallel, and it supports sparse points in high dimensions, but the PointOps redesign is the most important change.
@mengxr (Contributor, Author) commented Jan 22, 2015

Sure. Thanks for keeping your implementation updated!

@mengxr (Contributor, Author) commented Jan 22, 2015

I've merged this into master.

@asfgit closed this in ca7910d on Jan 22, 2015