[SPARK-3424][MLLIB] cache point distances during k-means|| init #4144


Closed · wants to merge 2 commits

Conversation

@mengxr (Contributor) commented Jan 21, 2015

This PR ports the following feature implemented in #2634 by @derrickburns:

  • During k-means|| initialization, cache the costs (squared distances) computed in previous rounds instead of recomputing them (a hedged sketch follows the performance comparison below).

It also contains the following optimizations:

  • aggregate sumCosts directly
  • run multiple (#runs) k-means++ instances in parallel

I compared the performance locally on mnist-digit. Before this patch:

[image: per-iteration timings before this patch]

With this patch:

[image: per-iteration timings with this patch]

It is clear that each k-means|| iteration takes about the same amount of time with this patch.
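To make the caching idea concrete, here is a minimal sketch under assumed names (`initParallelSketch`, `pointCost`); it is not the code in this PR. Each round compares points only against the centers added in the previous round, takes the minimum with the cached cost, and aggregates the sum of costs directly from the cached RDD.

```scala
// Hedged sketch of cost caching during k-means|| initialization (single run).
import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

def initParallelSketch(
    data: RDD[Vector],
    k: Int,
    rounds: Int,
    pointCost: (Seq[Vector], Vector) => Double): Seq[Vector] = {
  var centers = data.takeSample(withReplacement = false, num = 1).toSeq
  var newCenters = centers
  // Cache the cost (squared distance to the closest center found so far).
  var costs = data.map(p => pointCost(centers, p)).cache()

  for (_ <- 0 until rounds) {
    val preCosts = costs
    // Only compare against the centers added in the previous round and take
    // the min with the cached cost, instead of recomputing over all centers.
    costs = data.zip(preCosts).map { case (p, c) =>
      math.min(pointCost(newCenters, p), c)
    }.cache()
    val sumCosts = costs.sum()   // aggregate sumCosts directly
    preCosts.unpersist(blocking = false)

    // Sample candidate centers with probability proportional to their cost.
    newCenters = data.zip(costs)
      .filter { case (_, c) => Random.nextDouble() < 2.0 * k * c / sumCosts }
      .keys.collect().toSeq
    centers = centers ++ newCenters
  }
  costs.unpersist(blocking = false)
  centers
}
```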

@SparkQA commented Jan 21, 2015

Test build #25909 has started for PR 4144 at commit 4341bb8.

  • This patch merges cleanly.

@SparkQA commented Jan 21, 2015

Test build #25909 has finished for PR 4144 at commit 4341bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25909/
Test FAILed.

@mengxr (Contributor, Author) commented Jan 21, 2015

test this please

@SparkQA commented Jan 21, 2015

Test build #25919 has started for PR 4144 at commit 4341bb8.

  • This patch merges cleanly.

@derrickburns

The conversion of vectors to dense form will only work if the dimension of the space is small, in which case there was little need to provide the vectors in sparse form in the first place. Therefore, one should assume that sparse vectors come from a high-dimensional space, and they should never be converted to a dense representation.
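To illustrate the concern, the squared Euclidean distance between two sparse vectors can be computed from their non-zero entries alone, never allocating a dense array. The helper below is only an illustration, not code from this PR or from MLlib.

```scala
// Illustrative only: squared Euclidean distance between two sparse vectors,
// computed by merging their sorted index arrays without densifying.
import org.apache.spark.mllib.linalg.SparseVector

def sparseSqDist(a: SparseVector, b: SparseVector): Double = {
  var i = 0; var j = 0; var sum = 0.0
  while (i < a.indices.length && j < b.indices.length) {
    if (a.indices(i) == b.indices(j)) {
      val d = a.values(i) - b.values(j); sum += d * d; i += 1; j += 1
    } else if (a.indices(i) < b.indices(j)) {
      sum += a.values(i) * a.values(i); i += 1
    } else {
      sum += b.values(j) * b.values(j); j += 1
    }
  }
  while (i < a.indices.length) { sum += a.values(i) * a.values(i); i += 1 }
  while (j < b.indices.length) { sum += b.values(j) * b.values(j); j += 1 }
  sum
}
```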

@SparkQA commented Jan 21, 2015

Test build #25919 has finished for PR 4144 at commit 4341bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25919/
Test FAILed.

@derrickburns

@mengxr I back-ported your port to my com.massivedatascience.clusterer GitHub project (modulo the conversion of data to dense form). :)

@mengxr (Contributor, Author) commented Jan 21, 2015

@derrickburns This PR doesn't handle sparse centers. The dense representation should work with feature dimensions up to 10M, which may already cover many cases. We can solve that issue in a separate PR. Do the changes in this PR look good to you? (It seems that there is something wrong with Jenkins.)

Feel free to port the features, and it would be great if you could help test the performance. :)

@derrickburns

@mengxr

It looks like the final costs RDD is still persisted on exit from the initialization method.
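A tiny sketch of the fix being suggested, with assumed names (`finishInit`, `costs`); the sketch earlier in this thread already ends with the same unpersist-before-return pattern.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: release the last cached cost RDD before the
// initialization method returns its chosen centers.
def finishInit(costs: RDD[Double], centers: Seq[Vector]): Seq[Vector] = {
  costs.unpersist(blocking = false)
  centers
}
```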


@derrickburns

@mengxr

A nit: I'd pull the range creation on line 331 out of the inner loop.
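A toy illustration of the hoisting being suggested; the surrounding loop structure and names are assumed, not taken from the patch.

```scala
// Illustrative only: build the invariant range once, outside the loop over
// points, instead of re-creating it for every point.
val runs = 4                        // assumed number of parallel runs
val points = Seq(1.0, 2.5, 3.75)    // stand-in for the data points
val runRange = 0 until runs         // created once, hoisted out of the loop
val costs = Array.fill(runs)(0.0)
points.foreach { p =>
  runRange.foreach { r =>
    costs(r) += p * p               // stand-in for a per-run cost update
  }
}
```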


@derrickburns

@mengxr

FYI, I'm about to work on the performance of clustering millions of sparse vectors of very high dimension, particularly when using KL divergence, where smoothing is needed to deal with sparsity.
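As a hedged illustration of the smoothing point (dense arrays are used for brevity, and nothing here is from the PR): additive smoothing keeps every coordinate strictly positive, so the log-ratio in the KL divergence stays defined even when the sparse inputs have zero entries.

```scala
// Illustrative only: KL divergence between two smoothed discrete
// distributions. The additive term `alpha` guards against zero entries,
// which are common with sparse, high-dimensional data.
def smoothedKL(p: Array[Double], q: Array[Double], alpha: Double = 1e-9): Double = {
  require(p.length == q.length, "distributions must have the same dimension")
  val pSum = p.sum + alpha * p.length
  val qSum = q.sum + alpha * q.length
  var kl = 0.0
  var i = 0
  while (i < p.length) {
    val pi = (p(i) + alpha) / pSum
    val qi = (q(i) + alpha) / qSum
    kl += pi * math.log(pi / qi)
    i += 1
  }
  kl
}
```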


@mengxr (Contributor, Author) commented Jan 22, 2015

@derrickburns Thanks! I've addressed your comments!

@SparkQA commented Jan 22, 2015

Test build #25938 has started for PR 4144 at commit 0a875ec.

  • This patch merges cleanly.

@SparkQA commented Jan 22, 2015

Test build #25938 has finished for PR 4144 at commit 0a875ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25938/
Test PASSed.

@derrickburns

@mengxr You may want to refer to the newer code on GitHub, https://github.com/derrickburns/generalized-kmeans-clustering, instead of the old PR going forward.

I've refactored the PointOps interface since the old PR, following your comments.

I also rewrote KMeansPlusPlus in a similar fashion to your changes to KMeansParallel, and it supports sparse points in high dimensions, but the PointOps redesign is the most important change.
@mengxr (Contributor, Author) commented Jan 22, 2015

Sure. Thanks for keeping your implementation updated!

@mengxr (Contributor, Author) commented Jan 22, 2015

I've merged this into master.

@asfgit closed this in ca7910d on Jan 22, 2015