[SPARK-3424][MLLIB] cache point distances during k-means|| init #4144
Test build #25909 has started for PR 4144 at commit
|
Test build #25909 has finished for PR 4144 at commit
|
Test FAILed. |
test this please |
Test build #25919 has started for PR 4144 at commit
|
The conversion of vectors to dense form will only work if the dimension of the space is small, in which case there was little need to provide vectors in sparse form. Therefore, one should assume that sparse vectors are from a high dimensional space, and should never be converted to a dense representation. |
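A rough back-of-the-envelope sketch of the memory argument above. It assumes 8-byte double values and 4-byte integer indices per stored entry, and ignores object and array header overhead. The dimension and non-zero count are hypothetical numbers chosen for illustration, not figures from this PR.

```python
def dense_bytes(dim):
    """Approximate payload of a dense vector: one 8-byte double per dimension."""
    return 8 * dim


def sparse_bytes(nnz):
    """Approximate payload of a sparse vector: a 4-byte index plus an 8-byte value per non-zero."""
    return (4 + 8) * nnz


# Hypothetical example: a 10,000,000-dimensional vector with 100 non-zero entries.
dim, nnz = 10_000_000, 100
print(dense_bytes(dim))   # 80000000 bytes, about 80 MB per vector
print(sparse_bytes(nnz))  # 1200 bytes per vector
```

Under these assumptions, densifying a single such vector inflates it by a factor of tens of thousands, which is why the cost grows quickly once many points or centers are converted.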
Test build #25919 has finished for PR 4144 at commit
|
Test FAILed. |
@mengxr I back-ported your port to my com.massivedatascience.clusterer GitHub project (modulo the conversion of data to dense form). :) |
@derrickburns This PR doesn't handle sparse centers. The dense one should work with feature dimension up to 10M, which may cover many cases already. We can solve that issue in a separate PR. Do the changes in this PR look good to you? (It seems that there is something wrong with Jenkins.) Feel free to port the features, and it would be great if you could help test the performance :) |
It looks like the final costs RDD is still persisted on exit from the initialization method.
|
A nit: I'd pull the range creation on line 331 out of the inner loop.
|
FYI, I'm about to work on the performance of clustering millions of sparse vectors of very high dimension, particularly when using KL divergence, where smoothing is needed to deal with sparsity.
|
@derrickburns Thanks! I've addressed your comments! |
Test build #25938 has started for PR 4144 at commit
|
Test build #25938 has finished for PR 4144 at commit
|
Test PASSed. |
@mengxr I've refactored the PointOps interface since the old PR following your [...] I also rewrote the KMeansPlusPlus in a similar fashion to your changes.
|
Sure. Thanks for keeping your implementation updated! |
I've merged this into master. |
This PR ports the following feature implemented in #2634 by @derrickburns:
It also contains the following optimization:
I compared the performance locally on mnist-digit. Before this patch:
with this patch:
It is clear that each k-means|| iteration takes about the same amount of time with this patch.
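For readers unfamiliar with the idea in the PR title, here is a minimal single-machine sketch of caching point costs during the oversampling phase of k-means|| initialization. It is plain Python rather than the Spark code in this PR; the function names, the `oversample` parameter, and the default round count are assumptions made for illustration, and the final step that reduces the candidates to k centers is omitted.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_parallel_init(points, k, rounds=5, oversample=2.0, seed=0):
    """Oversampling phase of k-means|| with a cached cost per point.

    Returns the candidate centers and, for each point, the squared
    distance to its nearest candidate center.
    """
    rng = random.Random(seed)
    first = points[rng.randrange(len(points))]
    centers = [first]
    # Cache: cost of each point with respect to the centers chosen so far.
    costs = [sq_dist(p, first) for p in points]
    for _ in range(rounds):
        total = sum(costs)
        if total == 0.0:
            break  # every point already coincides with a center
        # Sample each point with probability proportional to its cached cost.
        new_centers = [p for p, c in zip(points, costs)
                       if rng.random() < oversample * k * c / total]
        if new_centers:
            # Compare only against the newly added centers; older centers
            # are already reflected in the cached cost.
            costs = [min(c, min(sq_dist(p, nc) for nc in new_centers))
                     for p, c in zip(points, costs)]
            centers.extend(new_centers)
    return centers, costs
```

Without the cache, each round would recompute distances to every center chosen so far, so later rounds get slower as the candidate set grows. With the cache, each round only touches the centers added in the previous round.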