-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-6612] [MLLib] [PySpark] Python KMeans parity #5647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #30809 timed out for PR 5647 at commit |
Could anyone please tell us why this timeout has occurred? |
Jenkins, retest this please |
Test build #30912 has finished for PR 5647 at commit
|
Test build #30986 has finished for PR 5647 at commit
|
@FlytxtRnD Checking in: What's the status of fixing the tests? Thanks! It will be great to get this merged for 1.4 (deadline this Friday!) Btw, the earlier test timeout was from Jenkins having a bunch of issues, but hopefully those are resolved now. |
@jkbradley, we are facing some issues with python 3 support. We are working on it and will fix it asap. |
Test build #31114 has finished for PR 5647 at commit
|
@@ -95,6 +106,13 @@ def predict(self, x): | |||
best_distance = distance | |||
return best | |||
|
|||
def computeCost(self, rdd): | |||
"""Return the K-means cost (sum of squared distances of points to their nearest center) for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python style (newline for long doc strings + shorter doc lines <= 72 chars in length following PEP 8):
"""
Return the K-means cost (sum of squared distances of points to their
nearest center) for this model on the given data.
"""
@FlytxtRnD That tiny issue is the only one I see. After that, I think this PR will be good to go. Thanks! |
Test build #31233 has finished for PR 5647 at commit
|
@@ -40,11 +40,16 @@ class KMeansModel(Saveable, Loader): | |||
|
|||
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) | |||
>>> model = KMeans.train( | |||
... sc.parallelize(data), 2, maxIterations=10, runs=30, initializationMode="random") | |||
... sc.parallelize(data), 2, maxIterations=10, runs=30, initializationMode="random", | |||
... seed=None, initializationSteps=5, epsilon=1e-4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry about this, but hoped to add 1 more update. Could you please set the seed to some fixed number in this and the other call to train() in this doc test? It should be deterministic for stability. (It looks like this was a problem before your PR too.) Thanks!
After the seed update, it really will be it. Thanks! |
>>> model.k | ||
2 | ||
>>> model.computeCost(sc.parallelize(data)) | ||
2.0000000000000004 | ||
>>> model = KMeans.train(sc.parallelize(data), 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbradley , It seems we are not using this model anywhere. Did you mean to add the seed here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not used anywhere, then you can leave it as is. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbradley , Shall we remove that line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, let's keep it
Test build #31382 has finished for PR 5647 at commit
|
@jkbradley , could you please check if this is ready to merge? |
LGTM. I'm rerunning tests once more for safety sake and then will merge this. Thank you! |
Test build #758 has finished for PR 5647 at commit
|
unrelated failure... |
Test build #763 has finished for PR 5647 at commit
|
Test build #764 has finished for PR 5647 at commit
|
Test build #31828 has finished for PR 5647 at commit
|
@FlytxtRnD The update confused github. Can you please try closing and re-opening this PR to force github to recompute the diff? Thanks! |
@jkbradley ok, will close and reopen this. Could you please tell us why this happened? |
Hm, that didn't seem to fix it. It looks like you merged this branch with the current master; sometimes, that confuses Github. In general, you shouldn't need to merge with master unless Jenkins posts a notice that the PR can't be merged cleanly. Can you please try rebasing your branch off of the current master? Perhaps that will fix it. |
@jkbradley , we merged it with 'branch-1.4' to get rid of the failed Mima tests in jenkins. |
@jkbradley "Can you please try rebasing your branch off of the current master? Perhaps that will fix it." |
Looking at those logs, I don't think the MIMA tests actually failed. Basing this off of "master" should be fine. Could you please try that? If that does not work, then perhaps you could identify your commit hashes, reset this branch to master, cherry pick your commits, and then force push to your remote branch to update this PR. |
@jkbradley , ok we'll try your steps. |
Test build #31851 has finished for PR 5647 at commit
|
Jenkins, retest this please |
Test build #31865 has finished for PR 5647 at commit
|
The following items are added to Python kmeans: kmeans - setEpsilon, setInitializationSteps KMeansModel - computeCost, k Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com> Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits: b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test 5fd3ced [Hrishikesh Subramonian] doc test corrections 20b3c68 [Hrishikesh Subramonian] python 3 fixes 4d4e695 [Hrishikesh Subramonian] added arguments in python tests 21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added. (cherry picked from commit 5995ada) Signed-off-by: Xiangrui Meng <meng@databricks.com>
Merged into master and branch-1.4. Thanks! |
@jkbradley , @mengxr Thanks for the help! |
The following items are added to Python kmeans: kmeans - setEpsilon, setInitializationSteps KMeansModel - computeCost, k Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com> Closes apache#5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits: b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test 5fd3ced [Hrishikesh Subramonian] doc test corrections 20b3c68 [Hrishikesh Subramonian] python 3 fixes 4d4e695 [Hrishikesh Subramonian] added arguments in python tests 21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
The following items are added to Python kmeans: kmeans - setEpsilon, setInitializationSteps KMeansModel - computeCost, k Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com> Closes apache#5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits: b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test 5fd3ced [Hrishikesh Subramonian] doc test corrections 20b3c68 [Hrishikesh Subramonian] python 3 fixes 4d4e695 [Hrishikesh Subramonian] added arguments in python tests 21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
The following items are added to Python kmeans: kmeans - setEpsilon, setInitializationSteps KMeansModel - computeCost, k Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com> Closes apache#5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits: b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test 5fd3ced [Hrishikesh Subramonian] doc test corrections 20b3c68 [Hrishikesh Subramonian] python 3 fixes 4d4e695 [Hrishikesh Subramonian] added arguments in python tests 21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k