[SPARK-6612] [MLLib] [PySpark] Python KMeans parity #5647

Closed
wants to merge 5 commits into from

Conversation

FlytxtRnD
Contributor

The following items are added to Python kmeans:

kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
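
As a rough illustration of what these additions expose, here is a minimal pure-Python sketch. It does not use pyspark; `SketchKMeansModel`, `squared_distance`, and `centers_converged` are hypothetical names made up for this example. The reading of epsilon as a threshold on how far centers move between iterations is an assumption, not something stated in this PR.

```python
import math


def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length points.
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SketchKMeansModel:
    """Toy stand-in for a fitted k-means model (hypothetical, not pyspark)."""

    def __init__(self, centers):
        self.centers = [tuple(c) for c in centers]

    @property
    def k(self):
        # The number of clusters is the number of stored centers.
        return len(self.centers)

    def compute_cost(self, points):
        # Sum over all points of the squared distance to the nearest center.
        return sum(
            min(squared_distance(p, c) for c in self.centers) for p in points
        )


def centers_converged(old_centers, new_centers, epsilon):
    # Assumed meaning of epsilon: stop once no center moved farther than this.
    return all(
        math.sqrt(squared_distance(o, n)) <= epsilon
        for o, n in zip(old_centers, new_centers)
    )


# Same four points as the doc test in this PR; the centers are chosen by hand.
data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
model = SketchKMeansModel([(0.5, 0.5), (8.5, 8.5)])
cost = model.compute_cost(data)
print(model.k, cost)
```

With these hand-picked centers every point lies at squared distance 0.5 from its nearest center, so the cost comes out at about 2.0, in line with the value shown in the doc test.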

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30809 timed out for PR 5647 at commit 9903837 after a configured wait of 150m.

@FlytxtRnD
Contributor Author

Could anyone please tell us why this timeout has occurred?

@FlytxtRnD
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30912 has finished for PR 5647 at commit 9903837.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30986 has finished for PR 5647 at commit 9903837.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch removes the following dependencies:
    • RoaringBitmap-0.4.5.jar
    • activation-1.1.jar
    • akka-actor_2.10-2.3.4-spark.jar
    • akka-remote_2.10-2.3.4-spark.jar
    • akka-slf4j_2.10-2.3.4-spark.jar
    • aopalliance-1.0.jar
    • arpack_combined_all-0.1.jar
    • avro-1.7.7.jar
    • breeze-macros_2.10-0.11.2.jar
    • breeze_2.10-0.11.2.jar
    • chill-java-0.5.0.jar
    • chill_2.10-0.5.0.jar
    • commons-beanutils-1.7.0.jar
    • commons-beanutils-core-1.8.0.jar
    • commons-cli-1.2.jar
    • commons-codec-1.10.jar
    • commons-collections-3.2.1.jar
    • commons-compress-1.4.1.jar
    • commons-configuration-1.6.jar
    • commons-digester-1.8.jar
    • commons-httpclient-3.1.jar
    • commons-io-2.1.jar
    • commons-lang-2.5.jar
    • commons-lang3-3.3.2.jar
    • commons-math-2.1.jar
    • commons-math3-3.1.1.jar
    • commons-net-2.2.jar
    • compress-lzf-1.0.0.jar
    • config-1.2.1.jar
    • core-1.1.2.jar
    • curator-client-2.4.0.jar
    • curator-framework-2.4.0.jar
    • curator-recipes-2.4.0.jar
    • gmbal-api-only-3.0.0-b023.jar
    • grizzly-framework-2.1.2.jar
    • grizzly-http-2.1.2.jar
    • grizzly-http-server-2.1.2.jar
    • grizzly-http-servlet-2.1.2.jar
    • grizzly-rcm-2.1.2.jar
    • groovy-all-2.3.7.jar
    • guava-14.0.1.jar
    • guice-3.0.jar
    • hadoop-annotations-2.2.0.jar
    • hadoop-auth-2.2.0.jar
    • hadoop-client-2.2.0.jar
    • hadoop-common-2.2.0.jar
    • hadoop-hdfs-2.2.0.jar
    • hadoop-mapreduce-client-app-2.2.0.jar
    • hadoop-mapreduce-client-common-2.2.0.jar
    • hadoop-mapreduce-client-core-2.2.0.jar
    • hadoop-mapreduce-client-jobclient-2.2.0.jar
    • hadoop-mapreduce-client-shuffle-2.2.0.jar
    • hadoop-yarn-api-2.2.0.jar
    • hadoop-yarn-client-2.2.0.jar
    • hadoop-yarn-common-2.2.0.jar
    • hadoop-yarn-server-common-2.2.0.jar
    • ivy-2.4.0.jar
    • jackson-annotations-2.4.0.jar
    • jackson-core-2.4.4.jar
    • jackson-core-asl-1.8.8.jar
    • jackson-databind-2.4.4.jar
    • jackson-jaxrs-1.8.8.jar
    • jackson-mapper-asl-1.8.8.jar
    • jackson-module-scala_2.10-2.4.4.jar
    • jackson-xc-1.8.8.jar
    • jansi-1.4.jar
    • javax.inject-1.jar
    • javax.servlet-3.0.0.v201112011016.jar
    • javax.servlet-3.1.jar
    • javax.servlet-api-3.0.1.jar
    • jaxb-api-2.2.2.jar
    • jaxb-impl-2.2.3-1.jar
    • jcl-over-slf4j-1.7.10.jar
    • jersey-client-1.9.jar
    • jersey-core-1.9.jar
    • jersey-grizzly2-1.9.jar
    • jersey-guice-1.9.jar
    • jersey-json-1.9.jar
    • jersey-server-1.9.jar
    • jersey-test-framework-core-1.9.jar
    • jersey-test-framework-grizzly2-1.9.jar
    • jets3t-0.7.1.jar
    • jettison-1.1.jar
    • jetty-util-6.1.26.jar
    • jline-0.9.94.jar
    • jline-2.10.4.jar
    • jodd-core-3.6.3.jar
    • json4s-ast_2.10-3.2.10.jar
    • json4s-core_2.10-3.2.10.jar
    • json4s-jackson_2.10-3.2.10.jar
    • jsr305-1.3.9.jar
    • jtransforms-2.4.0.jar
    • jul-to-slf4j-1.7.10.jar
    • kryo-2.21.jar
    • log4j-1.2.17.jar
    • lz4-1.2.0.jar
    • management-api-3.0.0-b012.jar
    • mesos-0.21.0-shaded-protobuf.jar
    • metrics-core-3.1.0.jar
    • metrics-graphite-3.1.0.jar
    • metrics-json-3.1.0.jar
    • metrics-jvm-3.1.0.jar
    • minlog-1.2.jar
    • netty-3.8.0.Final.jar
    • netty-all-4.0.23.Final.jar
    • objenesis-1.2.jar
    • opencsv-2.3.jar
    • oro-2.0.8.jar
    • paranamer-2.6.jar
    • parquet-column-1.6.0rc3.jar
    • parquet-common-1.6.0rc3.jar
    • parquet-encoding-1.6.0rc3.jar
    • parquet-format-2.2.0-rc1.jar
    • parquet-generator-1.6.0rc3.jar
    • parquet-hadoop-1.6.0rc3.jar
    • parquet-jackson-1.6.0rc3.jar
    • protobuf-java-2.4.1.jar
    • protobuf-java-2.5.0-spark.jar
    • py4j-0.8.2.1.jar
    • pyrolite-2.0.1.jar
    • quasiquotes_2.10-2.0.1.jar
    • reflectasm-1.07-shaded.jar
    • scala-compiler-2.10.4.jar
    • scala-library-2.10.4.jar
    • scala-reflect-2.10.4.jar
    • scalap-2.10.4.jar
    • scalatest_2.10-2.2.1.jar
    • slf4j-api-1.7.10.jar
    • slf4j-log4j12-1.7.10.jar
    • snappy-java-1.1.1.6.jar
    • spark-bagel_2.10-1.4.0-SNAPSHOT.jar
    • spark-catalyst_2.10-1.4.0-SNAPSHOT.jar
    • spark-core_2.10-1.4.0-SNAPSHOT.jar
    • spark-graphx_2.10-1.4.0-SNAPSHOT.jar
    • spark-launcher_2.10-1.4.0-SNAPSHOT.jar
    • spark-mllib_2.10-1.4.0-SNAPSHOT.jar
    • spark-network-common_2.10-1.4.0-SNAPSHOT.jar
    • spark-network-shuffle_2.10-1.4.0-SNAPSHOT.jar
    • spark-repl_2.10-1.4.0-SNAPSHOT.jar
    • spark-sql_2.10-1.4.0-SNAPSHOT.jar
    • spark-streaming_2.10-1.4.0-SNAPSHOT.jar
    • spire-macros_2.10-0.7.4.jar
    • spire_2.10-0.7.4.jar
    • stax-api-1.0.1.jar
    • stream-2.7.0.jar
    • tachyon-0.5.0.jar
    • tachyon-client-0.5.0.jar
    • uncommons-maths-1.2.2a.jar
    • unused-1.0.0.jar
    • xmlenc-0.52.jar
    • xz-1.0.jar
    • zookeeper-3.4.5.jar

@jkbradley
Member

@FlytxtRnD Checking in: What's the status of fixing the tests? Thanks! It will be great to get this merged for 1.4 (deadline this Friday!)

Btw, the earlier test timeout was from Jenkins having a bunch of issues, but hopefully those are resolved now.

@FlytxtRnD
Contributor Author

@jkbradley, we are facing some issues with Python 3 support. We are working on it and will fix it as soon as possible.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31114 has finished for PR 5647 at commit 7ecfd00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -95,6 +106,13 @@ def predict(self, x):
                best_distance = distance
        return best

    def computeCost(self, rdd):
        """Return the K-means cost (sum of squared distances of points to their nearest center) for this
Member

Python style: start long docstrings on a new line and keep docstring lines to at most 72 characters, following PEP 8:

"""
Return the K-means cost (sum of squared distances of points to their
nearest center) for this model on the given data.
"""
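
For illustration only, a small sketch of how the suggested wrapping could be reproduced with the standard library's `textwrap` module. The names `DOC_WIDTH`, `summary`, and `wrapped_lines` are made up for this example, and it ignores the indentation the docstring has inside the method, which also counts toward line length in real code.

```python
import textwrap

DOC_WIDTH = 72  # PEP 8's suggested limit for docstring and comment lines

summary = (
    "Return the K-means cost (sum of squared distances of points to their "
    "nearest center) for this model on the given data."
)

# Break the one-line summary into lines no longer than DOC_WIDTH characters.
wrapped_lines = textwrap.wrap(summary, width=DOC_WIDTH)
for line in wrapped_lines:
    print(line)
```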

@jkbradley
Member

@FlytxtRnD That tiny issue is the only one I see. After that, I think this PR will be good to go. Thanks!

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31233 has finished for PR 5647 at commit d6d3a09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -40,11 +40,16 @@ class KMeansModel(Saveable, Loader):

 >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
 >>> model = KMeans.train(
-... sc.parallelize(data), 2, maxIterations=10, runs=30, initializationMode="random")
+... sc.parallelize(data), 2, maxIterations=10, runs=30, initializationMode="random",
+... seed=None, initializationSteps=5, epsilon=1e-4)
Member

I'm sorry about this, but I was hoping for one more update. Could you please set the seed to a fixed number in this and the other call to train() in this doc test? The test should be deterministic for stability. (It looks like this was a problem before your PR too.) Thanks!
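
To illustrate why a fixed seed matters for a doc test, here is a minimal sketch in plain Python rather than pyspark. `pick_initial_centers` is a hypothetical helper that only mimics random initialization; it is not how MLlib picks its centers.

```python
import random


def pick_initial_centers(points, k, seed=None):
    # A dedicated Random instance keeps the draw independent of global state.
    # With seed=None the generator is seeded from the OS or the clock, so the
    # chosen centers can differ from run to run.
    rng = random.Random(seed)
    return rng.sample(points, k)


data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]

# The same fixed seed always yields the same initial centers.
first = pick_initial_centers(data, 2, seed=10)
second = pick_initial_centers(data, 2, seed=10)
print(first == second)
```

A test whose expected output depends on the initial centers is only reproducible when the seed is pinned like this.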

@jkbradley
Member

After the seed update, it really will be it. Thanks!

>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0000000000000004
>>> model = KMeans.train(sc.parallelize(data), 2)
Contributor Author

@jkbradley , It seems we are not using this model anywhere. Did you mean to add the seed here too?

Member

If it's not used anywhere, then you can leave it as is. Thanks!

Contributor Author

@jkbradley , Shall we remove that line?

Member

No, let's keep it

@SparkQA

SparkQA commented Apr 30, 2015

Test build #31382 has finished for PR 5647 at commit 0319821.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@FlytxtRnD
Contributor Author

@jkbradley , could you please check if this is ready to merge?

@jkbradley
Member

LGTM. I'm rerunning the tests once more for safety's sake and will then merge this. Thank you!

@SparkQA

SparkQA commented May 4, 2015

Test build #758 has finished for PR 5647 at commit 0319821.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

unrelated failure...

@SparkQA

SparkQA commented May 4, 2015

Test build #763 has finished for PR 5647 at commit 0319821.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2015

Test build #764 has finished for PR 5647 at commit 0319821.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2015

Test build #31828 has finished for PR 5647 at commit 8aac002.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol
    • case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes
    • case class And(left: Expression, right: Expression)
    • case class Or(left: Expression, right: Expression)
    • abstract class BinaryComparison extends BinaryExpression with Predicate
    • trait StringRegexExpression extends ExpectsInputTypes
    • trait CaseConversionExpression extends ExpectsInputTypes
    • case class Substring(str: Expression, pos: Expression, len: Expression)

@jkbradley
Member

@FlytxtRnD The update confused GitHub. Can you please try closing and re-opening this PR to force GitHub to recompute the diff? Thanks!

@FlytxtRnD
Contributor Author

@jkbradley ok, will close and reopen this. Could you please tell us why this happened?

FlytxtRnD closed this May 5, 2015
FlytxtRnD reopened this May 5, 2015
@jkbradley
Member

Hm, that didn't seem to fix it. It looks like you merged this branch with the current master; sometimes, that confuses GitHub. In general, you shouldn't need to merge with master unless Jenkins posts a notice that the PR can't be merged cleanly.

Can you please try rebasing your branch off of the current master? Perhaps that will fix it.

@FlytxtRnD
Contributor Author

@jkbradley, we merged it with 'branch-1.4' to get rid of the failed MiMa tests in Jenkins.

@FlytxtRnD
Contributor Author

@jkbradley "Can you please try rebasing your branch off of the current master? Perhaps that will fix it."
We didn't quite follow. Which branch should we rebase onto?

@jkbradley
Member

Looking at those logs, I don't think the MIMA tests actually failed. Basing this off of "master" should be fine. Could you please try that?

If that does not work, then perhaps you could identify your commit hashes, reset this branch to master, cherry pick your commits, and then force push to your remote branch to update this PR.

@FlytxtRnD
Contributor Author

@jkbradley , ok we'll try your steps.

@SparkQA

SparkQA commented May 5, 2015

Test build #31851 has finished for PR 5647 at commit b9e451b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FlytxtRnD
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented May 5, 2015

Test build #31865 has finished for PR 5647 at commit b9e451b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class ShuffleHandle(val shuffleId: Int) extends Serializable
    • * class SomethingNotSerializable
    • logDebug(s" + cloning the object $obj of class $
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
    • class SaslEncryption
    • static class EncryptedMessage extends AbstractReferenceCounted implements FileRegion
    • class SaslRpcHandler extends RpcHandler
    • public class SaslServerBootstrap implements TransportServerBootstrap
    • public class SparkSaslClient implements SaslEncryptionBackend
    • public class SparkSaslServer implements SaslEncryptionBackend
    • public class ByteArrayWritableChannel implements WritableByteChannel
    • class ParamGridBuilder(object):
    • abstract class Dialect
    • class DialectException(msg: String, cause: Throwable) extends Exception(msg, cause)
    • case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes
    • case class And(left: Expression, right: Expression)
    • case class Or(left: Expression, right: Expression)
    • abstract class BinaryComparison extends BinaryExpression with Predicate
    • trait StringRegexExpression extends ExpectsInputTypes
    • trait CaseConversionExpression extends ExpectsInputTypes
    • case class Substring(str: Expression, pos: Expression, len: Expression)
    • case class HiveDatabase(
    • abstract class TableType
    • case class HiveStorageDescriptor(
    • case class HivePartition(
    • case class HiveColumn(name: String, hiveType: String, comment: String)
    • case class HiveTable(
    • trait ClientInterface
    • class ClientWrapper(
    • class IsolatedClientLoader(
    • protected trait ReflectionMagic
    • protected implicit class InstanceMagic(a: Any)
    • protected implicit class StaticMagic(c: Class[_])

asfgit pushed a commit that referenced this pull request May 5, 2015
The following items are added to Python kmeans:

kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k

Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>

Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:

b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.

(cherry picked from commit 5995ada)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit closed this in 5995ada May 5, 2015
@mengxr
Contributor

mengxr commented May 5, 2015

Merged into master and branch-1.4. Thanks!

@FlytxtRnD
Contributor Author

@jkbradley , @mengxr Thanks for the help!

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015