[SPARK-5692] [MLlib] Word2Vec save/load #5291

Closed
wants to merge 2 commits into from

Conversation

MechCoder
Contributor

Word2Vec model now supports saving and loading.

a] The metadata, stored in JSON format, consists of "version", "classname", "vectorSize", and "numWords".
b] The data, stored in Parquet format, is an array of rows, each with two columns: the word (a String) and its vector (an Array of Floats).
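
As a rough illustration of the layout described above, the following sketch models the two pieces in plain Scala without any Spark dependency. The `Data` case class matches the one listed in the CI report for this PR; the object `Word2VecLayoutSketch`, its helpers `metadataJson`, `toRows`, and `fromRows`, and the sample metadata values are hypothetical and are not taken from the patch.

```scala
// Row type as named in the CI report: one word and its vector.
case class Data(word: String, vector: Array[Float])

// Hypothetical illustration object; not part of the patch.
object Word2VecLayoutSketch {
  // Renders the four metadata fields as a single JSON object.
  // The values passed in are placeholders, not what Spark actually writes.
  def metadataJson(version: String, classname: String, vectorSize: Int, numWords: Int): String =
    s"""{"version":"$version","classname":"$classname","vectorSize":$vectorSize,"numWords":$numWords}"""

  // Flattens a word -> vector map into rows, one row per word.
  def toRows(vectors: Map[String, Array[Float]]): Seq[Data] =
    vectors.toSeq.map { case (word, vector) => Data(word, vector) }

  // Rebuilds the word -> vector map from rows, as a loader would.
  def fromRows(rows: Seq[Data]): Map[String, Array[Float]] =
    rows.map(d => d.word -> d.vector).toMap

  def main(args: Array[String]): Unit = {
    val vectors = Map("spark" -> Array(0.1f, 0.2f), "mllib" -> Array(0.3f, 0.4f))
    val rows = toRows(vectors)
    println(metadataJson("1.0", "Word2VecModel", vectorSize = 2, numWords = rows.size))
    println(fromRows(rows).keySet == vectors.keySet)
  }
}
```

In the actual patch the rows would presumably be written to and read from Parquet through Spark SQL; this sketch only shows the shape of the data.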

@MechCoder
Contributor Author

ping @mengxr @jkbradley

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29473 has started for PR 5291 at commit d17cd8c.

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29473 has finished for PR 5291 at commit d17cd8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(word: String, vector: Array[Float])
  • This patch does not change any dependencies.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29473/

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29476 has started for PR 5291 at commit bfe4c39.

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29476 has finished for PR 5291 at commit bfe4c39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(word: String, vector: Array[Float])
    • case class CreateStruct(children: Seq[NamedExpression]) extends Expression
  • This patch does not change any dependencies.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29476/

try {
  model.save(sc, path)
  val sameModel = Word2VecModel.load(sc, path)
  assert(sameModel.getVectors.keys === model.getVectors.keys)
Contributor

The ordering of keys is not guaranteed. We could test map equality directly:

assert(sameModel.getVectors.mapValues(_.toSeq) === model.getVectors.mapValues(_.toSeq))

Contributor Author

I see. Why does this not work for a raw array?
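
For context on the question above: Scala arrays are JVM arrays, so `==` on them compares references rather than contents, and two maps whose values are distinct arrays with the same elements are not equal. Converting each value with `toSeq` gives element-wise comparison. The sketch below is a minimal stand-alone illustration; the object `ArrayEqualitySketch` and the helper `comparable` are hypothetical names, and `map` is used here in place of the `mapValues` call from the review comment so that the result is a strict `Map`.

```scala
// Hypothetical illustration object; not part of the patch.
object ArrayEqualitySketch {
  // Converts each Array value to a Seq so that map equality compares contents.
  def comparable(m: Map[String, Array[Float]]): Map[String, Seq[Float]] =
    m.map { case (word, vector) => word -> vector.toSeq }

  def main(args: Array[String]): Unit = {
    val saved  = Map("a" -> Array(1.0f, 2.0f))
    val loaded = Map("a" -> Array(1.0f, 2.0f))

    // Arrays compare by reference, so the maps differ despite equal contents.
    println(saved == loaded) // false

    // Seq compares element by element, so the converted maps are equal.
    println(comparable(saved) == comparable(loaded)) // true
  }
}
```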

@jkbradley
Member

@MechCoder Could you please add a quick description to the PR for record-keeping? Thanks!

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29495 has started for PR 5291 at commit 1142f3a.

@MechCoder
Contributor Author

@mengxr I have addressed your comments, and @jkbradley I have updated the PR description.

@SparkQA
SparkQA commented Mar 31, 2015

Test build #29495 has finished for PR 5291 at commit 1142f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(word: String, vector: Array[Float])
  • This patch does not change any dependencies.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29495/

@mengxr
Contributor

mengxr commented Mar 31, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 0e00f12 Mar 31, 2015
@MechCoder MechCoder deleted the spark-5692 branch April 1, 2015 04:22
5 participants