Skip to content

[SPARK-10809] [MLlib] Single-document topicDistributions method for LocalLDAModel #9484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Nov 5, 2015

jira: https://issues.apache.org/jira/browse/SPARK-10809

We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.

add some missing assert too.

@SparkQA
Copy link

SparkQA commented Nov 5, 2015

Test build #45087 has finished for PR 9484 at commit cb5d823.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @return (document ID, topic mixture distribution for document)
*/
@Since("1.6.0")
def topicDistributions(document: (Long, Vector)): (Long, Vector) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove the doc ID? It's not necessary for a single doc, and removing it will make this more Java-friendly.

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45202 has finished for PR 9484 at commit a175ab1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 11, 2015
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 11, 2015
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [apache/spark#9484], but I'll try to merge [apache/spark#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.
* to [[topicDistributions(documents: RDD[(Long, Vector)])]] to avoid overhead.
*
* @param document document to predict topic mixture distributions for
* @return (document ID, topic mixture distribution for document)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this line (no doc ID)

@jkbradley
Copy link
Member

@hhbyyh Sorry again for the delay, but we can get this merged now

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 7, 2016

@jkbradley It's quite all right. Thanks for reviewing. Update sent.

@SparkQA
Copy link

SparkQA commented Jan 7, 2016

Test build #48895 has finished for PR 9484 at commit 9204462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* literature). Returns a vector of zeros for an empty document.
*
* Note this means to allow quick query for single document. For batch documents, please refer
* to [[topicDistributions(documents: RDD[(Long, Vector)])]] to avoid overhead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Scala doc for this line is not generated correctly. Can you try removing the argument and just writing [[topicDistributions]] instead?

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 11, 2016

Sorry for the late response. Update sent

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49109 has finished for PR 9484 at commit 0481c44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 11, 2016

Getting many TimeoutException.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49124 has finished for PR 9484 at commit 0481c44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks!

@asfgit asfgit closed this in bbea888 Jan 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants