[SPARK-10809] [MLlib] Single-document topicDistributions method for LocalLDAModel #9484
Test build #45087 has finished for PR 9484 at commit
 * @return (document ID, topic mixture distribution for document)
 */
@Since("1.6.0")
def topicDistributions(document: (Long, Vector)): (Long, Vector) = {
Can you please remove the doc ID? It's not necessary for a single doc, and removing it will make this more Java-friendly.
Test build #45202 has finished for PR 9484 at commit
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
 * to [[topicDistributions(documents: RDD[(Long, Vector)])]] to avoid overhead.
 *
 * @param document document to predict topic mixture distributions for
 * @return (document ID, topic mixture distribution for document)
Update this line (no doc ID)
@hhbyyh Sorry again for the delay, but we can get this merged now.
@jkbradley It's quite all right. Thanks for reviewing. Update sent.
Test build #48895 has finished for PR 9484 at commit
 * literature). Returns a vector of zeros for an empty document.
 *
 * Note this means to allow quick query for single document. For batch documents, please refer
 * to [[topicDistributions(documents: RDD[(Long, Vector)])]] to avoid overhead.
The Scala doc for this line is not generated correctly. Can you try removing the argument and just writing [[topicDistributions]] instead?
Sorry for the late response. Update sent.
Jenkins, retest this please.
Test build #49109 has finished for PR 9484 at commit
Getting many TimeoutExceptions.
Test build #49124 has finished for PR 9484 at commit
LGTM
JIRA: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow quick queries that avoid RDD operations. Currently, the user must use an RDD of documents.
Also adds some missing asserts.
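As a rough illustration of the API shape discussed in this PR, the sketch below shows a single-document overload with no document ID next to a batch overload that keeps IDs. This is not Spark's implementation: `ToyTopicModel`, its `topicWord` table, and the crude scoring rule are all hypothetical stand-ins chosen so the example runs without Spark, and plain arrays stand in for `Vector` and `RDD`.

```scala
// Toy stand-in for a local topic model. NOT Spark's LocalLDAModel.
// topicWord(k)(w) is an assumed weight of word w under topic k.
final class ToyTopicModel(topicWord: Array[Array[Double]]) {
  private val k = topicWord.length

  /** Single-document query: no document ID, plain array in, plain array out. */
  def topicDistributions(document: Array[Double]): Array[Double] = {
    // Crude scoring (an assumption, not variational inference):
    // each topic's score is the count-weighted sum of its word weights.
    val scores = topicWord.map { row =>
      row.zip(document).map { case (weight, count) => weight * count }.sum
    }
    val total = scores.sum
    // An empty document (all zero counts) yields a vector of zeros.
    if (total == 0.0) Array.fill(k)(0.0) else scores.map(_ / total)
  }

  /** Batch query: IDs are kept so results can be matched back to inputs. */
  def topicDistributions(documents: Seq[(Long, Array[Double])]): Seq[(Long, Array[Double])] =
    documents.map { case (id, counts) => (id, topicDistributions(counts)) }
}

val model = new ToyTopicModel(Array(Array(0.9, 0.1), Array(0.2, 0.8)))
val single = model.topicDistributions(Array(1.0, 0.0))
val batch = model.topicDistributions(Seq((7L, Array(1.0, 0.0))))
```

Dropping the ID from the single-document overload means a caller passes and receives one plain value instead of building and unpacking a tuple, which is the Java-friendliness point raised in the review.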