[SPARK-9888][MLlib]User guide for new LDA features #8254

Closed
wants to merge 3 commits into from
135 changes: 116 additions & 19 deletions docs/mllib-clustering.md
@@ -438,28 +438,125 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")
is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
on a statistical model of how text documents are generated.

LDA takes in a collection of documents as vectors of word counts.
It supports different inference algorithms via the `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:

* Topics: Inferred topics, each of which is a probability distribution over terms (words).
* Topic distributions for documents: For each non-empty document in the training set, LDA gives a probability distribution over topics (EM only). Note that topic distributions are not created for empty documents.
* Topics correspond to cluster centers, and documents correspond to
Member


FYI, for the future, try not to change formatting in Markdown unnecessarily since it makes reviewing harder. There aren't style guidelines for Markdown.

Contributor Author


OK

examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature
vectors are vectors of word counts (bag of words).
* Rather than estimating a clustering using a traditional distance, LDA
uses a function based on a statistical model of how text documents are
generated.

LDA supports different inference algorithms via the `setOptimizer` function.
`EMLDAOptimizer` learns clustering using
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while
`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
variational
inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
and is generally memory friendly.

LDA takes the following parameters:
LDA takes in a collection of documents as vectors of word counts and the
following parameters (set using the builder pattern; a short configuration
sketch follows this list):

* `k`: Number of topics (i.e., cluster centers)
* `maxIterations`: Limit on the number of iterations of EM used for learning
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.

*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
support prediction on new documents, and it does not have a Python API. These will be added in the future.
* `optimizer`: Optimizer to use for learning the LDA model, either
`EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents'
distributions over topics. Larger values encourage smoother inferred
distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics'
distributions over terms (words). Larger values encourage smoother
inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark
configuration), this parameter specifies the frequency with which
checkpoints will be created. If `maxIterations` is large, using
checkpointing can help reduce shuffle file sizes on disk and help with
failure recovery.
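
As a rough sketch (the values below are placeholders, not recommendations),
a builder-style configuration might look like this:

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Placeholder parameter values, shown only to illustrate the builder-style setters.
val lda = new LDA()
  .setK(10)                                // number of topics
  .setDocConcentration(-1)                 // -1 selects the optimizer-specific default prior
  .setTopicConcentration(-1)               // -1 selects the optimizer-specific default prior
  .setMaxIterations(50)                    // iteration limit
  .setCheckpointInterval(10)               // only takes effect if checkpointing is enabled
  .setOptimizer(new OnlineLDAOptimizer())  // or: new EMLDAOptimizer
```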


All of MLlib's LDA models support the following (a brief usage sketch follows this list):

* `describeTopics`: Returns topics as arrays of most important terms and
term weights
* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
is a topic
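
For illustration, a minimal sketch of these two calls, assuming `ldaModel`
is an already-fitted model of either kind:

```scala
// Minimal sketch: `ldaModel` is assumed to be an already-fitted LDAModel.
val topicsMat = ldaModel.topicsMatrix                       // vocabSize x k matrix; column j is topic j
val topics = ldaModel.describeTopics(maxTermsPerTopic = 5)  // Array of (termIndices, termWeights), one per topic
topics.zipWithIndex.foreach { case ((terms, weights), i) =>
  println(s"Topic $i: " + terms.zip(weights).map { case (t, w) => s"$t:$w" }.mkString(", "))
}
```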

*Note*: LDA is still an experimental feature under active development.
As a result, certain features are only available in one of the two
optimizers / models generated by the optimizer. Currently, a distributed
model can be converted into a local model, but not vice-versa.

The following discussion will describe each optimizer/model pair
separately.

**Expectation Maximization**

Implemented in
[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
and
[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).

For the parameters provided to `LDA`:

* `docConcentration`: Only symmetric priors are supported, so all values
in the provided `k`-dimensional vector must be identical. All values
must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
(uniform `k`-dimensional vector with value $(50 / k) + 1$).
* `topicConcentration`: Only symmetric priors are supported. Values must be
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: The maximum number of EM iterations.

`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
`DistributedLDAModel` supports the following (see the sketch after this list):

* `topTopicsPerDocument`: The top topics and their weights for
each document in the training corpus
* `topDocumentsPerTopic`: The top documents for each topic and
the corresponding weight of the topic in the documents.
* `logPrior`: log probability of the estimated topics and
document-topic distributions given the hyperparameters
`docConcentration` and `topicConcentration`
* `logLikelihood`: log likelihood of the training corpus, given the
inferred topics and document-topic distributions
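
A minimal sketch of the EM path, assuming `corpus: RDD[(Long, Vector)]`
holds (document ID, term-count vector) pairs:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Assumes `corpus: RDD[(Long, Vector)]` of (document ID, term-count vector) pairs.
val emModel = new LDA()
  .setK(10)
  .setOptimizer(new EMLDAOptimizer)
  .setMaxIterations(20)
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]  // EMLDAOptimizer always yields a DistributedLDAModel

val topTopics = emModel.topTopicsPerDocument(3)  // RDD of (docID, topic indices, topic weights)
val topDocs = emModel.topDocumentsPerTopic(5)    // one (docIDs, docWeights) entry per topic
println(s"logPrior = ${emModel.logPrior}, logLikelihood = ${emModel.logLikelihood}")

val asLocal = emModel.toLocal                    // distributed -> local conversion (one-way)
```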

**Online Variational Bayes**

Implemented in
[`OnlineLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/OnlineLDAOptimizer.html)
and
[`LocalLDAModel`](api/scala/org/apache/spark/mllib/clustering/LocalLDAModel.html).

For the parameters provided to `LDA`:

* `docConcentration`: Asymmetric priors can be used by passing in a
vector with values equal to the Dirichlet parameter in each of the `k`
dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
default behavior (uniform `k`-dimensional vector with value $(1.0 / k)$).
* `topicConcentration`: Only symmetric priors are supported. Values must be
$\geq 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
* `maxIterations`: Maximum number of minibatches to submit.

In addition, `OnlineLDAOptimizer` accepts the following parameters:

* `miniBatchFraction`: Fraction of corpus sampled and used at each
iteration
* `optimizeDocConcentration`: If set to true, performs maximum-likelihood
estimation of the hyperparameter `docConcentration` (aka `alpha`)
after each minibatch and sets the optimized `docConcentration` in the
returned `LocalLDAModel`
* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations.

`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the
inferred topics. A `LocalLDAModel` supports the following (see the sketch
after this list):

* `logLikelihood(documents)`: Calculates a lower bound on the log likelihood
of the provided `documents` given the inferred topics.
* `logPerplexity(documents)`: Calculates an upper bound on the
perplexity of the provided `documents` given the inferred topics.
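
A minimal sketch of the online path, again with placeholder hyperparameters
and the same assumed `corpus: RDD[(Long, Vector)]`:

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

// Placeholder hyperparameters; assumes `corpus: RDD[(Long, Vector)]` as in the EM sketch.
val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.05)          // fraction of the corpus sampled per minibatch
  .setOptimizeDocConcentration(true)   // re-estimate docConcentration (alpha) after each minibatch
  .setTau0(1024.0)                     // learning-rate decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

val localModel = new LDA()
  .setK(10)
  .setOptimizer(onlineOptimizer)
  .setMaxIterations(100)               // number of minibatches to submit
  .run(corpus)
  .asInstanceOf[LocalLDAModel]         // OnlineLDAOptimizer yields a LocalLDAModel

val lowerBound = localModel.logLikelihood(corpus)   // lower bound on log likelihood of `corpus`
val upperBound = localModel.logPerplexity(corpus)   // upper bound on perplexity of `corpus`
```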

**Examples**

@@ -420,7 +420,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
}
val topicsMat = Matrices.fromBreeze(brzTopics)

// TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940
new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape)
}
}
@@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
// Train a model
val lda = new LDA()
lda.setK(k)
.setOptimizer(new EMLDAOptimizer)
.setDocConcentration(topicSmoothing)
.setTopicConcentration(termSmoothing)
.setMaxIterations(5)