LDA assumes that each document is generated as a mixture of topics, and each topic is characterized by a distribution over words. The generative process follows these steps (a code sketch follows the list):
- Sample document length: The number of words ( n ) in a document is drawn from a Poisson distribution: $$ n \sim \text{Poisson}(\lambda) $$ This assumption is not critical to the rest of the model; it mainly simplifies the generative story.
- Sample document-topic proportions: A document's topic proportions ( \theta_d ) are drawn from a Dirichlet distribution with parameter ( \alpha ): $$ \theta_d \sim \text{Dirichlet}(\alpha) $$ A lower ( \alpha ) leads to sparsity (fewer topics per document), while a higher ( \alpha ) results in a more uniform mix.
- Assign topics to words: Each word position is assigned a topic ( z_{d,n} ) drawn from a multinomial distribution parameterized by ( \theta_d ): $$ z_{d,n} \sim \text{Multinomial}(\theta_d) $$
- Sample words from topics: Given the assigned topic, the word is drawn from the corresponding topic-word distribution ( \phi_{z_{d,n}} ): $$ w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}}) $$
- Topic-word distributions: Each topic ( k ) has a distribution over words, drawn from a Dirichlet prior with parameter ( \beta ): $$ \phi_k \sim \text{Dirichlet}(\beta) $$
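As a concrete illustration, here is a minimal NumPy sketch of this generative process. All sizes and hyperparameter values below are arbitrary choices for the example, not values prescribed by LDA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (arbitrary for this sketch)
K, V, D = 4, 50, 10              # topics, vocabulary size, documents
alpha, beta, lam = 0.1, 0.01, 80

# Topic-word distributions: phi_k ~ Dirichlet(beta), one row per topic
phi = rng.dirichlet(np.full(V, beta), size=K)        # shape (K, V)

corpus = []
for d in range(D):
    n = rng.poisson(lam)                             # document length
    theta = rng.dirichlet(np.full(K, alpha))         # topic proportions theta_d
    z = rng.choice(K, size=n, p=theta)               # topic assignment per word
    words = np.array([rng.choice(V, p=phi[k]) for k in z])  # word draws
    corpus.append(words)
```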
Since the posterior distribution of the latent variables given the observed data is intractable, approximate inference methods are used. Two primary approaches are Gibbs Sampling and Variational Inference.
Gibbs Sampling iteratively estimates topic assignments by sampling each topic conditioned on all other assignments. At each iteration, the conditional probability for assigning topic ( k ) to word ( w ) in document ( d ) is computed using:
$$ P(z_{d,n} = k \mid z_{\neg(d,n)}, w) \propto \frac{n_{k,w} + \beta}{n_k + V\beta} \cdot \frac{n_{d,k} + \alpha}{n_d + K\alpha} $$
where all counts exclude the current word's assignment, and:
- ( n_{k,w} ) = count of word ( w ) in topic ( k )
- ( n_k ) = total words assigned to topic ( k )
- ( n_{d,k} ) = number of words in document ( d ) assigned to topic ( k )
- ( n_d ) = total words in document ( d )
- ( V ) = vocabulary size
- ( K ) = number of topics
Steps in Gibbs Sampling:
- Remove current word-topic assignment.
- Compute probabilities for all topics.
- Sample a new topic assignment.
- Update count matrices.
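A compact sketch of one collapsed Gibbs sweep over the count matrices might look like the following. The array names (n_kw, n_k, n_dk, n_d) are hypothetical and mirror the counts defined above; corpus is a list of word-id arrays and z holds the current topic assignments.

```python
import numpy as np

def gibbs_sweep(corpus, z, n_kw, n_k, n_dk, n_d, alpha, beta, rng):
    """One collapsed Gibbs sweep; corpus[d][i] is a word id, z[d][i] its topic."""
    K, V = n_kw.shape
    for d, doc in enumerate(corpus):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # 1. Remove the current word-topic assignment from all counts
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            n_dk[d, k_old] -= 1
            n_d[d] -= 1
            # 2. Compute the (unnormalized) conditional probability for every topic
            p = (n_kw[:, w] + beta) / (n_k + V * beta) \
                * (n_dk[d] + alpha) / (n_d[d] + K * alpha)
            # 3. Sample a new topic assignment
            k_new = rng.choice(K, p=p / p.sum())
            # 4. Update the count matrices
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
            n_dk[d, k_new] += 1
            n_d[d] += 1
            z[d][i] = k_new
    return z
```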
Variational methods approximate the posterior distribution with a factorized (mean-field) distribution and optimize the evidence lower bound (ELBO). The approximation is:
$$ q(\theta, z, \phi) = \prod_{k} q(\phi_k \mid \lambda_k) \prod_{d} q(\theta_d \mid \gamma_d) \prod_{n} q(z_{d,n} \mid \tilde{\phi}_{d,n}) $$
where ( q(\phi_k) ) and ( q(\theta_d) ) are Dirichlet and ( q(z_{d,n}) ) is multinomial, with variational parameters ( \lambda_k ), ( \gamma_d ), and ( \tilde{\phi}_{d,n} ). To optimize the ELBO, iterative coordinate-ascent updates are applied:
$$ \tilde{\phi}_{d,n,k} \propto \exp\left\{ \mathbb{E}_q[\log \theta_{d,k}] + \mathbb{E}_q[\log \phi_{k,w_{d,n}}] \right\}, \qquad \gamma_{d,k} = \alpha + \sum_{n} \tilde{\phi}_{d,n,k}, \qquad \lambda_{k,w} = \beta + \sum_{d,n:\, w_{d,n}=w} \tilde{\phi}_{d,n,k} $$
where the expectations reduce to digamma functions of the variational Dirichlet parameters.
Variational inference scales well but introduces approximation error.
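In practice these updates are rarely implemented by hand; for instance, scikit-learn's LatentDirichletAllocation uses variational Bayes. A minimal usage sketch, with made-up toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stocks fell as markets slid",
    "investors traded bonds and stocks",
]

X = CountVectorizer().fit_transform(docs)            # document-term count matrix
lda = LatentDirichletAllocation(
    n_components=2,          # K topics
    doc_topic_prior=0.1,     # alpha
    topic_word_prior=0.01,   # beta
    learning_method="batch", # full-batch variational Bayes
    random_state=0,
)
theta = lda.fit_transform(X)                         # per-document topic proportions
```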
A key aspect of LDA's inference process is the conjugate relationship between the Dirichlet and multinomial distributions. Given the prior:
$$ \phi_k \sim \text{Dirichlet}(\beta) $$
After observing word counts ( n_{k,w} ) for topic ( k ), the posterior is again a Dirichlet:
$$ \phi_k \mid \{ n_{k,w} \} \sim \text{Dirichlet}(\beta + n_{k,1}, \ldots, \beta + n_{k,V}) $$
This conjugacy simplifies computation and allows efficient sampling.
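A quick numerical illustration of this update; the vocabulary size and counts below are made up:

```python
import numpy as np

beta = 0.01
n_kw = np.array([12, 7, 0, 3, 1])     # hypothetical word counts for topic k (V = 5)

# Dirichlet-multinomial conjugacy: posterior parameters = prior + observed counts
posterior_params = beta + n_kw

rng = np.random.default_rng(0)
phi_k_sample = rng.dirichlet(posterior_params)       # one posterior draw of phi_k
posterior_mean = posterior_params / posterior_params.sum()
```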
LDA operates in a high-dimensional simplex space, where each document is a point inside a topic simplex. For ( K ) topics, the representation exists in a ( (K-1) )-dimensional space.
- 2D Simplex: Triangle (3 topics). Each vertex represents a pure topic, and points inside represent mixtures of the three topics.
- 3D Simplex: Tetrahedron (4 topics). Documents cluster near edges or faces depending on topic dominance.
Geometric representation: In the ( K )-dimensional topic space, each topic corresponds to a basis vector, e.g. topic 1 = ( (1, 0, \ldots, 0) ), topic 2 = ( (0, 1, \ldots, 0) ), and a document's proportions ( \theta_d ) are a convex combination of these vertices. Growing the vocabulary, by contrast, increases the dimensionality of the word simplex in which the topic-word distributions ( \phi_k ) live. Simplex projections help visualize topic mixtures.
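As a small sketch of such a projection, the snippet below maps 3-topic proportions (points in the 2-simplex) to 2D coordinates using the triangle's vertices; the example proportions are made up.

```python
import numpy as np

# Vertices of an equilateral triangle: the three "pure topic" corners
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

# Hypothetical topic proportions for a few documents (rows sum to 1)
theta = np.array([
    [0.90, 0.05, 0.05],   # dominated by topic 1 -> plots near vertex 1
    [0.10, 0.10, 0.80],   # dominated by topic 3 -> plots near vertex 3
    [1/3, 1/3, 1/3],      # balanced mixture   -> plots near the centroid
])

xy = theta @ vertices      # barycentric coordinates -> 2D points inside the triangle
print(xy)
```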
Some directions worth exploring further:
- Why Dirichlet priors work for LDA: understanding the role of priors in sparsity.
- High-dimensional intuition: visualizing LDA in more than 3D.
- Alternative topic models: exploring PLSA, HDP, and neural topic models.
- Evaluating topic coherence: metrics beyond perplexity.
- Online LDA: extending LDA for streaming data.