Latent Dirichlet Allocation & Simplex Geometry

LDA generative process

LDA assumes that each document is generated by first choosing a mixture of topics, where each topic is a distribution over words. The generative process follows these steps (a simulation sketch follows the list):

  1. Sample document length:
    The number of words ( n ) in a document is drawn from a Poisson distribution:

    $$ n \sim \text{Poisson}(\lambda) $$

    The document length is treated as independent of the topic structure; this is a simplifying assumption that does not affect inference over topics.

  2. Sample document-topic proportions:
    A document’s topic proportions θ_d are drawn from a Dirichlet distribution with parameter α:

    $$ \theta_d \sim \text{Dirichlet}(\alpha) $$

    A lower α encourages sparsity (few topics per document), while a higher α yields a more uniform topic mix.

  3. Assign topics to words:
    Each word is assigned a topic z_{d,n} from a multinomial distribution parameterized by θ_d:

    $$ z_{d,n} \sim \text{Multinomial}(\theta_d) $$

  4. Sample words from topics:
    Given the assigned topic, the word is drawn from the corresponding topic-word distribution φ_k:

    $$ w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}}) $$

  5. Topic-word distributions:
    Each topic k has a distribution over words, drawn from a Dirichlet prior with parameter β:

    $$ \phi_k \sim \text{Dirichlet}(\beta) $$
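
Taken together, the five steps can be simulated in a few lines. Below is a minimal sketch using NumPy; the number of topics, vocabulary size, Poisson mean, and prior values are illustrative assumptions rather than values fixed by the model.

```python
# Minimal simulation of the LDA generative process described above.
# K, V, lambda_ and the prior values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 1000           # number of topics, vocabulary size (assumed)
alpha = np.full(K, 0.1)  # sparse document-topic prior
beta = np.full(V, 0.01)  # sparse topic-word prior
lambda_ = 50             # Poisson mean for document length

# Step 5 (done once, shared by all documents): topic-word distributions.
phi = rng.dirichlet(beta, size=K)            # shape (K, V)

def generate_document():
    n = rng.poisson(lambda_)                 # step 1: document length
    theta = rng.dirichlet(alpha)             # step 2: topic proportions
    z = rng.choice(K, size=n, p=theta)       # step 3: per-word topic assignments
    words = np.array([rng.choice(V, p=phi[k]) for k in z])  # step 4: words
    return words, z, theta

words, z, theta = generate_document()
print(theta.round(3), np.bincount(z, minlength=K))
```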


Inference techniques in LDA

Since the posterior distribution of the latent variables given the observed data is intractable, approximate inference methods are used. Two primary approaches are Gibbs Sampling and Variational Inference.

Gibbs Sampling (MCMC-based)

Gibbs Sampling iteratively estimates topic assignments by sampling each topic conditioned on all other assignments. At each iteration, the conditional probability for assigning a topic to a word is computed using:

$$ P(z_{d,n} = k | \text{rest}) \propto \frac{n_{k,w} + \beta}{n_k + V\beta} \cdot \frac{n_{d,k} + \alpha}{n_d + K\alpha} $$

where (all counts exclude the current assignment of w_{d,n}):

  • ( n_{k,w} ) = count of word ( w ) in topic ( k )
  • ( n_k ) = total words assigned to topic ( k )
  • ( n_{d,k} ) = topic count in document ( d )
  • ( n_d ) = total words in document ( d )
  • ( V ) = vocabulary size
  • ( K ) = number of topics

Steps in Gibbs Sampling (see the sketch after this list):

  1. Remove current word-topic assignment.
  2. Compute probabilities for all topics.
  3. Sample a new topic assignment.
  4. Update count matrices.
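
The four steps above translate almost directly into code. Below is a minimal collapsed Gibbs sampler sketch; the corpus format (a list of documents, each a list of word indices in [0, V)), the hyperparameter defaults, and the iteration count are assumptions for illustration.

```python
# Minimal collapsed Gibbs sampler for LDA, following the conditional above.
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_kw = np.zeros((K, V))          # count of each word in each topic
    n_k = np.zeros(K)                # total words assigned to each topic
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init

    for d, doc in enumerate(docs):                         # initialize counts
        for w, k in zip(doc, z[d]):
            n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                # 1. remove current assignment
                n_kw[k, w] -= 1; n_k[k] -= 1; n_dk[d, k] -= 1
                # 2. conditional probability for every topic; the (n_d + K*alpha)
                #    denominator is constant in k and cancels after normalization.
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())           # 3. sample a new topic
                z[d][i] = k
                n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1   # 4. update counts
    return z, n_kw, n_dk
```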

Variational Inference (Optimization-based)

Variational methods approximate the intractable posterior with a factorized distribution and optimize the evidence lower bound (ELBO). The Dirichlet factors are:

$$ q(\theta) = \prod_d \text{Dirichlet}(\gamma_d), \quad q(\phi) = \prod_k \text{Dirichlet}(\lambda_k) $$

together with a multinomial factor over each word's topic assignment z_{d,n}.

To optimize the ELBO, coordinate-ascent updates are iterated. Writing φ_{d,n,k} for the variational probability that word n of document d is assigned to topic k (not to be confused with the topic-word distribution φ_k), the updates are:

$$ \phi_{d,n,k} \propto \exp\left(\mathbb{E}_q[\log \theta_{d,k}] + \mathbb{E}_q[\log \phi_{k, w_{d,n}}]\right) $$

$$ \gamma_{d,k} = \alpha_k + \sum_{n} \phi_{d,n,k} $$

Both expectations have closed forms in terms of digamma functions of the variational Dirichlet parameters γ_d and λ_k.

Variational inference scales well but introduces approximation error.
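
As a sketch of the per-document updates, the snippet below assumes the expected log topic-word terms E[log φ] have already been computed from the λ_k parameters (via digamma functions); the initialization and iteration count are illustrative choices, not prescribed values.

```python
# Per-document coordinate-ascent updates, assuming precomputed
# e_log_phi[k, w] = E_q[log phi_{k,w}], e.g. digamma(lambda_k) - digamma(lambda_k.sum()).
import numpy as np
from scipy.special import digamma

def update_document(doc, alpha, e_log_phi, iters=50):
    """doc: array of word indices; alpha: (K,) prior; e_log_phi: (K, V)."""
    K = e_log_phi.shape[0]
    gamma = np.full(K, alpha.mean() + len(doc) / K)          # common initialization
    for _ in range(iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())  # E_q[log theta_d]
        # responsibilities phi_{d,n,k}, one row per word (log-space for stability)
        log_resp = e_log_theta[None, :] + e_log_phi[:, doc].T
        resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        gamma = alpha + resp.sum(axis=0)                      # gamma_d update
    return gamma, resp
```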


Dirichlet-multinomial Conjugacy

A key aspect of LDA's inference process is the conjugate relationship between the Dirichlet and multinomial distributions. Given the prior:

$$ \phi_k \sim \text{Dirichlet}(\beta) $$

After observing word counts ( n_{k,w} ), the posterior is:

$$ \phi_k | \text{data} \sim \text{Dirichlet}(\beta + n_{k,\cdot}) $$

This conjugacy simplifies computation and allows efficient sampling.
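
A toy example of the update, with a made-up five-word vocabulary and made-up counts, to show that the posterior is obtained simply by adding observed counts to the prior:

```python
# Dirichlet-multinomial conjugacy in action; vocabulary size and counts are invented.
import numpy as np

rng = np.random.default_rng(0)
V = 5
beta = np.full(V, 0.01)                  # symmetric Dirichlet prior
n_kw = np.array([12, 0, 3, 0, 7])        # observed counts of each word in topic k

posterior_params = beta + n_kw           # Dirichlet(beta + n_{k,.})
phi_k_sample = rng.dirichlet(posterior_params)           # one posterior draw of phi_k
phi_k_mean = posterior_params / posterior_params.sum()   # posterior mean
print(phi_k_mean.round(3))
```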


Simplex geometry and LDA

LDA operates in simplex space: each document's topic proportions θ_d are a point inside the topic simplex. For ( K ) topics, this simplex is ( (K-1) )-dimensional.

  • 2D Simplex: Triangle (3 topics).
    Each vertex represents a pure topic, and points inside represent mixtures of the three topics.

  • 3D Simplex: Tetrahedron (4 topics).
    Documents cluster near edges or faces depending on topic dominance.

Geometric representation:
In the ( K )-dimensional ambient space, each pure topic corresponds to a standard basis vector (a vertex of the simplex), e.g. for ( K = 3 ):

$$ \text{topic}_1 = (1,0,0), \quad \text{topic}_2 = (0,1,0), \quad \text{topic}_3 = (0,0,1) $$

Adding topics increases the dimensionality of the topic simplex; likewise, each topic-word distribution φ_k is a point in the ( (V-1) )-dimensional word simplex, which grows with the vocabulary. Simplex projections help visualize topic mixtures, as in the sketch below.
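
For ( K = 3 ), a document's topic proportions can be mapped onto a triangle by treating θ_d as barycentric coordinates of the three vertices. The sketch below does this with NumPy; the vertex placement and the sampled θ values are illustrative.

```python
# Project document-topic proportions onto the 2-D simplex (triangle) for K = 3.
import numpy as np

rng = np.random.default_rng(0)
K = 3
# Vertices of an equilateral triangle in the plane, one per pure topic.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

thetas = rng.dirichlet(np.full(K, 0.3), size=200)   # 200 sampled document-topic vectors
points = thetas @ vertices                          # barycentric -> 2-D coordinates

# A point near a vertex means that topic dominates; points near the centroid
# are even mixtures. `points` can be fed to any 2-D scatter plot.
print(points[:3].round(3))
```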


Next

  • Why Dirichlet priors work for LDA: understanding the role of priors in sparsity.

  • High-dim intuition: Visualizing LDA in more than 3D.

  • Alternative topic models: Exploring PLSA, HDP, and neural topic models.

  • Evaluating topic coherence: Metrics beyond perplexity.

  • Online LDA: Extending LDA for streaming data.
