Note: this repository should only be used for educational purposes. For production use, I recommend https://github.com/bab2min/tomotopy instead, which is more production-ready.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non‑parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and easily accommodates growing data collections. The hLDA model combines this prior with a likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
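For intuition, the nested CRP assigns each document a root-to-leaf path through the tree: at every node, an existing child is chosen with probability proportional to how many documents already passed through it, and a brand-new child with probability proportional to a concentration parameter gamma. A minimal sketch of that draw (illustrative only, not this package's API):

```python
import random

def draw_ncrp_path(tree, gamma, depth):
    """Sample one root-to-leaf path from a nested Chinese restaurant process.

    `tree` maps each node to {child: count of documents that chose it}.
    Illustrative sketch only -- not code from this package.
    """
    node, path = "root", ["root"]
    for _ in range(depth - 1):
        children = tree.setdefault(node, {})
        r = random.uniform(0.0, sum(children.values()) + gamma)
        cum, chosen = 0.0, None
        for child, count in children.items():
            cum += count
            if r < cum:
                chosen = child
                break
        if chosen is None:  # "new table": create a brand-new child node
            chosen = f"{node}/{len(children)}"
        children[chosen] = children.get(chosen, 0) + 1
        node = chosen
        path.append(node)
    return path

tree = {}
paths = [draw_ncrp_path(tree, gamma=1.0, depth=3) for _ in range(5)]
```

Because counts accumulate in `tree`, later documents tend to reuse popular branches while gamma controls how often new branches appear.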
The original papers describing the algorithm are:
- Hierarchical Topic Models and the Nested Chinese Restaurant Process
- The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies
This repository contains a pure Python implementation of the Gibbs sampler for hLDA. It is intended for experimentation and as a reference implementation. The code follows the approach used in the original Mallet implementation but with a simplified interface and a fixed depth for the tree.
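During sampling, each word's level along its document's current path is resampled from a conditional that combines a document-level Dirichlet prior over levels with the word's likelihood under each level's topic. A sketch of the usual collapsed-Gibbs form of that conditional (a hypothetical helper for illustration; see hlda/sampler.py for the actual code):

```python
import numpy as np

def level_distribution(doc_level_counts, level_word_counts, level_totals,
                       word, alpha, eta, vocab_size):
    """Collapsed-Gibbs conditional for one word's level assignment.

    P(level) ~ (n_dl + alpha) * (n_lw + eta) / (n_l + V * eta)
    Hypothetical helper for illustration, not this package's internals.
    """
    prior = doc_level_counts + alpha                     # Dirichlet over levels
    likelihood = (level_word_counts[:, word] + eta) / (level_totals + vocab_size * eta)
    weights = prior * likelihood
    return weights / weights.sum()

# e.g. with empty counts the result is uniform over a 3-level path:
p = level_distribution(np.zeros(3), np.zeros((3, 1000)), np.zeros(3),
                       word=42, alpha=1.0, eta=0.1, vocab_size=1000)
```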
Key features include:
- Python 3.11+ support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- A test suite for verifying the sampler on synthetic data and a small BBC corpus.
The package can be installed directly from PyPI:
```
pip install hlda
```

Alternatively, to develop locally, clone this repository and install it in editable mode:
```
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
```

The easiest way to get started is by using the sample BBC dataset provided in the `data/` directory. You can run the full demonstration from the command line:
```
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
```

If you installed the package from PyPI you can run the same demo via the `hlda-run` command:
```
hlda-run --data-dir data/bbc/tech --iterations 20
```

To write the learned hierarchy to disk in JSON format, pass `--export-tree <file>` when running the script:
```
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
```

If you make use of the BBC dataset, please cite the publication by Greene and Cunningham (2006) as detailed in `CITATION.cff`.
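Once exported, the tree can be inspected with a few lines of Python. The snippet below assumes each node is serialised as an object with `words` and `children` keys; check the actual file for the exact schema:

```python
import json

def print_tree(node, indent=0):
    # The "words"/"children" keys are an assumed layout, not a documented schema.
    print("  " * indent + ", ".join(node.get("words", [])[:5]))
    for child in node.get("children", []):
        print_tree(child, indent + 1)

with open("tree.json") as f:
    print_tree(json.load(f))
```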
Example scripts for the BBC dataset and synthetic data are available in the `examples/` directory.
Within Python you can also construct the sampler directly:
```python
from hlda.sampler import HierarchicalLDA

corpus = [["word", "word", ...], ...]  # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})

hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
```
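After sampling you can walk the learned hierarchy. The snippet below assumes the sampler exposes a `root_node` whose nested nodes carry a `children` list and a `get_top_words` method; verify against `hlda/sampler.py` before relying on these names:

```python
def show(node, depth=0):
    # `children` and `get_top_words` are assumed node attributes -- check
    # hlda/sampler.py for the actual interface.
    print("  " * depth + str(node.get_top_words(5, with_weight=False)))
    for child in node.children:
        show(child, depth + 1)

show(hlda.root_node)
```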
The package provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. This allows using the sampler inside a standard `Pipeline`:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

from hlda.sklearn_wrapper import HierarchicalLDAEstimator

vectorizer = CountVectorizer()

# Expand each row of the document-term matrix back into a list of repeated
# token indices, paired with the fitted vocabulary, as the sampler expects.
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)

pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])

pipeline.fit(documents)        # documents: an iterable of raw text strings
assignments = pipeline.transform(documents)
```

The repository includes a small test suite that checks the sampler on both the BBC corpus and synthetic data. After installing the development dependencies you can run:
```
pytest -q
```

All tests should pass in a few seconds.
This project is licensed under the terms of the MIT license. See
LICENSE.txt for details.