Bayesian Nonparametric

Bayesian Nonparametric models with Python.

Models follow scikit-learn's API and can be used as its extension.

Current model:

Hierarchical Dirichlet Process

HDP is similar to LDA (Latent Direchlet Allocation) but assumes an "infinite" number of topics. This implementation is based on Chong Wang's online-hdp and optimized with cython.

Reference:

"Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013
"Online Variational Inference for the Hierarchical Dirichlet Process", Chong Wang, John Paisley, David M. Blei, 2011
Chong Wang's online-hdp code.

Install:

# clone repoisitory
git clone git@github.com:chyikwei/bnp.git
cd bnp

# install dependencies (cython, numpy, scipy, scikit-learn)
pip install -r requirements.txt
pip install .

Getting started:

In bnp.utils we proivde a function to generate fake document-word matrix with hidden topics. We will run our HDP model with it.

First, we can generate a document-word matrix with 5 hidden topics. (each topic has 10 uniuque words and each topic has 100 docs.)

>>> from __future__ import print_function
>>> from bnp.online_hdp import HierarchicalDirichletProcess
>>> from bnp.utils import make_doc_word_matrix

>>> tf = make_doc_word_matrix(n_topics=5,
...                           words_per_topic=10,
...                           docs_per_topic=100,
...                           words_per_doc=20,
...                           shuffle=True,
...                           random_state=0)
>>> tf.shape
(500, 50)

For samples in the matrix, each row(document) only contains words from a specific topic (word 0 to 9: topic 1, 10 to 19: topic 2,...)

>>> tf[0].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 1, 4, 1, 2, 3, 3, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])
>>> tf[1].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 1, 3, 2, 1, 2, 0, 3, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])

Next we fit a HDP model with this matrix

>>> hdp = HierarchicalDirichletProcess(n_topic_truncate=10,
...                                    n_doc_truncate=3,
...                                    max_iter=5,
...                                    random_state=0)
>>> hdp.fit(tf)

Then we can print out topic proportion and top topic words in HDP model.

# print topic function
>>> def print_top_words(model, n_words):
...     topic_distr = model.topic_distribution()
...     for topic_idx in range(model.lambda_.shape[0]):
...         topic = model.lambda_[topic_idx, :]
...         message = "Topic %d (proportion: %.2f): " % (topic_idx, topic_distr[topic_idx])
...         message += " ".join([str(i) for i in topic.argsort()[:-n_words - 1:-1]])
...         print(message)

>>> print_top_words(hdp, 10)
Topic 0 (proportion: 0.20): 3 1 7 5 8 4 0 2 9 6
Topic 1 (proportion: 0.00): 49 12 22 21 20 19 18 17 16 15
Topic 2 (proportion: 0.04): 43 49 44 45 47 40 46 48 41 42
Topic 3 (proportion: 0.13): 14 18 10 15 16 12 17 19 11 13
Topic 4 (proportion: 0.07): 19 16 10 15 11 17 12 13 18 14
Topic 5 (proportion: 0.01): 23 29 28 20 21 25 26 24 27 22
Topic 6 (proportion: 0.01): 31 38 35 39 30 33 34 37 32 36
Topic 7 (proportion: 0.19): 35 31 39 30 33 38 32 34 36 37
Topic 8 (proportion: 0.16): 48 42 46 49 45 47 41 44 40 43
Topic 9 (proportion: 0.19): 21 29 28 23 20 24 26 27 25 22

Here HDP find 7 large topics (> 1%) and those can map to the hidden topics we generated before.

Examples

In bnp/examples folder. (Will add ipython notebook soon)

Running Test:

python setup.py test

Uninstall:

pip uninstall bnp

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.circleci		.circleci
bnp		bnp
ci_scripts		ci_scripts
doc		doc
examples		examples
.gitignore		.gitignore
.nojekyll		.nojekyll
.travis.yml		.travis.yml
MANIFEST.in		MANIFEST.in
README.md		README.md
appveyor.yml		appveyor.yml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bayesian Nonparametric

Current model:

Reference:

Install:

Getting started:

Examples

Running Test:

Uninstall:

About

Releases

Packages

Languages

chyikwei/bnp

Folders and files

Latest commit

History

Repository files navigation

Bayesian Nonparametric

Current model:

Reference:

Install:

Getting started:

Examples

Running Test:

Uninstall:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages