This repository contains Python code for Bayesian Nonparametric Learning with a Dirichlet process prior. More details can be found in the paper below:
Fong, E., Lyddon, S. and Holmes, C. Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.03175
To install the `npl` package, clone the repository and run

```
python3 setup.py develop
```
Although the setup installs the dependencies automatically, you may need to install `pystan` separately using pip if setuptools isn't working correctly. Please make sure the version of `pystan` is newer than v2.19.0.0, or the evaluate scripts may not work properly. The code has been tested on Python 3.6.7.
- The current implementation uses all cores available on the local machine. If this is undesired, pass the number of cores as `n_cores` to the function `bootstrap_gmm` or `bootstrap_logreg` in the run scripts.
- If running on a multi-core computer, make sure to restrict `numpy` to use 1 thread per process, so that `joblib` can parallelize without CPU oversubscription, with the bash command:

```
export OPENBLAS_NUM_THREADS=1
```
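Equivalently, the thread limit can be set from inside Python, as long as it happens before `numpy` is first imported. A minimal sketch (the extra variable names cover other common BLAS backends and are an assumption, not something the run scripts require):

```python
import os

# Restrict BLAS backends to one thread per process so that joblib can
# parallelize across processes without CPU oversubscription. These must
# be set before numpy is first imported to take effect.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"   # assumed: covers MKL-backed numpy builds
os.environ["OMP_NUM_THREADS"] = "1"   # assumed: covers OpenMP-backed builds

import numpy as np  # now limited to a single BLAS thread per process
```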
A directory overview is given below:
- `npl` - Contains the main functions for the posterior bootstrap and for evaluating posterior samples on test data:
  - `bootstrap_logreg.py` and `bootstrap_gmm.py` contain the main posterior bootstrap sampling functions for generating the randomized weights and parallelizing.
  - `maximise_logreg.py` and `maximise_gmm.py` contain functions for sampling the prior pseudo-samples, initialising random restarts, and maximising the weighted log likelihood. These functions can be edited to use NPL with different models and priors.
  - `./evaluate` contains functions for calculating log posterior predictives of the different posteriors.
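As a rough illustration of the randomized weights these functions generate (a minimal sketch of the scheme from the paper, not the repository's actual API): each posterior bootstrap sample draws Dirichlet weights over the n observations plus T prior pseudo-samples, with the pseudo-samples sharing total mass equal to the DP concentration parameter c, and then maximises the resulting weighted log likelihood.

```python
import numpy as np

def draw_posterior_bootstrap_weights(n, T, c, rng):
    """Draw one set of randomized weights for a posterior bootstrap sample.

    n : number of observations (each gets Dirichlet parameter 1)
    T : number of prior pseudo-samples (sharing total prior mass c)
    c : DP concentration parameter
    """
    alpha = np.concatenate([np.ones(n), np.full(T, c / T)])
    return rng.dirichlet(alpha)  # weights are nonnegative and sum to 1

rng = np.random.default_rng(0)
w = draw_posterior_bootstrap_weights(n=100, T=10, c=1.0, rng=rng)
```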
- `experiments` - Contains scripts for running the main experiments
- `supp_experiments` - Contains scripts for running the supplementary experiments
- Download the MNIST files from http://yann.lecun.com/exdb/mnist/.
- Extract and place them in `./samples`, so the folder contains the files:

```
t10k-images-idx3-ubyte
t10k-labels-idx1-ubyte
train-images-idx3-ubyte
train-labels-idx1-ubyte
```
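These files use the simple big-endian `idx` format described on the download page. A hedged, stdlib-only sketch of a reader (illustrative only; it is not the loader the run scripts use):

```python
import struct

def read_idx(raw: bytes):
    """Parse an idx-format byte string into (shape, flat list of byte values).

    Header: two zero bytes, a dtype code (0x08 = unsigned byte), the number
    of dimensions, then each dimension as a big-endian uint32, then the data.
    """
    zeros, dtype, ndims = struct.unpack(">HBB", raw[:4])
    assert zeros == 0 and dtype == 0x08, "expected unsigned-byte idx data"
    shape = struct.unpack(">" + "I" * ndims, raw[4:4 + 4 * ndims])
    data = list(raw[4 + 4 * ndims:])
    return shape, data

# Usage (path is illustrative):
# with open("./samples/train-images-idx3-ubyte", "rb") as f:
#     shape, pixels = read_idx(f.read())
```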
- Download the Adult, Polish companies bankruptcy (3rd year), and Arcene datasets from the UCI Machine Learning Repository, links below:
  - Adult - https://archive.ics.uci.edu/ml/datasets/adult
  - Polish - https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data
  - Arcene - https://archive.ics.uci.edu/ml/datasets/Arcene
- Extract and place all data files in `./data`, so the folder contains the files:

```
3year.arff
adult.data
adult.test
arcene_train.data
arcene_train.labels
arcene_valid.data
arcene_valid.labels
```
- Run `generate_gmm.py` to generate the toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper, and the files in `./sim_data` are the datasets for the tabular results.
- Run `run_NPL_toygmm.py` for the NPL example and `run_stan_toygmm.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples. The Jupyter notebook `Plot bivariate KDEs for GMM.ipynb` can be used to produce posterior plots.
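Generating toy data amounts to sampling from a finite Gaussian mixture. A minimal sketch under assumed parameters (the actual means, covariances and mixing weights are defined in `generate_gmm.py`):

```python
import numpy as np

def sample_gmm(n, means, covs, probs, rng):
    """Draw n points from a Gaussian mixture with the given components."""
    comps = rng.choice(len(probs), size=n, p=probs)  # component labels
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comps])
    return x, comps

rng = np.random.default_rng(0)
means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]  # assumed toy values
covs = [np.eye(2) * 0.5, np.eye(2) * 0.5]              # assumed toy values
x, z = sample_gmm(500, means, covs, probs=[0.5, 0.5], rng=rng)
```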
- Run `run_NPL_MNIST.py` for the NPL example and `run_stan_MNIST.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_MNIST.py` to evaluate posterior samples. The Jupyter notebook `Plot MNIST KDE.ipynb` can be used to produce posterior plots.
- Run `load_data.py` to preprocess the data and generate the different train-test splits.
- Run `run_NPL_logreg.py` for the NPL example and `run_stan_logreg.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_logreg.py` to evaluate posterior samples. The Jupyter notebook `Plot marginal KDE (for Adult).ipynb` can be used to produce posterior plots.
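For reference, the weighted log likelihood that the NPL maximisation step targets for logistic regression can be written as below. This is a sketch of the objective only; `maximise_logreg.py` holds the actual implementation and prior handling.

```python
import numpy as np

def weighted_loglik(theta, x, y, w):
    """Dirichlet-weighted log likelihood for logistic regression.

    theta : (d,) parameters, x : (n, d) covariates,
    y : (n,) labels in {0, 1}, w : (n,) nonnegative weights.
    """
    logits = x @ theta
    # Bernoulli log likelihood with a sigmoid link, written stably:
    # y * logit - log(1 + exp(logit)), weighted and summed over the data.
    return float(np.sum(w * (y * logits - np.logaddexp(0.0, logits))))

# One data point with zero logit: p = 0.5, so the log likelihood is log(0.5).
val = weighted_loglik(np.zeros(2), np.array([[1.0, 1.0]]),
                      np.array([1.0]), np.array([1.0]))
```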
- Covariate data is not included for privacy reasons. Run `load_data.py` to generate simulated covariates from Normal(0,1) (uncorrelated, unlike the real data) and pseudo-phenotypes.
- Run `run_NPL_genetics.py` for the NPL example.
- The Jupyter notebook `Plotting Sparsity Plots.ipynb` can be used to produce sparsity plots.
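The covariate simulation described above can be sketched as follows. The dimensions, sparse effect sizes and logistic link here are assumptions for illustration; `load_data.py` defines the actual generation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50  # assumed sample size and number of covariates

# Uncorrelated standard-normal covariates, unlike the correlated real data.
x = rng.standard_normal((n, d))

# Pseudo-phenotypes from an assumed sparse logistic model:
# only a few covariates carry nonzero effects.
beta = np.zeros(d)
beta[:3] = [1.5, -1.0, 0.5]  # assumed sparse effect sizes
p = 1.0 / (1.0 + np.exp(-(x @ beta)))
y = rng.binomial(1, p)       # binary pseudo-phenotypes
```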
- The Jupyter notebook `Normal location model.ipynb` contains all experiments and plots.
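In the normal location model the weighted MLE is available in closed form, so each posterior bootstrap sample is just a Dirichlet-weighted mean of the data. A minimal sketch without prior pseudo-samples (the notebook contains the full experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)  # assumed toy data

# Each NPL posterior sample: draw Dirichlet(1, ..., 1) weights over the
# data and take the weighted mean, which maximises the weighted log
# likelihood for the normal location parameter.
B = 2000
w = rng.dirichlet(np.ones(y.size), size=B)   # (B, n) weight draws
samples = w @ y                              # (B,) posterior samples of the mean
```

The resulting posterior centres on the sample mean of `y`, with spread shrinking as the sample size grows.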
- Run `generate_gmm.py` to generate the toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper.
- Run `run_NPL_toygmm.py` for the NPL example (note that the MDP example will be run too) and `run_IS_toygmm.py` for the importance sampling example.
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples on test data.
- Run `generate_gmm.py` to generate the toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper, and the files in `./sim_data` are for the tabular results.
- First run `run_stan_toygmm.py` to generate the NUTS samples (required for MDP-NPL) and ADVI samples, then run `run_NPL_toygmm.py` for MDP-NPL and DP-NPL (note that the IS example will be run too).
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples on test data. The Jupyter notebook `Plot bivariate KDEs for GMM.ipynb` can be used to produce posterior plots.