pyVAE is a modification of popVAE (manuscript; GitHub) and is designed to fit a variational autoencoder (VAE) to generalized multi-dimensional data (e.g., transcriptome expression data) and output the latent space.
This repository is forked from popVAE and can also be used to install an archived version of popVAE (v0.1). If you use pyVAE, the manuscript describing popVAE should be cited (see #Citation below).
What is a VAE?
A VAE is a machine-learning method that uses neural networks to learn a latent representation of data. It involves two steps, an encoding step and a decoding step (see image below). For the purposes of pyVAE, we use the majority of our data to train a model and -- once trained -- we can re-input our data into the model to view it in reduced-dimension latent space.
Importantly, VAEs maintain what is known as "global structure": the distances between points in latent space are meaningful. Other machine-learning methods for dimensionality reduction that have become common in fields such as single-cell RNAseq -- such as UMAP and t-SNE -- fail to maintain global structure. While useful for data visualization, these methods are therefore not well suited for downstream analyses and can even distort the data to produce erroneous results (Chari et al. 2022).
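The encode-sample-decode flow described above can be sketched with plain NumPy. This is only a conceptual illustration of the data flow (random, untrained weights; toy dimensions) -- it is not pyVAE's actual implementation, which is built on TensorFlow:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dimensions (hypothetical): 10 samples, 20 features, 2 latent dims
n_samples, n_features, n_latent = 10, 20, 2
x = rng.normal(size=(n_samples, n_features))

# Random, untrained weights -- this only illustrates the shapes and
# data flow, not a fitted model.
W_enc_mu = rng.normal(size=(n_features, n_latent))
W_enc_logvar = rng.normal(size=(n_features, n_latent))
W_dec = rng.normal(size=(n_latent, n_features))

# Encoder: map each sample to the parameters of a latent Gaussian
mu = x @ W_enc_mu            # latent means
logvar = x @ W_enc_logvar    # latent log-variances

# Reparameterization trick: sample z = mu + sigma * eps
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Decoder: reconstruct the input from the latent sample
x_hat = z @ W_dec

print(z.shape)      # (10, 2)  -- the reduced-dimension latent space
print(x_hat.shape)  # (10, 20) -- the reconstruction
```

In a real VAE the weights are trained to minimize reconstruction loss plus a KL-divergence term; here they are random, so only the shapes are meaningful.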
pyVAE requires python 3.7 and tensorflow 1.15. We recommend first installing anaconda3, then installing pyVAE in a new environment.
Clone this repo and install with:
conda create --name pyVAE python=3.7.7
conda activate pyVAE
git clone https://github.com/rhettrautsaw/pyVAE.git
cd pyVAE
python setup.py install
pyVAE requires input in tab-delimited txt format.
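A toy input file in this format can be written with pandas. Note that the orientation used here (samples as rows, features as columns) is an assumption for illustration -- check `pyVAE.py --h` for the layout pyVAE actually expects:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 5 samples x 8 features (e.g. expression values).
# Row/column orientation is an assumption -- verify against pyVAE's docs.
data = pd.DataFrame(
    rng.normal(size=(5, 8)),
    index=[f"sample_{i}" for i in range(5)],
    columns=[f"gene_{j}" for j in range(8)],
)

# Write as tab-delimited txt
data.to_csv("toy_input.txt", sep="\t")

# Round-trip check
check = pd.read_csv("toy_input.txt", sep="\t", index_col=0)
print(check.shape)  # (5, 8)
```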
SECTION IN PROGRESS
Working on making a small test dataset. Generally you will fit a model with:
pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf --out out/pabu_test --seed 42
It should fit in less than a minute on a regular laptop CPU. For running on larger datasets we recommend using a CUDA-enabled GPU.
At default settings pyVAE will output 4 files:
pabu_test_latent_coords.txt -- best-fit latent space coordinates by sample.
pabu_test_history.txt -- training and validation loss by epoch.
pabu_test_history.pdf -- a plot of training and validation loss by epoch.
pabu_test_training_preds.txt -- latent coordinates output during model training, stored every --prediction_freq epochs.
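Both txt outputs are plain tab-delimited tables, so they are easy to inspect with pandas. The column names below (epoch, loss, val_loss) are illustrative assumptions -- check the header of your own history file:

```python
import io
import pandas as pd

# Stand-in for pabu_test_history.txt; real column names may differ --
# check the header of your own output file.
history_txt = (
    "epoch\tloss\tval_loss\n"
    "0\t1.90\t2.10\n"
    "1\t1.40\t1.60\n"
    "2\t1.10\t1.55\n"
)
history = pd.read_csv(io.StringIO(history_txt), sep="\t")

# Find the epoch with the lowest validation loss
best = history.loc[history["val_loss"].idxmin()]
print(int(best["epoch"]))  # 2
```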
Many hyperparameters and filtering options can be adjusted at the command line.
Run pyVAE.py --h to see all parameters.
Default settings work well on most datasets, but validation loss can usually be improved by tuning hyperparameters. We've seen most effects from changing three settings: network size, early stopping patience, and the proportion of samples used for model training versus validation.
--search_network_sizes runs short optimizations for a range of network sizes and selects the network with the lowest validation loss. Alternately, --depth and --width set the number of layers and the number of hidden units per layer in the network. If you're running low on GPU memory, reducing --width will help.
--patience sets the number of epochs the optimizer will run after the last improvement in validation loss -- we've found that increasing this value (to, say, 300) sometimes helps with small datasets.
--train_prop sets the proportion of samples used for model training, with the rest used for validation.
To run a grid search over a specific set of network sizes with increased patience and a larger validation set on the test data, use:
pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf \
--out out/pabu_test --seed 42 --patience 300 \
--search_network_sizes --width_range 32,256,512 \
--depth_range 3,5,8 --train_prop 0.75
I recommend using the R package ggpubr for plotting the results of pyVAE.
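If you prefer to stay in Python, a rough matplotlib equivalent might look like the sketch below. The column names (sampleID, LD1, LD2) are assumptions for illustration -- check the actual header of your latent_coords file:

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Stand-in for pabu_test_latent_coords.txt; real column names may
# differ -- inspect the header of your own output file.
coords_txt = (
    "sampleID\tLD1\tLD2\n"
    "A\t0.1\t-0.3\n"
    "B\t-0.2\t0.5\n"
    "C\t0.4\t0.1\n"
)
latent = pd.read_csv(io.StringIO(coords_txt), sep="\t")

# Scatter the two latent dimensions, one labeled point per sample
fig, ax = plt.subplots()
ax.scatter(latent["LD1"], latent["LD2"])
for _, row in latent.iterrows():
    ax.annotate(row["sampleID"], (row["LD1"], row["LD2"]))
ax.set_xlabel("LD1")
ax.set_ylabel("LD2")
fig.savefig("latent_space.png")
print(latent.shape)  # (3, 3)
```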
SECTION IN PROGRESS
pabu_test_latent_coords.txt
If you use pyVAE, please cite popVAE as well as my own paper (IN PREP).