pyVAE is a modification of popVAE (manuscript; GitHub) and is designed to fit a variational autoencoder (VAE) to generalized multi-dimensional data (e.g., transcriptome expression data) and output the latent space.
This repository is forked from popVAE and can also be used to install an archived version of popVAE (v0.1). If you use pyVAE, the manuscript describing popVAE should be cited (see #Citation below).
What is a VAE?
A VAE is a machine-learning method that uses neural networks to learn a latent representation of data. It involves two steps, an encoding step and a decoding step (see image below). For the purposes of pyVAE, we use the majority of our data to train a model and -- once trained -- we can re-input our data into the model to view it in reduced-dimension latent space.
Importantly, VAEs maintain what is known as "global structure": the distances between points in latent space are meaningful. Other machine-learning methods for dimensionality reduction that have become common in fields such as single-cell RNAseq -- such as UMAP and t-SNE -- fail to maintain global structure. While useful for data visualization, these methods are therefore not well suited for downstream analyses and can even distort the data to produce erroneous results (Chari et al. 2022).
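The encode-sample-decode flow described above can be sketched with plain NumPy. This is only a conceptual illustration of the data flow (random, untrained weights; toy dimensions) -- it is not pyVAE's actual implementation, which is built on TensorFlow:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dimensions (hypothetical): 10 samples, 20 features, 2 latent dims
n_samples, n_features, n_latent = 10, 20, 2
x = rng.normal(size=(n_samples, n_features))

# Random, untrained weights -- this only illustrates the shapes and
# data flow, not a fitted model.
W_enc_mu = rng.normal(size=(n_features, n_latent))
W_enc_logvar = rng.normal(size=(n_features, n_latent))
W_dec = rng.normal(size=(n_latent, n_features))

# Encoder: map each sample to the parameters of a latent Gaussian
mu = x @ W_enc_mu            # latent means
logvar = x @ W_enc_logvar    # latent log-variances

# Reparameterization trick: sample z = mu + sigma * eps
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Decoder: reconstruct the input from the latent sample
x_hat = z @ W_dec

print(z.shape)      # (10, 2)  -- the reduced-dimension latent space
print(x_hat.shape)  # (10, 20) -- the reconstruction
```

In a real VAE the weights are trained to minimize reconstruction loss plus a KL-divergence term; here they are random, so only the shapes are meaningful.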
pyVAE requires python 3.7 and tensorflow 1.15. We recommend first installing anaconda3, then installing pyVAE in a new environment.
Clone this repo and install with:
conda create --name pyVAE python=3.7.7
conda activate pyVAE
git clone https://github.com/rhettrautsaw/pyVAE.git
cd pyVAE
python setup.py install
pyVAE requires input in tab-delimited txt format.
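A toy input file in this format can be written with pandas. Note that the orientation used here (samples as rows, features as columns) is an assumption for illustration -- check `pyVAE.py --h` for the layout pyVAE actually expects:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 5 samples x 8 features (e.g. expression values).
# Row/column orientation is an assumption -- verify against pyVAE's docs.
data = pd.DataFrame(
    rng.normal(size=(5, 8)),
    index=[f"sample_{i}" for i in range(5)],
    columns=[f"gene_{j}" for j in range(8)],
)

# Write as tab-delimited txt
data.to_csv("toy_input.txt", sep="\t")

# Round-trip check
check = pd.read_csv("toy_input.txt", sep="\t", index_col=0)
print(check.shape)  # (5, 8)
```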
SECTION IN PROGRESS
Working on making a small test dataset. Generally you will fit a model with:
pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf --out out/pabu_test --seed 42
It should fit in less than a minute on a regular laptop CPU. For running on larger datasets we recommend using a CUDA-enabled GPU.
At default settings pyVAE will output 4 files:
pabu_test_latent_coords.txt -- best-fit latent space coordinates by sample.
pabu_test_history.txt -- training and validation loss by epoch.
pabu_test_history.pdf -- a plot of training and validation loss by epoch.
pabu_test_training_preds.txt -- latent coordinates output during model training, stored every --prediction_freq epochs.
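Both txt outputs are plain tab-delimited tables, so they are easy to inspect with pandas. The column names below (epoch, loss, val_loss) are illustrative assumptions -- check the header of your own history file:

```python
import io
import pandas as pd

# Stand-in for pabu_test_history.txt; real column names may differ --
# check the header of your own output file.
history_txt = (
    "epoch\tloss\tval_loss\n"
    "0\t1.90\t2.10\n"
    "1\t1.40\t1.60\n"
    "2\t1.10\t1.55\n"
)
history = pd.read_csv(io.StringIO(history_txt), sep="\t")

# Find the epoch with the lowest validation loss
best = history.loc[history["val_loss"].idxmin()]
print(int(best["epoch"]))  # 2
```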
Many hyperparameters and filtering options can be adjusted at the command line.
Run pyVAE.py --h to see all parameters.
Default settings work well on most datasets, but validation loss can usually be improved by tuning hyperparameters. We've seen most effects from changing three settings: network size, early stopping patience, and the proportion of samples used for model training versus validation.
--search_network_sizes runs short optimizations for a range of network sizes and selects the network with the lowest validation loss. Alternately, --depth and --width set the number of layers and the number of hidden units per layer in the network. If you're running low on GPU memory, reducing --width will help.
--patience sets the number of epochs the optimizer will run after the last improvement in validation loss -- we've found that increasing this value (to, say, 300) sometimes helps with small datasets.
--train_prop sets the proportion of samples used for model training, with the rest used for validation.
To run a grid search over a specific set of network sizes with increased patience and a larger validation set on the test data, use:
pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf \
--out out/pabu_test --seed 42 --patience 300 \
--search_network_sizes --width_range 32,256,512 \
--depth_range 3,5,8 --train_prop 0.75
I recommend using the R package ggpubr for plotting the results of pyVAE.
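If you prefer to stay in Python, a rough matplotlib equivalent might look like the sketch below. The column names (sampleID, LD1, LD2) are assumptions for illustration -- check the actual header of your latent_coords file:

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Stand-in for pabu_test_latent_coords.txt; real column names may
# differ -- inspect the header of your own output file.
coords_txt = (
    "sampleID\tLD1\tLD2\n"
    "A\t0.1\t-0.3\n"
    "B\t-0.2\t0.5\n"
    "C\t0.4\t0.1\n"
)
latent = pd.read_csv(io.StringIO(coords_txt), sep="\t")

# Scatter the two latent dimensions, one labeled point per sample
fig, ax = plt.subplots()
ax.scatter(latent["LD1"], latent["LD2"])
for _, row in latent.iterrows():
    ax.annotate(row["sampleID"], (row["LD1"], row["LD2"]))
ax.set_xlabel("LD1")
ax.set_ylabel("LD2")
fig.savefig("latent_space.png")
print(latent.shape)  # (3, 3)
```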
SECTION IN PROGRESS
pabu_test_latent_coords.txt
If you use pyVAE, please cite popVAE as well as my own paper (IN PREP).