multiple imputation for association studies
- Lightning Auriga (@lightning-auriga)
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
- Clone this repository to your local system, into the place where you want to perform the data analysis.
```bash
git clone https://github.com/lightning-auriga/statgen-mi.git
```
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `manifest.tsv` to specify your sample setup.
The following settings are available in primary user configuration under `config/config.yaml`; an example configuration sketch is provided after this list:

- `manifest`: location of primary run manifest file; defaults to `config/manifest.tsv`
- `tools`: configuration options for association tools supported by the workflow
  - `bcftools`: configuration options specific to bcftools
    - `executable`: command to launch bcftools (see note below)
    - `plugin_path`: path to plugin libraries for this version of bcftools
    - note: this workflow uses `bcftools +setGT` to randomize genotypes from imputed probabilities. this functionality is not actually present in `+setGT`, but has been modded in; see this repo for the code. eventually, this will hopefully get migrated somewhere more useful like conda, but a local build suffices for this early stage. recommend placing the bcftools repo at `../bcftools` relative to this workflow
  - `plink2`: configuration options specific to plink2 `--glm` methods
    - `executable`: command to launch plink2. if using conda, this should remain the default `plink2`
    - `maxthreads`: maximum number of threads to deploy in a plink2 task
    - `maxmem`: maximum RAM (in MB) supplied to a plink2 task
    - `mi_draws`: number of multiple imputation simulated sets to run for plink2 MI runs
- `imputed_datasets`: user-defined sets of imputed data that can be selected for analysis
  - each tag under `imputed_datasets` should be unique, and can be used to refer to the dataset in the manifest
  - each tag should contain under it:
    - `type`: descriptor of imputed file type. currently the only accepted value is `minimac4`
    - `filename`: full path to and name of imputed data file(s). for minimac4: the dose.vcf.gz file(s). if multiple paths are specified in an array, the files will each be processed in turn and concatenated (in order) after run completion
- `regression_models`: user-defined sets of phenotypes and covariates that can be selected for analysis
  - each tag under `regression_models` should be unique, and can be used to refer to the model in the manifest
  - each tag should contain under it:
    - `filename`: full path to and name of plink-format phenotype file containing any relevant variables. other variables can also be present
    - `phenotype`: primary outcome for this regression model, as the corresponding header entry in the phenotype file
    - `covariates`: (optional) list of covariates for this regression model, as the corresponding header entry or entries in the phenotype file
    - `model`: descriptor of association type. currently recognized options are `linear` or `logistic`
    - `vif`: (optional) for tools that support this, primarily plink: variance inflation factor cap above which a model is suppressed
- `queue`: user-defined configuration data for compute queues
  - `small_partition`: slurm partition (or equivalent for other cluster profiles) for jobs with the following restrictions:
    - max RAM will never exceed 3500M
    - max time will be less than 10 minutes
    - this is exposed to save money, but can just be set to the same value as the below partition if desired
  - `large_partition`: slurm partition (or equivalent for other cluster profiles) for jobs using maximum per-tool analysis settings, with the following additional restrictions:
    - at least 8000M RAM should be available for a task
    - jobs on the partition should be permitted to run at least four hours before being killed
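As a reference point only, here is a minimal sketch of what `config/config.yaml` might look like with these settings populated. All tags, paths, thread/memory values, and partition names below are hypothetical placeholders, and the exact syntax (for example, whether `covariates` is written as a list) should be checked against the configuration shipped with the repository:

```yaml
## hypothetical sketch only: tags, paths, and partition names are placeholders
manifest: "config/manifest.tsv"
tools:
  bcftools:
    executable: "../bcftools/bcftools"    # assumes a local build of the patched bcftools at ../bcftools
    plugin_path: "../bcftools/plugins"    # assumed plugin directory of that local build
  plink2:
    executable: "plink2"                  # default is fine under conda
    maxthreads: 4
    maxmem: 16000                         # MB
    mi_draws: 10                          # number of MI simulated sets
imputed_datasets:
  my_imputed_set:                         # hypothetical tag, referenced in the manifest
    type: "minimac4"
    filename:
      - "/path/to/chr21.dose.vcf.gz"
      - "/path/to/chr22.dose.vcf.gz"
regression_models:
  my_model:                               # hypothetical tag, referenced in the manifest
    filename: "/path/to/phenotypes.tsv"   # plink-format phenotype file
    phenotype: "pheno_name"
    covariates:
      - "age"
      - "sex"
    model: "linear"
    vif: 50                               # optional
queue:
  small_partition: "short"
  large_partition: "standard"
```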
Each desired MI run should be configured in a row of the run manifest, by default at `config/manifest.tsv`. The following entries are required for each run (a hypothetical example row is sketched after this list):

- `analysis`: unique identifier for this particular run
- `imputed_dataset`: tag for desired imputed dataset to use, as enumerated in `config/config.yaml`
- `tool`: supported association tool for analysis
- `regression_model`: tag for desired regression model to use, as enumerated in `config/config.yaml`
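As an illustration only, a manifest row using the hypothetical tags from the configuration sketch above might look like the following (columns are tab-separated; spacing below is for readability):

```text
analysis        imputed_dataset    tool      regression_model
my_first_run    my_imputed_set     plink2    my_model
```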
Install Snakemake using conda:
```bash
conda create -c bioconda -c conda-forge -n snakemake snakemake
```
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
```bash
conda activate snakemake
```
Test your configuration by performing a dry-run via
```bash
snakemake --use-conda -n
```
Execute the workflow locally via

```bash
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```bash
snakemake --use-conda --profile /path/to/slurm-profile --jobs 100
```
See the Snakemake documentation for further details.
More information will be added here as it becomes available. For now, draft results are populated under `results/{analysis}/{imputed_dataset}/{tool}/{regression_model}`.
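For example, under the hypothetical manifest row sketched earlier, that pattern would expand to something like:

```text
results/my_first_run/my_imputed_set/plink2/my_model/
```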
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

```bash
git commit -a
git push
```
Testing for embedded snakemake python scripts is in `workflow/scripts/tests` and handled with `pytest`. `snakemake_unit_tests` integration is TBD.
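A minimal way to run those tests from the repository root, assuming `pytest` is available in the active environment:

```bash
pytest workflow/scripts/tests
```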
- see the changelog for details.
- note that this was originally written under the name "Cameron Palmer"