A pipeline for performing common genetic epidemiology tasks at the University of Bristol.
Goals:
- Implement common steps in GWAS investigations to create reproducible, more efficient research
- Reusable pipelines that can be utilised on different projects
- Shared code and steps that can be updated according to the latest knowledge and practices
Here is a list of the available pipelines and the steps they run:
| standardise_gwas.smk | compare_gwases.smk | disease_progression.smk | qtl_mr.smk |
|---|---|---|---|
| Takes in any of: vcf, csv, tsv, txt (and zip, gz) | All steps in `standardise_gwas.smk` | All steps in `standardise_gwas.smk` | All steps in `standardise_gwas.smk` |
| Convert Reference Build (GRCh38 -> GRCh37) | PLINK clumping | Runs some collider bias corrections, compares results | Run MR against top hits of specific QTL dataset |
| Populate RSID from CHR, BP, EA, and OA | Calculate heterogeneity between GWASes | Miami Plot of Collider Bias Results | Volcano Plot of Results |
| Converts z-scores and OR to BETA | LDSC h2 and rg | Expected vs. Observed Comparison | Run coloc of significant top hit MR results |
| Auto-populate GENE ID <-> ENSEMBL ID | Expected vs. Observed Comparison | | |
| Unique SNP = CHR:BP_EA_OA (EA < OA alphabetically) | | | |
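The unique-SNP convention in the last row can be sketched in Python as follows (the function name is illustrative, not part of the pipeline code):

```python
def unique_snp_id(chrom, bp, ea, oa):
    """Build the unique SNP identifier CHR:BP_EA_OA,
    with the two alleles ordered alphabetically (EA < OA)."""
    a1, a2 = sorted([ea.upper(), oa.upper()])
    return f"{chrom}:{bp}_{a1}_{a2}"

# The same variant yields the same ID regardless of allele order:
print(unique_snp_id(1, 123456, "T", "A"))  # 1:123456_A_T
print(unique_snp_id(1, 123456, "A", "T"))  # 1:123456_A_T
```

Sorting the alleles makes the identifier stable across datasets that report effect and other alleles in opposite orders.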
1. Clone the repository:
   ```
   git clone git@github.com:MRCIEU/GeneHackman.git && cd GeneHackman
   ```
2. Ensure you have conda installed and initialised before activating the shared environment:
   ```
   conda activate /mnt/storage/private/mrcieu/data/genomic_data/pipeline/genehackman
   ```
   (You can alternatively create your own conda environment: `conda env create --file environment.yml`.)
3. Populate the DATA_DIR, RESULTS_DIR, and RDFS_DIR environment variables in the `.env` file.
   - These should probably be in your work or scratch space (e.g. `/user/work/userid/...`).
   - RDFS_DIR is optional; if set, all generated files can be copied there automatically. Please ensure the path ends in `working/`.
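A minimal `.env` might look like the following (all paths are placeholder values; substitute your own userid and locations):

```
# .env - pipeline environment variables (example values only)
DATA_DIR=/user/work/userid/genehackman/data
RESULTS_DIR=/user/work/userid/genehackman/results
# Optional: generated files are copied here; the path must end in working/
RDFS_DIR=/path/to/your/rdfs/space/working/
```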
4. Copy the input template for your chosen pipeline, e.g.:
   ```
   cp snakemake/input_templates/compare_gwases.json input.json
   ```
   Each pipeline (as defined in the `snakemake` directory) has its own input format. You can either copy the template into `input.json`, or supply the file to the script from another location.
5. Run the pipeline:
   ```
   ./run_pipeline.sh snakemake/<specific_pipeline>.smk <optional_input_file.json>
   ```
   `run_pipeline.sh` is just a convenience wrapper around the `snakemake` command; if you want to do anything out of the ordinary, please read up on snakemake.
   - If there are errors while running the pipeline, you can find error messages either directly on the screen, or in the slurm log file that is written on error.
   - It is recommended that you run your pipeline inside a tmux session.
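For example, a run of the standardisation pipeline inside tmux might look like this (the session name and input file name are illustrative):

```
# Start (or re-attach to) a tmux session so the run survives disconnects
tmux new -A -s genehackman

# Run the standardisation pipeline with a custom input file
./run_pipeline.sh snakemake/standardise_gwas.smk my_input.json

# Detach with Ctrl-b d; re-attach later with: tmux attach -t genehackman
```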
The standard column naming for GWASes is:

| CHR | BP | EA | OA | BETA | SE | P | EAF | SNP | RSID |
|---|---|---|---|---|---|---|---|---|---|
A full list of names and default values can be found here
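As a quick sanity check, a Python sketch that reports which standard columns are absent from a summary-statistics header (the helper is illustrative, not part of the pipeline):

```python
import csv
import io

# The pipeline's standard GWAS column names
STANDARD_COLS = ["CHR", "BP", "EA", "OA", "BETA", "SE", "P", "EAF", "SNP", "RSID"]

def missing_columns(header_line, delimiter="\t"):
    """Return the standard columns absent from a file's header row."""
    header = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    return [col for col in STANDARD_COLS if col not in header]

# Example: a header missing EAF and RSID
print(missing_columns("CHR\tBP\tEA\tOA\tBETA\tSE\tP\tSNP"))  # ['EAF', 'RSID']
```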
There are 3 main components to the pipeline:
- Snakemake to define the steps to complete for each pipeline.
- Docker / Singularity container with installed languages (R and Python), packages, OS libraries, and code
- Slurm: each snakemake step spins up a singularity container inside a slurm job. Each step can specify different slurm requirements.
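A minimal sketch of how such a step might be declared in a Snakemake file (the rule name, file paths, container image, and resource values are all illustrative, not taken from the pipeline):

```snakemake
rule example_step:
    input:
        "results/standardised_gwas.tsv.gz"
    output:
        "results/example_step_output.tsv"
    # Per-step slurm requirements, picked up when snakemake submits the job
    resources:
        mem_mb=16000,
        runtime=120
    # The step runs inside the pipeline's singularity container (image name hypothetical)
    container:
        "docker://example/genehackman-image"
    shell:
        "scripts/example_step.sh {input} > {output}"
```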
- `R` directory holds R package code that can be called and reused by any step in the pipeline (accessed by a CLI script)
- `scripts` directory holds the scripts that can be easily called by snakemake (`Rscript example.R --input_ex example_input`)
- `snakemake` directory defines the pipeline steps and configuration, and shared code between pipelines
- `docker` directory holds the information for creating the docker image that the pipeline runs
- `tests` directory holds all R tests, and an end-to-end pipeline test script
If you want to make any additions / changes please contact andrew.elmore@bristol.ac.uk, or open an issue in this repo.