Alice Braun edited this page Oct 23, 2024 · 8 revisions

Welcome to the SAFFARI wiki!

SAFFARI is a comprehensive statistical and functional fine-mapping pipeline that combines four methods (SuSiE, FINEMAP, Polyfun+SuSiE, Polyfun+FINEMAP), two reference panels (an LD panel in PLINK format, and UKB) and different ranges of fine-mapping windows.

This workflow takes GWAS summary statistics as input and writes one set of output files per method, reference panel and window. In each file, the SNPs are ranked by their posterior inclusion probability (PIP) and by their inclusion in a 95% credible set.
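To make the output format concrete, here is a minimal sketch of ranking SNPs by PIP and building a 95% credible set. It is a simplification, not the procedure SuSiE or FINEMAP use internally: it assumes a single causal signal, normalises the PIPs to sum to one, and adds SNPs greedily until the requested coverage is reached. The function name and the toy PIP values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_and_credible_set(
    pips: Dict[str, float], coverage: float = 0.95
) -> Tuple[List[str], List[str]]:
    """Rank SNPs by PIP and build a greedy credible set (single-signal assumption)."""
    total = sum(pips.values())
    if total <= 0:
        raise ValueError("PIPs must sum to a positive value")
    # Sort by descending PIP; ties are broken by SNP ID so the order is deterministic.
    ranked = sorted(pips, key=lambda snp: (-pips[snp], snp))
    credible: List[str] = []
    cumulative = 0.0
    for snp in ranked:
        credible.append(snp)
        cumulative += pips[snp] / total
        if cumulative >= coverage:
            break
    return ranked, credible


# Toy PIPs for one locus (made-up values, for illustration only).
toy_pips = {"rs1": 0.62, "rs2": 0.25, "rs3": 0.09, "rs4": 0.04}
ranked, credible = rank_and_credible_set(toy_pips)
print(ranked)    # all SNPs, highest PIP first
print(credible)  # smallest prefix of the ranking reaching 95% cumulative PIP
```

In the pipeline's real output, credible-set membership comes from the fine-mapping tools themselves; the sketch only illustrates how the two columns relate.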

There are two different branches that are currently supported:

  • main: this branch can be used to run multiple GWAS at a time.
  • using_GWAS_QTL_dictionary: customized for the Raj Group at Mount Sinai only. This branch can be used to run multiple GWAS at a time using the GWAS_QTL_dictionary Excel file provided by the Raj Group.

The pipeline comprises the following two Snakemake modules:

  • selecting the correct UKB LD matrix for each locus to be fine-mapped, while formatting the top loci file accordingly (fetch_UKB_LD_names),

  • running the fine-mapping pipeline using GWAS summary statistics and LD reference panels (fine-mapping_multiple or fine-mapping_HRC_multiple).

Please make sure to check the README file here for how to name the folders containing the LD panels and priors, and where to download them.

Current Snakemake version used: 7.6.2.

To download the latest SAFFARI version, simply do:

git clone https://github.com/mkoromina/SAFFARI

Introduction to Snakemake

You can find full documentation of Snakemake here. In this section, I’ll provide a brief overview and highlight some particularly useful commands. Snakemake is a pipeline tool based on Python. It consists of a series of rules, each acting as a set of instructions that guide Snakemake in generating specific outputs from given inputs. When a user requests an output, Snakemake executes all necessary rules to produce that output.
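The idea that requesting an output triggers every rule needed to build it can be illustrated with a toy resolver. This is only a sketch of the concept, not Snakemake's implementation (which also handles wildcards, timestamps and parallel scheduling), and the rule and file names below are hypothetical rather than SAFFARI's actual ones.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule table: output file -> (rule name, input files).
RULES: Dict[str, Tuple[str, List[str]]] = {
    "loci_ranges.tsv": ("fetch_ld_names", ["toploci.csv"]),
    "finemapped.tsv": ("finemap", ["sumstats.gz", "loci_ranges.tsv"]),
}


def plan(target: str, available: Set[str]) -> List[str]:
    """Return the rule names to run, in order, to produce `target`."""
    if target in available:
        return []  # the file already exists, so no rule has to run
    if target not in RULES:
        raise FileNotFoundError(f"no rule produces {target!r}")
    rule_name, inputs = RULES[target]
    jobs: List[str] = []
    for needed in inputs:
        # Resolve each missing input first, skipping rules already planned.
        for job in plan(needed, available):
            if job not in jobs:
                jobs.append(job)
    jobs.append(rule_name)
    return jobs


jobs = plan("finemapped.tsv", available={"sumstats.gz", "toploci.csv"})
print(jobs)  # upstream rule first, then the rule that makes the target
```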

--use-conda

This flag tells Snakemake to create and use the conda environment specified for each rule. This is a handy and reproducible way of installing and running code in a tightly controlled software environment. Always use this flag when running SAFFARI.

-np

This flag combination (-n for a dry run, -p to print the shell commands) makes Snakemake print all the jobs it would run, without actually running them. This is particularly useful if you want to see what would happen if you were to request a certain output or rule, and it helps avoid accidentally triggering hundreds of unwanted jobs.

--cores

This flag specifies the number of cores requested to run the SAFFARI pipeline. You can increase the number of requested cores when running more computationally intensive procedures.

--configfile

This parameter can be used to specify the .yaml file you want Snakemake to use as the configuration file. Snakemake reads the default config.yaml file located in the pipeline directory to obtain its default parameters. This file is described below in detail (see here).

You can run the pipeline like this: snakemake --profile slurm --configfile config.yaml --use-conda.
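If you launch the pipeline from a script, the flags above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of SAFFARI; it only uses the flags described in this section and prints the resulting command instead of executing it.

```python
import shlex
from typing import List, Optional


def build_snakemake_command(
    configfile: str = "config.yaml",
    profile: Optional[str] = None,
    cores: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble a snakemake invocation as an argument list (hypothetical helper)."""
    cmd = ["snakemake", "--configfile", configfile, "--use-conda"]
    if profile is not None:
        cmd += ["--profile", profile]  # e.g. a slurm or lsf cluster profile
    if cores is not None:
        cmd += ["--cores", str(cores)]
    if dry_run:
        cmd.append("-np")  # print the planned jobs without running them
    return cmd


dry = build_snakemake_command(cores=4, dry_run=True)
print(shlex.join(dry))

cluster = build_snakemake_command(profile="slurm")
print(shlex.join(cluster))
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True) from within the activated snakemake environment. Doing a dry run first is a cheap way to confirm that the config file points at the intended inputs.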

Important notes

  • You will need to activate the snakemake conda/mamba environment prior to the pipeline execution.

  • Make sure to also follow the directory and file structure found on the GitHub page. The main directories within SAFFARI are "workflow", "resources" and "polyfun"; "scripts" and "envs" are subdirectories of "workflow". Please check the READMEs in these directories too.

  • To run the Snakemake pipeline, I strongly recommend setting up a cluster profile. Detailed instructions for configuring profiles to run Snakemake jobs can be found here. For example, you can set up a slurm profile or an lsf profile, which allows job submission and execution to be parallelized. In that case, a simple job submission would look like this: snakemake --profile slurm --configfile config.yaml --use-conda

  • --configfile config.yaml: make sure the config lists the correct top loci file for each Snakefile used. The fetch_UKB_names_LD_multiple Snakefile needs the files ending in "_toploci.csv", and finemapping_multiple needs the files ending in "_loci_ranges.tsv". Examples of both types of files can be found in the resources/ folder.

!! Please note that the option --cluster-config is deprecated in the latest Snakemake versions.

  • Options for fine-mapping windows: (i) range.left to range.right ranges, representing the GWS locus windows, or (ii) beginning and end, representing a 3 Mb window (optional).

Both are output by the fetch_UKB_LD_names Snakemake module. Either pair can also be passed directly through the --start and --end flags of the fine-mapping rules within the Snakefile.
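The two window options can be sketched as follows. Option (i) simply reuses the locus boundaries. For option (ii), the sketch makes assumptions that are not confirmed by this page: that the 3 Mb windows lie on a fixed grid starting at positions 1, 1000001, 2000001 and so on, and that the window whose centre is closest to the lead SNP is preferred. The function names are hypothetical, and the actual logic in fetch_UKB_LD_names may differ.

```python
from typing import Tuple

WINDOW_SIZE = 3_000_000  # 3 Mb window, as described above
STRIDE = 1_000_000       # ASSUMPTION: a new window starts every 1 Mb


def locus_window(range_left: int, range_right: int) -> Tuple[int, int]:
    """Option (i): use the GWS locus boundaries directly."""
    if range_left > range_right:
        raise ValueError("range.left must not exceed range.right")
    return range_left, range_right


def fixed_window(bp: int) -> Tuple[int, int]:
    """Option (ii): pick the 3 Mb grid window whose centre is closest to `bp`."""
    if bp < 1:
        raise ValueError("base-pair positions are 1-based")
    # Window k spans [1 + k*STRIDE, 1 + k*STRIDE + WINDOW_SIZE]; choose the k
    # whose centre is nearest to bp, clamped at the start of the chromosome.
    k = max(0, (bp - 1 - WINDOW_SIZE // 2 + STRIDE // 2) // STRIDE)
    start = 1 + k * STRIDE
    return start, start + WINDOW_SIZE


print(locus_window(2_100_000, 2_600_000))
print(fixed_window(2_350_000))
```

The sketch does not clamp windows at the chromosome end; a real implementation would need chromosome lengths for that.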

Inputs

  1. To run this Snakemake pipeline with the different "modules", you will need two main inputs: (i) formatted and cleaned GWAS summary statistics (in .gz format), and (ii) a list of top GWS loci for fine-mapping (stored as a .csv file).

Both the top loci file and the GWAS sumstats should include the respective columns as output by RICOPILI. Please check the files within the resources folder in this GitHub repository for an overview of the essential columns needed for both inputs.

  2. GWAS sumstats should be QC'ed, and any duplicate SNPs must be removed beforehand. Columns in RICOPILI-based sumstats include:

CHR SNP BP A1 A2 FRQ_A_41917 FRQ_U_371549 INFO OR SE P ngt Direction HetISqt HetDf HetPVa Nca Nco Neff_half

If you need to exclude additional SNPs based on MAF, the RICOPILI FRQ columns must be renamed before running the pipeline.

The script tries to be flexible and accommodate multiple file formats and column names. The minimum requirements are a sample size parameter (n) and a whitespace-delimited input file with SNP rsids, chromosome and base pair info, and at least one of the following: a p-value, an effect size estimate with its standard error, or a Z-score.

  3. The top loci file is derived from the RICOPILI clumping procedure and should include at least the following fields:

SNP CHR BP range.left range.right

An example of a top loci file is provided in the resources folder. Here, range.left and range.right define the fine-mapping windows. These columns can be customized and may hold any other fine-mapping ranges the user wants.
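Before launching the pipeline, it can help to check the inputs against the requirements above. The sketch below checks a top loci file for the minimum fields and reports duplicated SNP IDs. It assumes a comma-separated file with a header row; the helper names and the toy rows are hypothetical, and the same duplicate check could be applied to the sumstats after switching the parser to whitespace-delimited input.

```python
import csv
import io
from typing import Dict, Iterable, List, Set

# Minimum fields of the top loci file, as listed above.
TOP_LOCI_FIELDS = ["SNP", "CHR", "BP", "range.left", "range.right"]


def missing_fields(header: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from the header."""
    present = set(header)
    return [name for name in required if name not in present]


def duplicated_snps(rows: Iterable[Dict[str, str]]) -> Set[str]:
    """Return the SNP IDs that occur more than once."""
    seen: Set[str] = set()
    dupes: Set[str] = set()
    for row in rows:
        snp = row["SNP"]
        if snp in seen:
            dupes.add(snp)
        seen.add(snp)
    return dupes


# Toy top loci file (made-up values); a real file would be opened from disk.
toy_file = io.StringIO(
    "SNP,CHR,BP,range.left,range.right\n"
    "rs1,1,1000,500,1500\n"
    "rs2,2,2000,1500,2500\n"
    "rs1,1,1000,500,1500\n"
)
reader = csv.DictReader(toy_file)
rows = list(reader)
missing = missing_fields(reader.fieldnames or [], TOP_LOCI_FIELDS)
dupes = duplicated_snps(rows)
print(missing)  # empty list when all minimum fields are present
print(dupes)    # SNP IDs that need to be deduplicated
```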

Credits - Acknowledgements

This work would not have been feasible without the contribution and wonderful work of other researchers:

  • Omer Weissbrod,
  • Jonathan Coleman,
  • Ashvin Ravi,
  • Brian Fulton-Howard,
  • Brian Schilder.

Issues

Should any issues occur when running the pipeline, please feel free to open an issue on GitHub and include a minimal reproducible example. Contributions are also more than welcome!

(Currently under development: integration of post-fine-mapping analyses module.)
