
2. Installation

Sander W. van der Laan edited this page Aug 8, 2018 · 1 revision

Introduction

The scripts, written in Perl, Python, R, and BASH, are designed to run in a Linux environment (in this case a CentOS7 system with a Sun Grid Engine backend). In addition to CentOS7, we have tested MetaGWASToolKit on macOS Sierra (version 10.12.[x]).


Installing the scripts locally.

You can run the scripts locally on a Unix-based system, such as Mac OS X (Sierra+). First make a directory to hold your git repositories, then clone this repository into it.

Step 1: Make a directory, and go there.

mkdir -p ~/git/ && cd ~/git

Step 2: Clone this repository, or update it if it already exists.

if [ -d ~/git/MetaGWASToolKit/.git ]; then \
		cd ~/git/MetaGWASToolKit && git pull; \
	else \
		cd ~/git/ && git clone https://github.com/swvanderlaan/MetaGWASToolKit.git; \
	fi

Step 3: Check for dependencies of Python, Perl and R, and install them if necessary.

MetaGWASToolKit requires several Python, Perl, and R packages and libraries. Most of the time these are readily available on your Mac or Linux environment; if not, here is how to get them.

You will need Perl installed with some obligatory modules; these include YAML, Getopt::Long, and Statistics::Distributions. Installation can be achieved like this:

sudo cpan YAML Getopt::Long Statistics::Distributions

You will also need Python 2.7.[x] with some packages for data munging and statistical analyses, among others; these include numpy, scipy, scikit-learn, pandas, and argparse. Installation can be achieved like this:

pip2 install argparse numpy scipy scikit-learn pandas
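To see which of the modules and packages above are still missing before installing anything, a small check along these lines may help. It is only a sketch: the function names and the `PYTHON_BIN` variable are hypothetical, it assumes the Python 2 interpreter is called `python2`, and note that scikit-learn is imported under the name `sklearn`.

```shell
#!/usr/bin/env bash

# Exit status 0 when the named Perl module can be loaded.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Exit status 0 when the named Python package can be imported.
# PYTHON_BIN is a hypothetical override; python2 is assumed by default.
has_python_package() {
    "${PYTHON_BIN:-python2}" -c "import $1" >/dev/null 2>&1
}

# Report anything from the lists above that cannot be loaded.
for module in YAML Getopt::Long Statistics::Distributions; do
    has_perl_module "$module" || echo "missing Perl module: $module"
done

for package in argparse numpy scipy sklearn pandas; do
    has_python_package "$package" || echo "missing Python package: $package"
done
```

If nothing is printed, both interpreters were found and every listed module could be loaded.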

You will need R version 3.4.[x]; a standard installation should suffice. At a minimum you will need optparse, tools, dplyr, tidyr, and data.table. You can install these by starting R and executing the following code, which will also install any dependencies.

install.packages.auto <- function(x) { 
  x <- as.character(substitute(x)) 
  if(isTRUE(x %in% .packages(all.available = TRUE))) { 
    eval(parse(text = sprintf("require(\"%s\")", x)))
  } else { 
    # Update installed packages - this may mean a full upgrade of R, which in turn
    # may not be warranted. 
    #update.packages(ask = FALSE) 
    eval(parse(text = sprintf("install.packages(\"%s\", dependencies = TRUE, repos = \"http://cran-mirror.cs.uu.nl/\")", x)))
  }
  if(isTRUE(x %in% .packages(all.available = TRUE))) { 
    eval(parse(text = sprintf("require(\"%s\")", x)))
  } else {
    source("http://bioconductor.org/biocLite.R")
    # Update installed packages - this may mean a full upgrade of R, which in turn
    # may not be warranted.
    #biocLite(character(), ask = FALSE) 
    eval(parse(text = sprintf("biocLite(\"%s\")", x)))
    eval(parse(text = sprintf("require(\"%s\")", x)))
  }
}

cat("\n* Checking availability of required packages and installing if needed...\n\n")
### INSTALL PACKAGES WE NEED
install.packages.auto("optparse")
install.packages.auto("tools")
install.packages.auto("dplyr")
install.packages.auto("tidyr")
install.packages.auto("data.table")
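To confirm from the command line that the R packages ended up installed, something like the following could be used. This is a sketch under the assumption that `Rscript` is on your PATH; the function name `missing_r_packages` is hypothetical and not part of MetaGWASToolKit.

```shell
#!/usr/bin/env bash

# Print the names of the requested R packages that are not installed.
# Returns 2 when Rscript itself cannot be found.
missing_r_packages() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 2
    fi
    # The package names are passed as trailing arguments to the R expression.
    Rscript -e 'wanted <- commandArgs(trailingOnly = TRUE); absent <- setdiff(wanted, rownames(installed.packages())); if (length(absent) > 0) cat(absent, sep = "\n")' "$@"
}

# Check the packages listed in this step.
if absent=$(missing_r_packages optparse tools dplyr tidyr data.table); then
    if [ -n "$absent" ]; then
        echo "R packages still missing:"
        echo "$absent"
    else
        echo "All required R packages are installed."
    fi
fi
```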

Step 4: Installation of necessary software.

MetaGWASToolKit requires you to install several software packages.

Step 5: Create necessary databases.

You will have to download and create some data needed for MetaGWASToolKit to function. The resource.creator.sh script will automagically create the necessary files. For some of these files, it is necessary to supply the proper reference data in VCF-format (version 4.1+). The files created by resource.creator.sh include:

  • DBSNPFILE -- a dbSNP file containing information per variant based on dbSNP b150 (hg19, b37).
  • REFFREQFILE -- a file containing reference frequencies per variant for the chosen reference and population.
  • VINFOFILE -- a file needed to harmonize all the cohorts in terms of variant ID, contains various variantID versions (rs[XXXX], chr[X]:bp[XXX]:A1_A2, etc.). The resulting file is used by gwas2ref.harmonizer.py later on during harmonization.
  • GENESFILE -- a file containing chromosomal basepair positions per gene, default is GENCODE.
  • REFERENCEVCF -- needed for downstream analyses, such as clumping of genome-wide significant hits, etc.

To download and install, run the following code; this should submit various jobs to create the necessary databases.

cd ~/git/MetaGWASToolKit && bash resource.creator.sh
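Because some of the resources require reference data in VCF format version 4.1 or later, it can be worth checking the version a reference file declares before supplying it. The sketch below reads the `##fileformat=VCFvX.Y` line that opens a VCF file; the function names are hypothetical and the path in the usage comment is only a placeholder.

```shell
#!/usr/bin/env bash

# Print the VCF version declared on the first line of a plain or gzipped VCF.
# Returns 1 when the first line is not a ##fileformat declaration.
vcf_version() {
    local file="$1" first
    local re='^##fileformat=VCFv([0-9]+)\.([0-9]+)'
    if [[ "$file" == *.gz ]]; then
        first=$(gzip -dc "$file" | head -n 1)
    else
        first=$(head -n 1 "$file")
    fi
    [[ "$first" =~ $re ]] || return 1
    echo "${BASH_REMATCH[1]}.${BASH_REMATCH[2]}"
}

# Succeed when the declared version is 4.1 or newer.
vcf_is_41_or_newer() {
    local version major minor
    version=$(vcf_version "$1") || return 1
    major=${version%%.*}
    minor=${version#*.}
    (( major > 4 || (major == 4 && minor >= 1) ))
}

# Usage (placeholder path):
#   vcf_is_41_or_newer /path/to/reference.vcf.gz && echo "VCF version OK"
```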

Available references

Several references are available by default:

  • HapMap 2 [HM2], version 2, release 22, b36. -- HM2 contains about 2.54 million variants, but does not include variants on the X-chromosome. Obviously few, if any, meta-analyses of GWAS will be based on that reference, but it's good to keep. View it as a 'legacy' feature. [NOT AVAILABLE YET] 🔷
  • 1000G phase 1, version 3 [1Gp1], b37. -- 1Gp1 contains about 38 million variants, including INDELs, and variation on the X, XY, and Y-chromosomes.
  • 1000G phase 3, version 5 [1Gp3], b37. -- 1Gp3 contains about 88 million variants, including INDELs, and variation on the X, XY, and Y-chromosomes. [NOT AVAILABLE YET] 🔶
  • Genome of the Netherlands, version 4 [GoNL4], b37. -- GoNL4 contains about xx million variants, including INDELs, and variation on the X, XY, and Y-chromosomes; some of which are unique for the Netherlands or are not present in dbSNP (yet). [NOT AVAILABLE YET] 🔷
  • Genome of the Netherlands, version 5 [GoNL5], b37. -- GoNL5 contains about xx million variants, including INDELs, and variation on the X, XY, and Y-chromosomes; some of which are unique for the Netherlands or are not present in dbSNP (yet). [NOT AVAILABLE YET] 🔷
  • Combination of 1Gp3 and GoNL5 [1Gp3GONL5], b37. -- This contains about 100 million variants, including INDELs, and variation on the X, XY, and Y-chromosomes; some of which are unique for the Netherlands or are not present in dbSNP (yet). [NOT AVAILABLE YET] 🔶
