Commit
Showing 28 changed files with 1,608 additions and 1,473 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,49 +1,102 @@ | ||
## gvanno - germline variant annotator | ||
## gvanno - *g*ermline *v*ariant *anno*tator | ||
|
||
### Overview | ||
|
||
The germline variant annotator (gvanno) is a stand-alone software package intended for analysis and interpretation of human germline calls. It accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The software extends basic gene and variant annotations from the [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) with up-to-date annotations retrieved flexibly through [vcfanno](https://github.com/brentp/vcfanno). | ||
The germline variant annotator (*gvanno*) is a stand-alone software package intended for analysis and interpretation of human germline calls. It accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The software extends basic annotations from [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) with up-to-date functional and clinical variant annotations retrieved flexibly through [vcfanno](https://github.com/brentp/vcfanno). | ||
|
||
#### Annotation resources included in gvanno - v0.1 | ||
#### Annotation resources included in _gvanno_ - 0.2.0 | ||
|
||
* [VEP v85](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor release 85 (GENCODE v19 as the gene model) | ||
* [dbNSFP v3.2](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (March 2016) | ||
* [ExAC r0.3.1](http://exac.broadinstitute.org/) - Germline variant frequencies exome-wide (March 2016) | ||
|
||
* [VEP v90](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor release 90 (GENCODE v27 as the gene reference dataset) | ||
* [dbNSFP v3.4](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (March 2017) | ||
* [gnomAD r1](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (March 2017) | ||
* [dbSNP b147](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (April 2016) | ||
* [1000Genomes phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) | ||
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (Nov 2016) | ||
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) | ||
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (November 2017) | ||
* [DoCM](http://docm.genome.wustl.edu) - Database of curated mutations (v3.2, April 2016) | ||
* [UniProt/SwissProt KnowledgeBase 2016_09](http://www.uniprot.org) - Resource on protein sequence and functional information (Sep 2016) | ||
* [Pfam v30](http://pfam.xfam.org) - Database of protein families and domains (June 2016) | ||
* [CIViC](http://civic.genome.wustl.edu) - Clinical interpretations of variants in cancer (November 11th 2017) | ||
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (May 2017) | ||
* [UniProt/SwissProt KnowledgeBase 2017_10](http://www.uniprot.org) - Resource on protein sequence and functional information (October 2017) | ||
* [Pfam v31](http://pfam.xfam.org) - Database of protein families and domains (March 2017) | ||
* [TSGene v2.0](http://bioinfo.mc.vanderbilt.edu/TSGene/) - Tumor suppressor/oncogene database (November 2015) | ||
* [DisGeNET v4.0 - gene-disease associations](http://www.disgenet.org) (April 2016) | ||
|
||
### Getting started | ||
|
||
#### STEP 0: Python | ||
|
||
A local installation of Python (it has been tested with [version 2.7.13](https://www.python.org/downloads/)) is required to run gvanno. Check that Python is installed by typing `python --version` in a terminal window. In addition, a [Python library](https://github.com/uiri/toml) for parsing configuration files encoded with [TOML](https://github.com/toml-lang/toml) is needed. To install, simply run the following command: | ||
|
||
pip install toml | ||
|
||
#### STEP 1: Installation of Docker | ||
|
||
1. TODO (Ghis): Bullet-proof Docker installation instructions (Mac, Windows(?), Linux) | ||
2. __IMPORTANT__ - The following represent the _minimal_ computing resources that must be assigned to the Docker virtual machine: | ||
* Memory: 5GB | ||
* CPUs: 4 | ||
1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform | ||
- installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/) | ||
- installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/) | ||
- NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with _gvanno_ (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607)) | ||
2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window | ||
3. Adjust the computing resources dedicated to Docker, i.e.: | ||
- Memory: minimum 5GB | ||
- CPUs: minimum 4 | ||
- [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) | ||
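
The resource minimums listed above can be checked from a script before a run. The following is a minimal sketch, not part of _gvanno_: the function names are hypothetical, "5GB" is read as decimal gigabytes (an assumption), and it relies on `docker info --format` exposing the `NCPU` and `MemTotal` fields. Docker may report slightly less memory than the configured value, so treat a borderline failure as a hint rather than a verdict.

```python
import subprocess

MIN_CPUS = 4                  # minimum CPU count stated in the README
MIN_MEMORY_BYTES = 5 * 10**9  # "5GB", read here as decimal gigabytes (assumption)


def meets_minimum(ncpu, mem_bytes):
    """Return True if the CPU count and memory both satisfy the minimums."""
    return ncpu >= MIN_CPUS and mem_bytes >= MIN_MEMORY_BYTES


def query_docker_resources():
    """Ask the Docker daemon for its CPU count and total memory in bytes."""
    out = subprocess.check_output(
        ["docker", "info", "--format", "{{.NCPU}} {{.MemTotal}}"]
    )
    ncpu, mem_bytes = out.decode().split()
    return int(ncpu), int(mem_bytes)
```

With Docker running, `meets_minimum(*query_docker_resources())` returns `False` when either value is below the stated minimum.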
|
||
#### STEP 2: Download _gvanno_ | ||
|
||
1. Download and unpack the [latest software release (0.2.0)](https://github.com/sigven/gvanno/releases/tag/v0.2.0) | ||
2. Download and unpack the data bundle (approx. 15 GB) in the _gvanno_ directory | ||
* Download [the accompanying data bundle](https://drive.google.com/file/d/1NSeMWpLVMBcCEDYpOLsuWSnKfZEaamip/) from Google Drive to `~/gvanno-X.X` (replace _X.X_ with the version number, e.g. `~/gvanno-0.2.0`) | ||
* Unpack the data bundle, e.g. through the following Unix command: `gzip -dc gvanno.databundle.GRCh37.YYYYMMDD.tgz | tar xvf -` | ||
|
||
A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced | ||
3. Pull the [_gvanno_ Docker image (0.2.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx. 4.2 GB): | ||
* `docker pull sigven/gvanno:0.2.0` (_gvanno_ annotation engine) | ||
|
||
#### STEP 3: Input preprocessing | ||
|
||
The _gvanno_ workflow accepts a single input file: | ||
|
||
* An unannotated (preferably single-sample) VCF file (>= v4.2) with called germline variants (SNVs/InDels) | ||
|
||
* __NOTE__: GRCh37 is currently the only supported reference genome build | ||
|
||
* We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html) | ||
* If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose) | ||
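
The compression and indexing recommendation above can be verified before launching the workflow. The following is a rough sketch with hypothetical helper names, not part of _gvanno_. It assumes the BGZF `BC` extra subfield is the first one in the gzip header (which is how bgzip normally writes it) and that the tabix index sits next to the VCF with a `.tbi` suffix.

```python
import os


def looks_bgzipped(path):
    """Heuristic: gzip magic with the FEXTRA flag, followed by a 'BC' subfield."""
    with open(path, "rb") as handle:
        header = handle.read(14)
    return (
        len(header) == 14
        and header[0:4] == b"\x1f\x8b\x08\x04"  # gzip, deflate, FEXTRA set
        and header[12:14] == b"BC"              # BGZF subfield identifier
    )


def has_tabix_index(path):
    """Return True if a tabix index (.tbi) exists alongside the VCF."""
    return os.path.exists(path + ".tbi")


def check_input_vcf(path):
    """Return a list of warnings; an empty list means both checks passed."""
    warnings = []
    if not looks_bgzipped(path):
        warnings.append("input does not look bgzip-compressed: " + path)
    if not has_tabix_index(path):
        warnings.append("no tabix index found: " + path + ".tbi")
    return warnings
```

A plain gzip file fails the first check because it lacks the BGZF extra field, even though it carries the same `.gz` suffix.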
|
||
|
||
#### STEP 4: Run example | ||
|
||
Run the workflow with **gvanno.py**, which takes the following arguments and options: | ||
|
||
usage: gvanno.py [-h] [--input_vcf INPUT_VCF] | ||
[--force_overwrite] [--version] | ||
gvanno_dir output_dir configuration_file sample_id | ||
|
||
Germline variant annotation workflow for clinical and functional interpretation of | ||
single nucleotide variants and short insertions/deletions | ||
|
||
positional arguments: | ||
gvanno_dir gvanno base directory with accompanying data directory, | ||
e.g. ~/gvanno-0.2.0 | ||
output_dir Output directory | ||
configuration_file gvanno configuration file (TOML format) | ||
sample_id Sample identifier - prefix for | ||
output files | ||
|
||
For Docker version 1.13 on Mac OS X there is an option to change the number of CPUs and the amount of RAM from the UI and restart Docker. This can be found through Docker Preferences (Advanced) in the toolbar: | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
--input_vcf INPUT_VCF | ||
VCF input file with somatic query variants | ||
(SNVs/InDels). Note: GRCh37 is currently the only | ||
reference genome build supported (default: None) | ||
--force_overwrite By default, the script will fail with an error if any | ||
output file already exists. You can force the | ||
overwrite of existing result files by using this flag | ||
(default: False) | ||
--version show program's version number and exit | ||
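
The usage text above translates into an argument list as follows. This is a sketch only: the helper name and its `python_exe`/`script` defaults are hypothetical, and it uses only the arguments and flags shown in the usage text.

```python
def build_gvanno_command(gvanno_dir, output_dir, configuration_file, sample_id,
                         input_vcf=None, force_overwrite=False,
                         python_exe="python", script="gvanno.py"):
    """Assemble the gvanno.py command line as a list suitable for subprocess."""
    command = [python_exe, script]
    # Optional arguments come first, mirroring the order in the usage text.
    if input_vcf is not None:
        command += ["--input_vcf", input_vcf]
    if force_overwrite:
        command.append("--force_overwrite")
    # The four positional arguments, in the documented order.
    command += [gvanno_dir, output_dir, configuration_file, sample_id]
    return command
```

The returned list can be passed to `subprocess.check_call`; all paths and the sample identifier are supplied by the caller.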
|
||
<img src="Docker_VM_compute_config_MacOSX.png" height="450px" width="400px"> | ||
|
||
#### STEP 2: Installation of gvanno (GRCh37) | ||
|
||
1. Make a gvanno directory, e.g. `mkdir ~/gvanno` | ||
2. Download and unpack the data bundle (approx. 16Gb) in the gvanno directory | ||
* `cd ~/gvanno` | ||
* Download the [data bundle](https://drive.google.com/drive/folders/0B8aYD2TJ472mRUpFTEc4YzlTSUk) to `~/gvanno` | ||
* Decompress and untar the data bundle, e.g.: `tar -xvzf gvanno.bundle.v0.1.grch37.tgz` | ||
3. Pull the gvanno Docker image from DockerHub: | ||
* `docker pull sigven/gvanno:latest` | ||
4. Download the [gvanno pipeline script](https://github.com/sigven/gvanno/releases/download/v0.1/gvanno.sh) to `~/gvanno` | ||
|
||
#### STEP 3: Run example | ||
### Contact | ||
|
||
1. Download the [bgzipped example VCF](https://github.com/sigven/gvanno/releases) to `~/gvanno` | ||
2. Run gvanno annotation: | ||
`./gvanno.sh ~/gvanno ~/gvanno example.vcf.gz example.annotated.vcf` | ||
sigven@ifi.uio.no |
Binary file not shown.
Binary file not shown.