0.3.1 release

sigven · Jul 5, 2018 · c96d8ff
1 parent 3442d44
Showing 10 changed files with 546 additions and 92 deletions.
diff --git a/README.md b/README.md
@@ -2,24 +2,28 @@
 
 ### Overview
 
-The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. It accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The software is largely based on [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and extends this with clinically relevant annotations retrieved flexibly through [vcfanno](https://github.com/brentp/vcfanno). The workflow produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
+The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. It accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow is largely based on [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
 
-#### Annotation resources included in _gvanno_ - 0.3.0
+#### Annotation resources included in _gvanno_ - 0.3.1
 
 * [VEP v92](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor release 92 (GENCODE v19/v28 as the gene reference dataset)
 * [dBNSFP v3.5](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (August 2017)
-* [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (October 2017)
-* [dbSNP b150](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (February 2017)
-* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013)
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (April 2018)
+* [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP
+* [dbSNP b150](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (February 2017) - from VEP
+* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
+* [ClinVar 20180603](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2018)
 * [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v5.0, May 2017)
-* [UniProt/SwissProt KnowledgeBase 2018_03](http://www.uniprot.org) - Resource on protein sequence and functional information (March 2018)
+* [UniProt/SwissProt KnowledgeBase 2018_06](http://www.uniprot.org) - Resource on protein sequence and functional information (June 2018)
 * [Pfam v31](http://pfam.xfam.org) - Database of protein families and domains (March 2017)
 * [TSGene v2.0](http://bioinfo.mc.vanderbilt.edu/TSGene/) - Tumor suppressor/oncogene database (November 2015)
 
 ### News
 
-* April 20th 2018 - 0.3.0 release
+
+* July 5th 2018 - **0.3.1 release**
+     * Data bundle updates (ClinVar, UniProt)
+     * Addition of [VEP LofTee plugin](https://github.com/konradjk/loftee) - predicts loss-of-function variants
+* April 20th 2018 - **0.3.0 release**
 	* Runs under Python3
 	* VEP version 92
 	* Support for grch38
@@ -47,15 +51,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Download and unpack the [latest software release (0.3.0)](https://github.com/sigven/gvanno/releases/tag/v0.3.0)
+1. Download and unpack the [latest software release (0.3.1)](https://github.com/sigven/gvanno/releases/tag/v0.3.1)
 2. Download and unpack the assembly-specific data bundle in the PCGR directory
-   * [grch37 data bundle](https://drive.google.com/open?id=1M4jUFLk5LwfgiWZOkKXNmQFPhl75Iy4-) (approx 9Gb)
-   * [grch38 data bundle](https://drive.google.com/file/d/1EfpUlaR8DRwFZjhJAJ8mkbbqlpENIlx5/) (approx 9Gb)
+   * [grch37 data bundle](https://drive.google.com/file/d/15NbYwwnb8J5IGhL6-RJXpAeQ-xqzjc5F/) (approx 9Gb)
+   * [grch38 data bundle](https://drive.google.com/file/d/1hr4MShsEh2Xf-_bBgDPi7t-vj32XrWJ0/) (approx 9Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
     A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (0.3.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.7Gb):
-   * `docker pull sigven/gvanno:0.3.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (0.3.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.5Gb):
+   * `docker pull sigven/gvanno:0.3.1` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -84,7 +88,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 	positional arguments:
 	gvanno_dir            gvanno base directory with accompanying data
-				    directory, e.g. ~/gvanno-0.2.0
+				    directory, e.g. ~/gvanno-0.3.1
 	output_dir            Output directory
 	{grch37,grch38}       grch37 or grch38
 	configuration_file    gvanno configuration file (TOML format)
@@ -101,10 +105,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 	--version             show program's version number and exit
 
 
-The _examples_ folder contain an example VCF file. It also contain *gvanno* configuration file. Analysis of the example VCF can be performed by the following command:
+The _examples_ folder contains an example VCF file. It also contains a *gvanno* configuration file. Analysis of the example VCF can be performed by the following command:
 
-`python ~/gvanno-0.3.0/gvanno.py --input_vcf ~/gvanno-0.3.0/examples/example.vcf.gz`
-` ~/gvanno-0.3.0 ~/gvanno-0.3.0/examples grch37 ~/gvanno-0.3.0/examples/gvanno_config.toml example`
+`python ~/gvanno-0.3.1/gvanno.py --input_vcf ~/gvanno-0.3.1/examples/example.vcf.gz`
+` ~/gvanno-0.3.1 ~/gvanno-0.3.1/examples grch37 ~/gvanno-0.3.1/examples/gvanno_config.toml example`
 
 
 This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
@@ -114,6 +118,8 @@ This command will run the Docker-based *gvanno* workflow and produce the followi
 
 Similar files are produced for all variants, not only variants with a *PASS* designation.
 
+Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file.
+
 
 
 ### Contact

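The README change above points readers to the header of the annotated VCF for documentation of the annotations. A minimal sketch of how that header could be read is shown here. It assumes the header follows the standard VCF `##INFO=<ID=...,Description="...">` layout; the function names and the file path in the usage note are hypothetical and are not part of *gvanno*.

```python
import gzip
import re

# Captures the ID and Description of a VCF INFO meta-information line, e.g.
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
INFO_PATTERN = re.compile(r'^##INFO=<ID=([^,]+),.*Description="(.*?)"')


def parse_info_descriptions(header_lines):
    """Map each INFO tag ID to its Description string (hypothetical helper)."""
    descriptions = {}
    for line in header_lines:
        if not line.startswith('##'):
            break  # meta-information lines end at the #CHROM column header
        match = INFO_PATTERN.match(line)
        if match:
            descriptions[match.group(1)] = match.group(2)
    return descriptions


def read_info_descriptions(vcf_path):
    """Read INFO descriptions from a plain or gzip/bgzip-compressed VCF file."""
    opener = gzip.open if vcf_path.endswith('.gz') else open
    with opener(vcf_path, 'rt') as handle:
        return parse_info_descriptions(handle)
```

Calling `read_info_descriptions('path/to/annotated.vcf.gz')` (a placeholder path) would return a dictionary keyed by annotation tag, which can then be printed or searched. The sketch stops at the first non-`##` line, so it never reads the variant records themselves.
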
diff --git a/gvanno.py b/gvanno.py
@@ -11,7 +11,7 @@
 import platform
 import toml
 
-version = '0.3.0'
+version = '0.3.1'
 
 def __main__():
 
@@ -192,7 +192,7 @@ def verify_input_files(input_vcf, configuration_file, gvanno_config_options, bas
    f_rel_not = open(rel_notes_file,'r')
    compliant_data_bundle = 0
    for line in f_rel_not:
-      version_check = 'GVANNO_DB_VERSION = 20180416'
+      version_check = 'GVANNO_DB_VERSION = 20180629'
       if version_check in line:
          compliant_data_bundle = 1
 
@@ -294,10 +294,10 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
    if not input_vcf_docker == 'None':
 
       ## Define input, output and temporary file names
-      output_vcf = '/workdir/output/' + str(sample_id) + '_gvanno.vcf.gz'
-      output_tsv = '/workdir/output/' + str(sample_id) + '_gvanno.tsv'
-      output_pass_vcf = '/workdir/output/' + str(sample_id) + '_gvanno_pass.vcf.gz'
-      output_pass_tsv = '/workdir/output/' + str(sample_id) + '_gvanno_pass.tsv'
+      output_vcf = '/workdir/output/' + str(sample_id) + '_gvanno_' + str(genome_assembly) + '.vcf.gz'
+      output_tsv = '/workdir/output/' + str(sample_id) + '_gvanno_'  + str(genome_assembly) + '.tsv'
+      output_pass_vcf = '/workdir/output/' + str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.vcf.gz'
+      output_pass_tsv = '/workdir/output/' + str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.tsv'
       input_vcf_gvanno_ready = '/workdir/output/' + re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf.gz',host_directories['input_vcf_basename_host'])
       vep_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_vep.vcf',input_vcf_gvanno_ready)
       vep_vcfanno_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_vep.vcfanno.vcf',input_vcf_gvanno_ready)
@@ -310,7 +310,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       if genome_assembly == 'grch38':
          vep_assembly = 'GRCh38'
          fasta_assembly = "/usr/local/share/vep/data/homo_sapiens/92_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
-      vep_options = "--vcf --check_ref --flag_pick_allele --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " --hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --dir /usr/local/share/vep/data"
+      vep_options = "--vcf --check_ref --flag_pick_allele --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " --hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --plugin LoF --dir /usr/local/share/vep/data"
       if config_options['other']['vep_skip_intergenic'] == 1:
          vep_options = vep_options + " --no_intergenic"
       vep_main_command = str(docker_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_tmp_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + "\""
@@ -331,7 +331,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       print()
       logger = getlogger('gvanno-vcfanno')
       logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, dbNSFP, UniProtKB)")
-      gvanno_vcfanno_command = str(docker_command_run2) + "gvanno_vcfanno.py --num_processes "  + str(config_options['other']['n_vcfanno_proc']) + " --dbnsfp --clinvar --uniprot --gvanno_xref " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + " /data/data/" + str(genome_assembly) + "\""
+      gvanno_vcfanno_command = str(docker_command_run2) + "gvanno_vcfanno.py --num_processes "  + str(config_options['other']['n_vcfanno_proc']) + " --dbnsfp --clinvar --uniprot --pcgr_onco_xref " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + " /data/data/" + str(genome_assembly) + "\""
       check_subprocess(gvanno_vcfanno_command)
       logger.info("Finished")