0.9.0 release

sigven · May 21, 2019 · 2640ad8
1 parent 8b5e122
commit 2640ad8
Showing 11 changed files with 782 additions and 171 deletions.
diff --git a/README.md b/README.md
@@ -6,20 +6,24 @@ The germline variant annotator (*gvanno*) is a simple, Docker-based software pac
 
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
 
-#### Annotation resources included in _gvanno_ - 0.8.0
+#### Annotation resources included in _gvanno_ - 0.9.0
 
-* [VEP v95](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor (GENCODE v29/v19 as the gene reference dataset)
-* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (February 2019)
-* [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP
-* [dbSNP build 151](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (October 2017) - from VEP
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v96 (GENCODE v30/v19 as the gene reference dataset)
+* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.0, May 2019)
+* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
+* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 151, October 2017) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar 20190305](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (March 2019)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (May 2019)
 * [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
-* [UniProt/SwissProt KnowledgeBase 2019_02](http://www.uniprot.org) - Resource on protein sequence and functional information (February 2019)
-* [Pfam v32](http://pfam.xfam.org) - Database of protein families and domains (Sept 2018)
+* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2019_04, May 2019)
+* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v32, Sept 2018)
 * [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 13th 2019)
 
 ### News
+* May 21st 2019 - **0.9.0 release**
+     * Data bundle updates: ClinVar, UniProt
+	* Adding gene-disease associations from [Open Targets Platform](https://targetvalidation.org),([Carvalho-Silva et. al, NAR, 2019](https://www.ncbi.nlm.nih.gov/pubmed/30462303))
+	* Moved *vcf-validation* configuration to command-line option
 * March 21st 2019 - **0.8.0 release**
      * Data bundle updates: ClinVar, UniProt, GWAS catalog
      * Bundle bug: Missing VEP FASTA file for grch38
@@ -73,15 +77,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Download and unpack the [latest software release (0.8.0)](https://github.com/sigven/gvanno/releases/tag/v0.8.0)
+1. Download and unpack the [latest software release (0.9.0)](https://github.com/sigven/gvanno/releases/tag/v0.9.0)
 2. Download and unpack the assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](https://drive.google.com/file/d/1cJRaSD_UgeG34CnE3PHj3vxXSAAMN9Jl) (approx 14Gb)
-   * [grch38 data bundle](https://drive.google.com/file/d/1uZw5iEibKJV_9SmCusHcpzKBVzTu2pcH) (approx 15Gb)
+   * [grch37 data bundle](https://drive.google.com/open?id=1rqkzHTmPpBsVY3MvzCQdJurKuCDNf09D) (approx 14Gb)
+   * [grch38 data bundle](https://drive.google.com/open?id=13pn59FpLU7Tta7X16H2GKkOfbsclPi9I) (approx 15Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
     A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (0.8.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
-   * `docker pull sigven/gvanno:0.8.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (0.9.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
+   * `docker pull sigven/gvanno:0.9.0` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -95,42 +99,35 @@ We __strongly__ recommend that the input VCF is compressed and indexed using [bg
 
 A few elements of the workflow can be figured using the *gvanno* configuration file (i.e. **gvanno.toml**), encoded in [TOML](https://github.com/toml-lang/toml) (an easy to read file format).
 
-* The initial step of the workflow performs [VCF validation](https://github.com/EBIvariation/vcf-validator) on the input VCF file. This procedure is very strict, and often causes the workflow to return an error due to various violations of the VCF specification. If the user trusts that the most critical parts of the input VCF is properly encoded,  a setting in the configuration file (`vcf_validation = false`) can be used to turn off VCF validation.
-
 * Prediction of loss-of-function variants using VEP's LOFTEE plugin can be turned on in the configuration file (`lof_prediction = true`). Do note that this frequently increases the run time for VEP significantly.
 
 #### STEP 5: Run example
 
 Run the workflow with **gvanno.py**, which takes the following arguments and options:
 
-	usage: gvanno.py [-h] [--force_overwrite] [--version]
-			  query_vcf gvanno_dir output_dir {grch37,grch38} configuration_file
-			  sample_id
+		usage: gvanno.py [options] <QUERY_VCF> <GVANNO_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY> <CONFIG_FILE> <SAMPLE_ID>
 
-	Germline variant annotation (gvanno) workflow for clinical and functional
-	interpretation of germline nucleotide variants
+		Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants
 
-	positional arguments:
-	query_vcf			VCF input file with germline variants (SNVs/InDels)
-	gvanno_dir            gvanno base directory with accompanying data
-				    directory, e.g. ~/gvanno-0.8.0
-	output_dir            Output directory
-	{grch37,grch38}       grch37 or grch38
-	configuration_file    gvanno configuration file (TOML format)
-	sample_id             Sample identifier - prefix for output files
+		positional arguments:
+		query_vcf           VCF input file with germline query variants (SNVs/InDels)
+		gvanno_dir          gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.9.0
+		output_dir          Output directory
+		{grch37,grch38}     grch37 or grch38
+		configuration_file  gvanno configuration file (TOML format)
+		sample_id           Sample identifier - prefix for output files
 
-	optional arguments:
-	-h, --help            show this help message and exit
-	--force_overwrite     The script will fail with an error if the output file
-				    already exists. Force the overwrite of existing result
-				    files by using this flag (default: False)
-	--version             show program's version number and exit
+		optional arguments:
+		-h, --help          show this help message and exit
+		--force_overwrite   The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag
+		--version           show program's version number and exit
+		--no_vcf_validate   Skip validation of input VCF with Ensembl's vcf-validator
 
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-`python ~/gvanno-0.8.0/gvanno.py  ~/gvanno-0.8.0/examples/example_grch37.vcf.gz`
-` ~/gvanno-0.8.0 ~/gvanno-0.8.0/examples grch37 ~/gvanno-0.8.0/gvanno.toml example`
+`python ~/gvanno-0.9.0/gvanno.py  ~/gvanno-0.9.0/examples/example_grch37.vcf.gz`
+` ~/gvanno-0.9.0 ~/gvanno-0.9.0/examples grch37 ~/gvanno-0.9.0/gvanno.toml example`
 
 
 This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
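
The README hunk above moves VCF validation from `gvanno.toml` to a `--no_vcf_validate` command-line switch. As a minimal sketch of how such a switch could be parsed and turned into the 1/0 value that the workflow passes on, consider the following. The helper names `build_parser` and `vcf_validation_flag` are hypothetical and not part of gvanno; only one positional argument is kept, for brevity.

```python
import argparse


def build_parser():
    """Build a cut-down parser with the new switch (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="gvanno.py",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    # store_true: the attribute is False unless the flag is given
    parser.add_argument(
        "--no_vcf_validate",
        action="store_true",
        help="Skip validation of the input VCF",
    )
    parser.add_argument("query_vcf", help="VCF input file")
    return parser


def vcf_validation_flag(no_vcf_validate):
    """Map the boolean switch to the integer used downstream (1 = validate)."""
    return 0 if no_vcf_validate else 1
```

With this sketch, `build_parser().parse_args(["--no_vcf_validate", "example_grch37.vcf.gz"])` yields `no_vcf_validate=True`, which maps to a validation value of 0; omitting the flag keeps validation on.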

diff --git a/gvanno.py b/gvanno.py
@@ -10,20 +10,23 @@
 import getpass
 import platform
 import toml
+from argparse import RawTextHelpFormatter
 
 
-gvanno_version = '0.8.0'
-db_version = 'GVANNO_DB_VERSION = 20190320'
-vep_version = '95'
+
+gvanno_version = '0.9.0'
+db_version = 'GVANNO_DB_VERSION = 20190521'
+vep_version = '96'
 global vep_assembly
 
 def __main__():
 
-   parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+   parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=RawTextHelpFormatter, usage="%(prog)s [options] <QUERY_VCF> <GVANNO_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY> <CONFIG_FILE> <SAMPLE_ID>")
    parser.add_argument('--force_overwrite', action = "store_true", help='The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag')
    parser.add_argument('--version', action='version', version='%(prog)s ' + str(gvanno_version))
+   parser.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator")
    parser.add_argument('query_vcf', help='VCF input file with germline query variants (SNVs/InDels)')
-   parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.8.0')
+   parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.9.0')
    parser.add_argument('output_dir',help='Output directory')
    parser.add_argument('genome_assembly',choices = ['grch37','grch38'], help='grch37 or grch38')
    parser.add_argument('configuration_file',help='gvanno configuration file (TOML format)')
@@ -53,7 +56,7 @@ def __main__():
       gvanno_error_message(err_msg,logger)
    host_directories = verify_input_files(args.query_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)
 
-   run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, gvanno_version)
+   run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.no_vcf_validate, args.genome_assembly, gvanno_version)
 
 
 def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger):
@@ -78,7 +81,7 @@ def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger)
       gvanno_error_message(err_msg, logger)
 
 
-   boolean_tags = ['vep_skip_intergenic', 'vcf_validation', 'lof_prediction']
+   boolean_tags = ['vep_skip_intergenic', 'lof_prediction']
    integer_tags = ['n_vcfanno_proc','n_vep_forks','buffer_size']
    for section in ['other']:
       if section in user_options:
@@ -246,7 +249,7 @@ def getlogger(logger_name):
 
    return logger
 
-def run_gvanno(host_directories, docker_image_version, config_options, sample_id, genome_assembly, gvanno_version):
+def run_gvanno(host_directories, docker_image_version, config_options, sample_id, no_vcf_validate, genome_assembly, gvanno_version):
    """
    Main function to run the gvanno workflow using Docker
    """
@@ -256,7 +259,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
    output_pass_vcf = 'None'
    uid = ''
    vep_assembly = 'GRCh38'
-   gencode_version = 'release 29'
+   gencode_version = 'release 30'
    if genome_assembly == 'grch37':
       gencode_version = 'release 19'
       vep_assembly = 'GRCh37'
@@ -272,6 +275,9 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       uid = 'root'
 
    vepdb_dir_host = os.path.join(str(host_directories['db_dir_host']),'.vep')
+   vcf_validation = 1
+   if no_vcf_validate:
+      vcf_validation = 0
    data_dir = '/data'
    output_dir = '/workdir/output'
    vep_dir = '/usr/local/share/vep/data'
@@ -284,17 +290,31 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
    if host_directories['input_conf_basename_host'] != 'NA':
       input_conf_docker = '/workdir/input_conf/' + str(host_directories['input_conf_basename_host'])
 
-   docker_command_run1 = 'NA'
+   vep_volume_mapping = str(vepdb_dir_host) + ":/usr/local/share/vep/data"
+   databundle_volume_mapping = str(host_directories['base_dir_host']) + ":/data"
+   input_vcf_volume_mapping = str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf"
+   input_conf_volume_mapping = str(host_directories['input_conf_dir_host']) + ":/workdir/input_conf"
+   output_volume_mapping = str(host_directories['output_dir_host']) + ":/workdir/output"
+
+   docker_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" +  str(databundle_volume_mapping) + " -v=" + str(vep_volume_mapping) + " -v=" + str(input_conf_volume_mapping) + " -v=" + str(output_volume_mapping)
    if host_directories['input_vcf_dir_host'] != 'NA':
-      docker_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(vepdb_dir_host) + ":/usr/local/share/vep/data -v=" + str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf -v=" + str(host_directories['input_conf_dir_host']) + ":/workdir/input_conf -v=" + str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir/output " + str(docker_image_version) + " sh -c \""
-   docker_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir " + str(docker_image_version) + " sh -c \""
+      docker_command_run1 = docker_command_run1  + " -v=" + str(input_vcf_volume_mapping)
+
+   docker_command_run1 = docker_command_run1 + " -w=/workdir/output " + str(docker_image_version) + " sh -c \""
+   docker_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" +  str(databundle_volume_mapping) + " -v=" + str(output_volume_mapping)
+   docker_command_run2 = docker_command_run2 + " -w=/workdir/output " + str(docker_image_version) + " sh -c \""
    docker_command_run_end = '\"'
 
-
+   logger = getlogger("gvanno-start")
+   logger.info("--- germline variant annotation (gvanno) workflow ----")
+   logger.info("Sample name: " + str(sample_id))
+   logger.info("Genome assembly: " + str(genome_assembly))
+   print()
+
    ## verify VCF and CNA segment file
    logger = getlogger('gvanno-validate-input')
    logger.info("STEP 0: Validate input data")
-   vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(genome_assembly) + docker_command_run_end
+   vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(vcf_validation) + " "  + str(genome_assembly) + docker_command_run_end
 
    check_subprocess(vcf_validate_command)
    logger.info('Finished')
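
The hunk above builds the `docker run` prefix by string concatenation, adding the input VCF mount only when its host directory is set. One possible alternative, shown here only as a sketch and not as the commit's approach, is to assemble the command as an argument list. The function `build_docker_command` and the example paths are hypothetical; the `-v=`, `-w=`, `--rm`, `-t`, and `-u` options and the `'NA'` placeholder are taken from the diff.

```python
def build_docker_command(uid, image, volume_mappings, workdir, inner_command):
    """Assemble a `docker run` invocation as an argument list.

    volume_mappings is a list of (host_path, container_path) pairs;
    pairs whose host path is 'NA' (unset, as in the diff) are skipped.
    """
    cmd = ["docker", "run", "--rm", "-t", "-u", str(uid)]
    for host_path, container_path in volume_mappings:
        if host_path == "NA":
            continue
        cmd.append("-v=" + str(host_path) + ":" + container_path)
    # The inner command is a single argument to `sh -c`, so it needs
    # no surrounding escaped quotes.
    cmd += ["-w=" + workdir, image, "sh", "-c", inner_command]
    return cmd
```

A list built this way can be handed to `subprocess.run(cmd, check=True)` without `shell=True`, which avoids quoting problems when host paths contain spaces. Whether that fits with the existing `check_subprocess` helper is not visible in this diff.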

diff --git a/gvanno.toml b/gvanno.toml
@@ -1,11 +1,6 @@
 # gvanno configuration options (TOML).
 
 [other]
-## Keep/skip VCF validation by https://github.com/EBIvariation/vcf-validator. The vcf-validator checks
-## that the input VCF is properly encoded. Since the vcf-validator is strict, and with error messages
-## that is not always self-explanatory, the users can skip validation if they are confident that the
-## most critical parts of the VCF are properly encoded
-vcf_validation = true
 ## Number of processes for vcfanno
 n_vcfanno_proc = 4
 ## Number of forks for VEP