PCGR annotation resources

Basic variant consequence annotation

  • VEP - Variant Effect Predictor release 96 (GENCODE v30 as gene reference database; v19 for grch37)

In silico predictions of the effect of coding variants

  • dbNSFP - database of non-synonymous functional predictions (v4.0, May 2019)

Variant frequency databases

  • gnomAD - germline variant frequencies exome-wide (r2.1, October 2018)
  • dbSNP - database of short genetic variants (b151)
  • Cancer Hotspots - a resource for statistically significant mutations in cancer (v2, 2017)
  • TCGA - somatic mutations discovered across 33 tumor type cohorts (release 16.0, March 2019)
  • ICGC-PCAWG - ICGC Pancancer Analysis of Whole Genomes (release 28, March 17th, 2019)

Variant databases of clinical utility

  • ClinVar - database of clinically related variants (May 2019)
  • DoCM - database of curated mutations (v3.2, April 2016)
  • CIViC - clinical interpretations of variants in cancer (May 18th 2019)
  • CBMDB - Cancer BioMarkers database (January 17th 2018)
  • DGIdb - database of targeted antineoplastic drugs (v3.0.2, January 2018)
  • ChEMBL - database of drugs, drug-like small molecules and their targets (ChEMBL_25, March 2019)

Protein domains/functional features

  • UniProt/SwissProt KnowledgeBase - resource on protein sequence and functional information (2019_04, May 2019)
  • Pfam - database of protein families and domains (v32, September 2018)

Knowledge resources on gene and protein targets

  • CancerMine - Literature-mined database of tumor suppressor genes/proto-oncogenes (v12, May 2019)
  • Open Targets Platform - Database on disease-target associations and target tractability aggregated from multiple sources (literature, pathways, mutations) (2019_04)
  • DisGeNET - curated associations between human genes and different tumor types (v6.0, January 2019)
  • TCGA driver genes - predicted cancer driver genes based on application of multiple driver gene prediction tools on TCGA pan-cancer cohort

Pathway databases

Notes on variant annotation datasets

Genome mapping

A requirement for PCGR variant annotation datasets is that variants have been mapped unambiguously to the human reference genome. For most datasets (e.g. dbSNP, ClinVar), this requirement is not an issue. However, a fraction of the variants in the datasets related to clinical interpretation, CIViC and CBMDB, have not been mapped to the genome. Whenever possible, we have utilized TransVar to identify the actual genomic variants (e.g. g.chr7:140453136A>T) that correspond to variants reported at the amino acid level or with other HGVS nomenclature (e.g. p.V600E).

For variants that have been mapped to only one genome build (GRCh37 or GRCh38), we have utilized the CrossMap package to lift the datasets over to the other build.
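As a small illustration of the genomic notation used above, the following Python sketch splits a string such as g.chr7:140453136A>T into chromosome, position, reference allele and alternate allele. It does not reproduce what TransVar or CrossMap do internally; the names `parse_genomic_snv` and `GenomicSNV` are hypothetical, and the pattern only covers single-nucleotide substitutions written in exactly this style.

```python
import re
from typing import NamedTuple


class GenomicSNV(NamedTuple):
    """Hypothetical container for a single-nucleotide genomic variant."""
    chrom: str
    pos: int
    ref: str
    alt: str


# Assumption: input looks like "g.chr7:140453136A>T" (SNVs only).
_SNV_PATTERN = re.compile(
    r"g\.chr(?P<chrom>[0-9]{1,2}|X|Y|MT?):"
    r"(?P<pos>[0-9]+)(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)


def parse_genomic_snv(text: str) -> GenomicSNV:
    """Parse a genomic SNV string; raise ValueError if it does not match."""
    match = _SNV_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised genomic SNV: {text!r}")
    return GenomicSNV(
        chrom=match["chrom"],
        pos=int(match["pos"]),
        ref=match["ref"],
        alt=match["alt"],
    )
```

Protein-level notation such as p.V600E is rejected by this parser, since translating it to genomic coordinates needs a transcript-aware tool.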

Data quality

Clinical biomarkers

Clinical biomarkers included in PCGR are limited to the following:

  • Evidence items for specific markers in CIViC must be accepted (submitted evidence items are not considered)
  • Markers reported at the variant level (e.g. BRAF p.V600E)
  • Markers reported at the codon level (e.g. KRAS p.G12)
  • Markers reported at the exon level (e.g. KIT exon 11 mutation)
  • Within the Cancer bioMarkers database (CBMDB), only markers collected from FDA/NCCN guidelines, scientific literature, and clinical trials are included (markers collected from conference abstracts etc. are not included)
  • Copy number gains/losses

See also the comment on a closed GitHub issue
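One possible reading of the inclusion criteria above, sketched in Python. The record layout (`database`, `level`, `status`, `origin`) and all constant names are assumptions made for illustration and do not reflect PCGR's actual data model.

```python
# Hypothetical record layout; the field names and values are assumptions.
ALLOWED_LEVELS = {"variant", "codon", "exon", "copy_number"}
ALLOWED_CBMDB_ORIGINS = {"guidelines", "literature", "clinical_trial"}


def keep_biomarker(record: dict) -> bool:
    """Return True if a biomarker record passes the criteria listed above."""
    # Only markers at variant, codon, exon or copy-number level are kept.
    if record["level"] not in ALLOWED_LEVELS:
        return False
    # CIViC: accepted evidence items only (submitted items are dropped).
    if record["database"] == "civic":
        return record["status"] == "accepted"
    # CBMDB: drop markers collected from conference abstracts etc.
    if record["database"] == "cbmdb":
        return record["origin"] in ALLOWED_CBMDB_ORIGINS
    return False
```

For example, a CIViC record with `status` set to `"submitted"` would be excluded, regardless of its resolution level.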

Antineoplastic drugs

  • For drugs extracted from DGIdb, we only include antineoplastic drugs subject to direct interaction with a target (i.e. as recorded in ChEMBL)

Gene-disease associations

  • For gene-disease associations extracted from DisGeNET, we require a score greater than 0.2 and that the association is supported by at least one PMID (PubMed article). Associations involving non-cancer diseases are not included.
  • Cancer phenotype associations retrieved from the Open Targets Platform are largely based on the association score developed by the Open Targets Platform, with a couple of extra post-processing steps:
    • Phenotype associations in Open Targets Platform are assembled from 20 different data sources. Target-disease associations included in PCGR must be supported by at least two distinct sources.
    • The weakest associations, here defined as those with an association score < 0.4 (scale from 0 to 1), are omitted.
    • As is done within the Open Targets Platform, association scores (for genes) are represented with varying shades of blue: the darker the blue, the stronger the association. Variant hits in tier 3/4 and the noncoding section are arranged according to this association score. If several disease subtypes are associated with a gene, the maximum association score is chosen.
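The thresholds described above can be sketched in Python as follows. This is a minimal illustration, not PCGR's implementation: the function names, the dictionary keys (`gene`, `score`, `pmids`, `is_cancer`, `sources`) and the toy values in the usage example are all assumptions.

```python
def filter_disgenet(associations, min_score=0.2):
    """Keep cancer associations with score > min_score and at least one PMID."""
    return [
        a for a in associations
        if a["is_cancer"] and a["score"] > min_score and len(a["pmids"]) >= 1
    ]


def open_targets_gene_scores(associations, min_score=0.4, min_sources=2):
    """Return the maximum retained association score per gene.

    An association is retained when it is backed by at least `min_sources`
    distinct data sources and its score is not below `min_score`.
    """
    best = {}
    for a in associations:
        if len(set(a["sources"])) < min_sources:
            continue  # supported by too few distinct sources
        if a["score"] < min_score:
            continue  # weakest associations are omitted
        gene = a["gene"]
        # Several disease subtypes per gene: keep the maximum score.
        best[gene] = max(best.get(gene, 0.0), a["score"])
    return best


# Toy input with made-up genes, sources and scores.
toy = [
    {"gene": "GENE_A", "score": 0.9, "sources": ["src1", "src2"]},
    {"gene": "GENE_A", "score": 0.6, "sources": ["src1", "src3"]},
    {"gene": "GENE_B", "score": 0.95, "sources": ["src1"]},
    {"gene": "GENE_C", "score": 0.3, "sources": ["src1", "src2", "src3"]},
]
scores = open_targets_gene_scores(toy)
```

Here only GENE_A is retained, with its maximum score across the two subtypes; GENE_B has a single supporting source and GENE_C falls below the score cutoff.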

Tumor suppressor genes/proto-oncogenes

  • For literature-derived predictions of tumor suppressor genes/proto-oncogenes from CancerMine, we require a minimum of four PubMed hits.

TCGA somatic calls

  • TCGA employs four different variant callers for detection of somatic variants (SNVs/InDels): mutect2, varscan2, somaticsniper and muse. In the TCGA dataset bundled with PCGR, somatic SNVs are restricted to those that are detected by at least two independent callers (i.e. calls found by a single algorithm are considered low-confidence and are disregarded).
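The two-caller rule can be illustrated with a short Python sketch. The function name `consensus_calls`, the input layout (caller name mapped to a set of variant tuples) and the toy variants are assumptions for illustration only; this is not the code used to build the bundled TCGA dataset.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` distinct callers.

    `calls_by_caller` maps a caller name to a set of variants, each written
    as a (chrom, pos, ref, alt) tuple. The result maps every retained
    variant to the sorted list of callers that reported it.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy input; the second variant is made up and seen by one caller only.
toy_calls = {
    "mutect2": {("7", 140453136, "A", "T"), ("1", 100, "C", "G")},
    "varscan2": {("7", 140453136, "A", "T")},
    "somaticsniper": set(),
    "muse": {("7", 140453136, "A", "T")},
}
kept = consensus_calls(toy_calls)
```

In this toy input, the variant seen only by mutect2 is dropped, while the variant reported by three callers is kept.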