Clinically reportable structural variant calls
Jean Monlong - Lead, Liaison
Rupesh Kesharwani - Sysadmin and code developer
Pranav Khade - Data Guru and curator
Weiyu Zhou - App Developer
Ahmad Al Khleifat - Data Guru and Writer
- Google Slide live deck
- Local backup: SV-Clinic.odp
Write a workflow and/or app to annotate structural variants (SVs) calls with clinically relevant information.
Eventually, the workflow could call the SVs from sequencing reads (e.g. from a BAM file) and extract some QC information from raw reads.
- Example file on our test data:
test.clinical.sv.csv
The table contains the following columns.
name | description |
---|---|
gene | names of genes overlapped, separated by | |
variant_id | SV ID |
chr | chromosome name |
start | start position |
end | end position |
size | size of the SV in bp |
frequency | allele frequency |
svtype | type of SV. E.g. DEL, DUP, INS, ... |
clinsv | dbVar accession IDs of matching known clinical SVs (separated by | |
clinrk | clinical importance rank, for example to select top 5 SVs |
We will also run a gene set enrichment and highlight SVs in enriched pathways/diseases. This is will represent a set of patient-level metrics. The output will be a set of graphs (image files)
Install R packages. In R:
if(!"easypackages" %in% row.names(installed.packages())){
install.packages("BiocManager", repos = "https://cloud.r-project.org")
library(easypackages, character.only = TRUE, quietly = TRUE)
}
pkgs=c("shiny","tippy","shinythemes", "tidyverse", "tidygraph", "clusterProfiler","org.Hs.eg.db","DOSE","ggnewscale","cowplot","tidyverse","plyr","ReactomePA","reactome.db","reactome.db","KEGG.db","enrichplot","dplyr","GenomicRanges", "rtracklayer", "VariantAnnotation", "jmonlong/sveval")
suppressWarnings(suppressMessages(easypackages::packages(pkgs, prompt = FALSE)))
Rscript R/prepare_annotation_data.R annotation_data.RData ## download and prepare annotations (makes 'annotation_data.RData')
Rscript R/annotate_vcf.R input.vcf annotation_data.RData output.vcf output.csv
For example, for the test data
Rscript R/annotate_vcf.R testdata/test.input.vcf annotation_data.RData testdata/test.clinical.sv.vcf testdata/test.clinical.sv.csv
cd demo_output
Rscript ../R/geneFunctionalAnnotation.R listofENSEMBLID.txt 0.05 EMSEMBL
cd R/
Rscript ../R/GeneAnnotationFromCSV.R -h
USAGE: Rscript GeneAnnotationFromCSV.R <CSV output from annotate_vcf.R> <pvalueCutoff> <svtype> <chrom>
Example: Rscript ../R/GeneAnnotationFromCSV.R ../testdata/test.output.csv 0.1 DUP chr12
Two ways to use our tools:
- A command-line workflow that can do
- SV discovery
- Annotation
- Automated QC graphs of supporting reads
- An interactive app to annotate a VCF with SV calls and visualize the results.
- Building on GeneVar but now the user can upload their own VCF
If needed SVs can be called using parliament2. We will provide commands to run this variant discovery and integrate it to the workflow
The different modules of the annotation are written as functions and saved in separate files.
Then the master annotation script can read a VCF, source these functions and use them to annotate the SVs.
See the current master annotation script annotate_vcf.R
and the different scripts sourced inside.
The server.R
file of the app can also use the same approach: source the same functions and use them to annotate a VCF before displaying the info in the app.
The resources used in the modules are downloaded and prepared by the prepare_annotation_data.R
.
-
annotate_genes.R
- Gencode v35
- Filter: keeps only SVs overlapping CDS regions.
- Uses a "wide" format: new field GENE lists all genes overlapped separated by
|
.
-
annotate_frequency.R
- Uses the position and the SVLEN, SVTYPE fields to match SVs in the gnomAD-SV catalog (i.e. make sure the SV calls contains these fields).
- Lenient matching criteria: 10% minimum reciprocal overlap; insertions can be as distant as 100 bp.
- New field AF reports the maximum frequency across all the SVs matched in gnomAD-SV.
-
annotate_known_clinical_SVs.R
- Known clinical SVs: dbVar nstd102 study
-
annotate_clinical_score.R
- ranks variants based on custom score
- a SV ranks higher if:
- matches known clinical SVs
- impact gene with predicted loss-of-function intolerance (pLI>.9)
- impact gene from OMIM
- break ties using the number of matched known clinical SVs, then number of genes affected
-
geneFunctionalAnnotation.R
- Need a list of genes similar to
demo/listofENSEMBLID.txt
(Other than ENTREZ Gene ID; such as SYMBOL / REFSEQ / ENSEMBL) to output a three types of Disease Ontology plots such as(demo/geneAnnotation.png)
including a barplot (high level catogory), a dot plot (show upto 20 diseases association) and a disease-gene network graph. - The list of genes can be extracted from annotated vcf based on any SV types.
- Need a list of genes similar to
- Known SVs in cancer from COSMIC?
We could show the reads around a SV as a quality control if the BAM is available. Either invent our own graph (for example in python), or run an existing tool like samplot.
A 1000GP sample was SV called in python_scripts
using parliament2.
For testing, we inject a SV that looks a bit like a known deletion in familial ovarian cancer.
grep "#" python_scripts/NA19461.final.manta.diploidSV.vcf > testdata/test.input.vcf
grep -v "#" python_scripts/NA19461.final.manta.diploidSV.vcf | awk '{if($7=="PASS"){print $0}}' >> testdata/test.input.vcf
cat testdata/add-for-test.vcf >> testdata/test.input.vcf