CoronaSV


Hackathon team: Daniel Agustinho, Daniela Soto, Max Marin, Shangzhe Zhang, Todd Treangen, Yunxi Liu, Arda Soylev

Final Presentation

Awesome Final Presentation

What's the problem?

Deletions have been reported in several SARS-CoV-2 genomes, primarily detected at the consensus/assembly level. The confidence with which these deletions are detected has not yet been thoroughly evaluated. Existing methods for detecting structural variation at the individual-read level often suffer from false-positive calls, and analyses with different SV-calling pipelines often produce inconsistent calls. In order to examine the landscape and extent of structural variation across SARS-CoV-2 genomes, a method for generating accurate and trustworthy SV calls is needed. With this in mind, we developed the CoronaSV bioinformatics pipeline.

CoronaSV is a structural variation detection and validation pipeline for SARS-CoV-2 that combines an ensemble of structural variant calling approaches using both long-read and short-read sequencing technologies. Both assembly-based and read-based structural variant detection methods are used by CoronaSV. By combining different sequencing technologies and variant detection approaches, we can identify both a) confident SV calls and b) artifacts that may result from specific technologies + computational approaches.

What is CoronaSV?

CoronaSV is an SV detection and validation pipeline for SARS-CoV-2 sequencing data (Illumina paired-end and Oxford Nanopore long-read sequencing). CoronaSV takes both short- and long-read datasets as input, followed by a quality-control step that performs quality trimming and removal of technical (adapter) sequences. CoronaSV incorporates both reference-guided and de novo assembly approaches, and makes high-confidence SV calls by combining results from multiple state-of-the-art SV callers using SURVIVOR.

Installation of CoronaSV

All of the software packages used by CoronaSV can be installed via the Conda package manager. Additionally, the CoronaSV workflow is defined using Snakemake. Running the CoronaSV_V1.smk Snakemake pipeline handles downloading all specified data and processing the sequencing data into variant calls.

Installing CoronaSV from GitHub (using conda)

Clone the CoronaSV Github repository, and then use conda to create an environment with all needed software. This Conda environment includes all of the core software used by the pipeline + the Snakemake workflow management system.

# Clone Git Repo
git clone https://github.com/collaborativebioinformatics/coronasv.git

cd ./coronasv/

# Create an environment for CoronaSV
conda env create -f ./Envs/CoronaSV_V1.yml  -n CoronaSV

# Activate CoronaSV environment
conda activate CoronaSV

That's it! You should now have the CoronaSV environment activated.

A quick example of how to run CoronaSV (using Snakemake)

The example below runs CoronaSV on all SRA run accessions defined in the metadata TSV file.

In this case the metadata TSV is defined as './Metadata_TSVs/CoronaSV_metadata_TestSubset_1_Nanopore_1_Illumina.tsv', which contains 1 Nanopore sequencing run and 1 Illumina sequencing run of a SARS-CoV-2 isolate.

conda activate CoronaSV

# Enter "coronasv" git repository directory
cd ./coronasv/

# Define configuration files
input_ConfigFile="./SMK_config_V1.txt"

input_SampleInfo_TSV="./Metadata_TSVs/CoronaSV_metadata_TestSubset_1_Nanopore_1_Illumina.tsv"


# DEFINE the output directory of the CoronaSV pipeline

target_Output_Dir="../CoronaSV_Analysis_TestSubset1_OutputDir"

mkdir ${target_Output_Dir}

# Run 
snakemake -s CoronaSV_V1.smk --config output_dir=${target_Output_Dir} inputSampleData_TSV=${input_SampleInfo_TSV} --configfile ${input_ConfigFile} -p --use-conda --cores 4 

If you would like to run CoronaSV on all samples identified in our metadata file, change the definition of the 'input_SampleInfo_TSV' bash variable to point at the full metadata table (201013_CoronaSV_Metadata_V1.tsv):

input_SampleInfo_TSV="./Metadata_TSVs/201013_CoronaSV_Metadata_V1.tsv"

Overview of available data

Sequencing data from multiple sequencing technologies and library-prep strategies are available. A summary of our initial dataset can be found in 201013_CoronaSV_Metadata_V1.tsv.

CoronaSV Pipeline Overview

workflow

Data download

All sequencing read data is queried and downloaded from SRA using the SRA Toolkit.
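
The pipeline drives these downloads through Snakemake, but a minimal stand-alone sketch with the SRA Toolkit looks like the following (the accession is a placeholder, not one of the pipeline's samples):

# Placeholder run accession; substitute any accession from the metadata TSV
acc="SRRXXXXXXX"

# Download the .sra archive, then convert it to FASTQ (paired runs are split into _1/_2 files)
prefetch "${acc}"
fasterq-dump "${acc}" --split-files -O ./fastq/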

SV calling from short-reads

A) Filtering

A.1) Trimmomatic was used to remove adapters and low-quality bases from short reads (see the combined sketch after these steps).

A.2) After mapping (step B), PCR duplicates were removed using Picard MarkDuplicates.
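
A rough sketch of both filtering steps; file names, the adapter FASTA, and the Trimmomatic thresholds are illustrative rather than the pipeline's exact settings:

# A.1: adapter and quality trimming of paired-end reads with Trimmomatic
trimmomatic PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
    sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

# A.2: after mapping (step B), remove PCR duplicates from the sorted BAM
picard MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam \
    M=sample.dup_metrics.txt REMOVE_DUPLICATES=true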

B) Mapping

Short reads were mapped to the SARS-CoV-2 reference genome using bwa mem.
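
A minimal mapping sketch (reference and sample file names are illustrative):

# Index the reference once, then map, coordinate-sort and index the alignments
bwa index reference.fasta
bwa mem -t 4 reference.fasta sample_R1.paired.fq.gz sample_R2.paired.fq.gz \
    | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam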

C) SV calling from short-reads

C.1) Manta

Manta identifies deletions, duplications, inversions or translocations in paired-end short-read sequencing using paired-end and split-read mapping information.
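
An illustrative invocation, assuming Manta's two-step configure/run interface and the file names from the steps above:

# Configure a Manta run for one sample, then execute the generated workflow
configManta.py --bam sample.dedup.bam --referenceFasta reference.fasta --runDir manta_out
./manta_out/runWorkflow.py -j 4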

C.2) Delly

Delly uses a combination of paired-ends, split-reads and read-depth signatures to detect deletions, tandem duplications and translocations at single-nucleotide resolution.
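
A sketch of a single-sample Delly call (file names assumed from the steps above):

# Call SVs with Delly, then convert the BCF output to VCF for downstream merging
delly call -g reference.fasta -o sample.delly.bcf sample.dedup.bam
bcftools view sample.delly.bcf > sample.delly.vcf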

C.3) Lumpy

Lumpy integrates multiple SV signals (read-pair, split-read, read-depth) to identify deletions, tandem duplications, inversions and translocations in short-read sequencing.
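
A minimal sketch using the lumpyexpress wrapper (file names illustrative; the wrapper is assumed to extract split and discordant reads itself when only a BAM is supplied):

# Run LUMPY via the lumpyexpress wrapper on the deduplicated BAM
lumpyexpress -B sample.dedup.bam -o sample.lumpy.vcf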

C.4) Tardis

Tardis uses multiple SV signatures such as read-pair, read-depth and split-read to discover various SV types using paired-end Illumina data. These include deletions, inversions, MEIs, tandem and interspersed duplications in forward and reverse orientations.
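
A sketch of a TARDIS invocation; the SONIC annotation file for the SARS-CoV-2 reference is assumed to have been built beforehand, and flags follow the TARDIS documentation:

# Call SVs with TARDIS using the SONIC file that matches the reference
tardis -i sample.dedup.bam --ref reference.fasta --sonic reference.sonic --out sample_tardis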

SV calling from de novo assemblies

A) Assembly Software

A.1) Unicycler

Unicycler will be used to assemble Illumina (short-read) data into a consensus genome sequence. Unicycler uses the SPAdes assembler internally with additional optimization steps.
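
A minimal assembly sketch using the trimmed paired-end reads (file names illustrative):

# Assemble the short reads; the consensus is written to assembly_out/assembly.fasta
unicycler -1 sample_R1.paired.fq.gz -2 sample_R2.paired.fq.gz -o assembly_out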

B) Structural Variant Calling (Assembly alignment to reference)

B.1) NucDiff

NucDiff will align each assembly to the reference genome and detect SVs based on this assembly-to-reference alignment.
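
A sketch of a per-sample NucDiff comparison (output directory and prefix are illustrative):

# Compare one assembly against the reference; results are written to nucdiff_out/ with the given prefix
nucdiff reference.fasta assembly_out/assembly.fasta nucdiff_out sample1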

B.2) SVanalyzer - SVrefine

The SVrefine module of the SVanalyzer package will be used to call and describe SVs relative to the reference, using the NUCmer alignments generated by NucDiff as input.

B.3) Minimap2

Minimap2 was used to align the completed short-read assemblies to the reference genome. The paftools.js utility bundled with minimap2 was used to produce variant calls in VCF format from these alignments.
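
A sketch of the assembly-to-reference calling step, following the recipe from the minimap2 documentation (file names illustrative):

# Align the assembly with CIGAR and cs tags, sort by reference position, then call variants
minimap2 -cx asm5 --cs reference.fasta assembly_out/assembly.fasta \
    | sort -k6,6 -k8,8n \
    | paftools.js call -f reference.fasta - > sample.asm_variants.vcf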

SV calling from long-reads

A) Filtering

A.1) NanoPlot was used to assess the quality of the long-read sequences (see the sketch after these steps).

A.2) NanoFilt was used to filter long reads on quality and/or read length, and optionally trim reads after they pass the filters.
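
A sketch of both long-read QC steps; the quality and length thresholds shown are illustrative, not the pipeline's exact settings:

# A.1: QC report for the raw long reads
NanoPlot --fastq ont_reads.fastq.gz -o nanoplot_out

# A.2: keep reads with mean quality >= 7 and length >= 200 bp
gunzip -c ont_reads.fastq.gz | NanoFilt -q 7 -l 200 | gzip > ont_reads.filtered.fastq.gz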

B) Mapping

Long reads were mapped to the SARS-CoV-2 reference genome using minimap2.
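
A minimal long-read mapping sketch (file names illustrative):

# Map ONT reads; --MD adds the MD tag that some downstream SV callers expect
minimap2 -ax map-ont --MD reference.fasta ont_reads.filtered.fastq.gz \
    | samtools sort -o ont.sorted.bam -
samtools index ont.sorted.bam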

C) SV calling for long-reads

C.1) Sniffles

Sniffles detects all types of SVs (10bp+) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.
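
An illustrative Sniffles v1-style invocation; newer Sniffles releases use --input/--vcf instead, so adjust to the installed version:

# Call SVs from the sorted ONT alignments
sniffles -m ont.sorted.bam -v ont.sniffles.vcf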

C.2) SVIM

SVIM is able to detect, classify and genotype five different classes of structural variants. It integrates information from across the genome to precisely distinguish similar events, such as tandem and interspersed duplications and simple insertions.
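
An illustrative invocation of SVIM's alignment mode (working directory and file names assumed):

# Call SVs directly from the sorted, indexed BAM; results land in svim_out/
svim alignment svim_out ont.sorted.bam reference.fasta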

C.3) cuteSV

cuteSV uses tailored methods to collect the signatures of various types of SVs and employs a clustering-and-refinement method to analyze the signatures to implement sensitive SV detection.
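
An illustrative cuteSV call (file names and working directory assumed):

# Positional arguments: sorted BAM, reference FASTA, output VCF, working directory
cuteSV ont.sorted.bam reference.fasta ont.cutesv.vcf cutesv_workdir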

SV datasets integration

SURVIVOR is a tool set for simulating/evaluating SVs, merging and comparing SVs within and among samples, and includes various methods to reformat or summarize SVs.

SV callsets were compared and integrated using SURVIVOR.

First, to detect SVs consistently within each of the three methods (short reads, long reads, or assembly), SURVIVOR is used to keep SVs called by at least two different tools. This produces three VCF files for each SV type analyzed (deletion, duplication, or inversion), one per method.

Next, SURVIVOR is run again across the methods to generate one final file per sample for each SV type.

SURVIVOR usage: ./SURVIVOR merge 200 2 1 0 0 1 VCFfiles

To determine SVs consistently found in the population, we ran SURVIVOR across all samples, keeping SVs present in a majority of the samples (x = (n/2) + 1, where n is the number of samples). The same procedure was used for each SV type (see the paragraph above).

SURVIVOR usage: ./SURVIVOR merge 200 x 1 0 0 1 VCFfiles
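
An annotated sketch of the merge step: SURVIVOR's documented interface takes a text file listing the input VCFs as its first argument and the merged VCF as its last, with the numeric parameters in between mirroring the usage lines above; file names here are illustrative.

# One VCF path per line, e.g. the per-caller call sets being merged
ls short_read_callers/*.vcf > vcf_list.txt

# Merge calls within 200 bp supported by >= 2 inputs, requiring the same SV type;
# strand is ignored, distance is not size-scaled, and SVs >= 1 bp are kept
SURVIVOR merge vcf_list.txt 200 2 1 0 0 1 merged_short_read.vcf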

Results

SV distribution of long-read data

long-reads

SV distribution of short-read data

short-reads
