The code organized in this repo recapitulates the computational steps for the assembly and curation of the Viral Sequence Clusters (VSCs).
The code provided here is not optimized for universal use and is released for information and reproducibility puroposes only. This code cannot be run "as is" and needs to be adapted to your storage and computational architecture to be used.
The entry point of the pipeline is a set of viromes (Viral-Like-Particle enrichednriched) that are highly enriched for viruses. After several steps of clustering and annotation, we release a set of ~45k refined and de-replicated sequences from viromes and metagenomes.
The final set of representative sequences is available for future studies here. The raw set of non-dereplicated sequences is also available at the same location.
Viromes with a ViromeQC enrichment score > 50 are assembled into contigs and screened with this pipeline (asemble_sample.sh
)
- Reads are cleaned to remove low quality reads with Trim-Galore
- Human hg19 removal (Bowtie2)
- split-and-sort.py is used to recover read-pairs and put them in the correct order
- Metagenomic assembly performed with (contigs longer than 500 bp are kept)
- Spades (if the original reads are paired-end)
- Megahit (if the original reads are single-end)
- Contigs are fed into Prokka with
--kingdom Viruses
We use launch_assembly_on_dataset.sh
to submit jobs to the HPC cluster.
In this step, the assembled contigs are mapped against a list of interesting targets (e.g. known viruses, viral proteins...) and unwanted targets (e.g. bacterial genomes, MAGs)
The blast_contig.sh
script organize the mappings of each reconstructed contig against:
- NCBI80k (List of 80,000 reference genomes from NCBI)
- SGBs (The database of 154k genomes from Pasolli et al., 2019 plus all the unbinned contigs not released in the original study)
- Viral Genomes (from RefSeq release 91)
Each reconstructed contig is mapped with hmm_contig.sh
against the following models:
A set of three scripts (do_bread.sh
, bread.py
, bread_unify.py
) analyzes the BLAST outputs against multiple databases and, for each contig, report the best match for each db.
For each contig and each database (i.e. NCBI80k, SGBs and RefSeq_Viral_Genomes), the script reports:
- The best hit
label
,perc. of identity
,bitscore
andalignment length
- The total
breadth of coverage
against that database - The best hit is the one with the highest bitscore
- BLAST hits are filtered for identity (> 80%) and length (> 1000)
- The overall breadth of coverage is calculated by merging together all the hits that align on the contig. The script hence reports the overall breadth.
This step of the pipeline uses analyze_contigs.py
and analyze_contigs_merge.py
to process the contigs assembled from viromes and to structure the data available in different files for each contig.
analyze_contigs.py
can be parallelized on multiple FASTA files to speed up the process. analyze_contigs_merge.py
unifies the data from different runs of analyze_contigs.py
.
The resulting output is a (filtered) selection of all the assembled contigs, merged with:
- The metadata for each sample
- Information on the best hit in each of the bacterial/viral databases (RefSeq, SGBs etc)
- a set of statistics on the prevalence of each contig across metagenomes and virome datasets (including the >9,000 metagenoms described in Pasolli et al., 2019, and the screened viromes described in Zolfo et al, 2019)
Finally, extract_contigs_from_vdb_report.py
reads the filtered contigs list and extracts the associated sequences in FASTA
format.
This steps clusters the filtered contis, then runs a BLAST search on assembled metagenomes and viromes to retrieve homologues in there. The clustering is then performed again to produce multiple sequence alignments.
- Contigs from highly-enriched viromes are clustered with vsearch at 90% identity (i.e. high enrinchment contigs)
- Contigs are then mapped against metagenomes (n~=9000) and viromes (n~=3100) to find homologous sequences, producing an extended contigs repertoire (i.e. extended contigs)
- The extended contigs are then compared to the centroids of the initial clustering with mash. Contigs with a distance < 10% are kept.
- A second clustering is performed internally within each cluster, by adding the extended contigs from metagenomes and viromes to each cluster. Contigs that still fall in the same cluster of the high enrinchment contigs are kept. (i.e. the final clusters)
- Within each final cluster, only sequences with a length +/- 20% of the median length of thee cluster go through MSA and a phylogenetic tree is produced.
- Within each final cluster, sequences are selected (a 3rd clustering is performed at 95%, only centroids are kept) and taken as VSCs representatives.
Internal Scripts to launch jobs on the PBS HPC cluster of the University of Trento are named *_launcher
. These scripts gradually submit jobs to the HPC cluster and tack the execution of each job.
examples are:
vdb_contig_filtering/hpc_launchers/vdb_build_launcher.sh
vdb_contig_filtering/hpc_launchers/vdb_build_launch.sh
vdb_assembly/launch_assembly_on_dataset.sh
Edits FASTA and FASTQ files to modify Sequences IDs (e.g. to propagate metadata and information through the various step of the pipeline)
Creates a separate FASTA file for each entry of a bigger FASTA file
Splits reads into R1
and R2
files, then checks intact read-pairs and puts singletons into UP
. This ensures that all the quality-filtered reads are used to produce an assembly, and that unpaired reads are not discarded.