Skip to content

Latest commit

 

History

History
732 lines (474 loc) · 40.2 KB

README.md

File metadata and controls

732 lines (474 loc) · 40.2 KB

alt text

DNAscanv2

+TO BE NOTED: We are always working to improve DNAscan so any bug reports, suggestions and general feedback would be highly welcome. 

Table of Contents

  1. Introduction
  2. Citation
  3. Documentation
  4. Core Contributors
  5. Contributing
  6. Licence

Introduction

DNAscan2 is a fast and efficient bioinformatics pipeline that allows for the analysis of DNA Next Generation Sequencing (NGS) data, requiring very little computational effort and memory usage. DNAscan2 can analyse 40x whole genome NGS data using as little as 4 cpus and 16 Gbs of RAM while guaranteeing a very high performance. We do this by exploiting state-of-the-art bioinformatics tools. DNAscan can screen your DNA NGS data for single nucleotide variants, small indels, structural variants, transposable elements, known and non-reference repeat expansions and viral (or any other organism’s) genetic material. Its results are annotated using a wide range of databases including refGene, ClinVar, EXAC, dbSNP, CADD, and gnomAD and 1000g for variant frequency estimation, and can be uploaded onto the gene.iobio platform for an on-the-fly analysis/interpretation or visualised via several reports generated by DNAscan2.

alt text

Figure 1. Central panel: Pipeline overview. DNAscan2 accepts sequencing data and variant files in various formats. The alignment step aligns raw genomic reads with HISAT2 for the identification of all variant types. Resulting soft/hard-clipped reads and unaligned reads are realigned with BWA mem and merged with the HISAT2 reads using samtools if the user specifies that they want to call structural variants, transposable elements or short tandem repeats. This is followed by a customizable data analysis protocol (details in the left panel). Finally, results are annotated and a user friendly report is generated (details in the right panel). Left panel: detailed description of the post alignment analysis pipeline. Aligned reads are used by the SNV and indel variant calling pipeline (Strelka), and both aligned and unaligned reads are used by Manta, Delly, ExpansionHunter (for which known repeat description files have to be provided) and ExpansionHunter Denovo to detect and genotype structural variants (SV), and by MELT for transposable elements (MEI) and human endogenous retrovirus detection (hg38 only). SV and MEI results are merged using SURVIVOR to allow for easier downstream reporting. Steps with red dotted lines (Delly SV calling and ExpansionHunter genotyping of short tandem repeat loci) can be removed if the -fast_mode flag is set, in order to increase computational efficiency for those who have limited computational resources. Left panel: Annotation and report generation description. SNV/indel variants and both known and novel repeat expansions are annotated via ANNOVAR and its respective default or user-defined databases before a results report is created. SV and MEI variants are annotated with AnnotSV and a html report is generated by knotAnnotSV. An additional report is generated which details all called and annotated variants. Sequencing and alignment reports are also created with FastQC and samtools flagstat + samtools/bcftools stats. As before, the unaligned reads can also be mapped to a database of known viral genomes (NCBI database) to screen for the presence of their genetic material in the input sequencing data. Custom microbes and bacteria can also be analysed.

Citation

Original DNAscan publication: Alfredo Iacoangeli et al. DNAscan: personal computer compatible NGS analysis, annotation and visualisation. BMC Bioinformatics, 2019

DNAscan2: Coming soon...

Documentation

Minimum requirements

  • Ubuntu >= 14.04.
  • RAM: if processing WGS data, at least 14Gb.
  • Space required by the installation:
    • By the dependencies and DNAscan2: we recommend at least 15Gb of free space for the whole DNAscan2 deployment, including the dependencies.
    • 8Gb for the reference human genome and the hisat2 index. An extra 5.5Gb for the bwa index are also required.
    • The annotation step makes use of Annovar and selected databases, the size of which can range from tens of megabytes to hundreds of gigabytes (e.g. CADD database). If the user wishes to perform the annotation step they must take this into account. Without the CADD database, the other default annovar databases require ~ 90Gb free space, and if you wish to perform SV and MEI annotation, AnnotSV human annotations occupy ~ 4Gb.
  • Scratch space for usage: If you are performing the alignment stage and your input data is in fastq.gz format we recommend using at least 3 times the size of your input data. E.g. your fastq.gz files are 100Gb, thus you would need 300Gb of free space. If you don't wish to perform the alignment, a proportion of data-to-analyse:free-space of 1:1 would be enough. E.g. Input data is a 50Gb bam file, you would need only 50Gb of free space.
  • Space for usage: If performing transposable element detection, MELT requires an additional 20GB overhead storage.

Obtaining

Version: 2.0

Please make sure all dependencies are installed before running DNAscan2. Instructions on how to install all dependencies are available in the following chapter. However, a bash script to set up all dependencies (Annovar and MELT need a manual registration and download step) is available in scripts.

Local Deployment

To obtain DNAscan2 please use git to download the most recent development tree:

git clone https://github.com/hevmarriott/DNAscanv2.git

Once you have downloaded DNAscan2, you can set up all needed dependencies for either hg19 or hg38 by running the install_dependencies_(hg19/hg38).sh script available in DNAscanv2/scripts. Before running install_dependencies.sh please download Annovar by registering at the following link and MELTv2.2.2 by registering at the following link. Install_dependencies.sh will install all software dependencies, the selected reference genome and its hisat2 and bwa indexes (these jobs run in the background and will finish after the script ends (approximately 1 hour) and update paths_configs.py accordingly. The only requirement before running DNAscan2 is to edit paths_configs.py with the paths of the newly installed software dependencies.

bash scripts/install_dependencies_(hg19/hg38).sh /path/to/set_up/directory/ /path/to/DNAscanv2/directory/ /path/to/annovar.tar.gz /path/to/MELTv2.2.2.tar.gz $num_threads

source ~/.bashrc

NOTE: If you want to perform annotation with Annovar, the CADD database takes up ~50GB of storage - please comment this line out before running the install script if it is not required.

Obtain with Docker

The easiest way to get started with DNAscan2 is to use its Docker image, which has all dependencies and the reference genome build and BWA/Hisat2 indexes installed, except for Annovar and MELTv2.2.2. These are available from Docker Hub (hevmarriott/dnascan2_hg19 and hevmarriott/dnascan2_hg38).

IMPORTANT: if you want to use the mobile element insertion or annotation step of DNAscan2 you need to register and download both Annovar and MELTv2.2.2. Remember to mirror into the container the folder where you have deployed these using the -v flag or copy them inside the container using the docker cp command as described below.

sudo docker run  -v /path/to/your/data_folder:/container/path/where/you/want/your/data --storage-opt size=50G  -it hevmarriott/dnascan2_hg19 OR hevmarriott/dnascan2_hg38 /bin/bash

Please set the needed container size taking into account the "Minimum requirements".

The -v option adds your data folder (assuming you have some data to run DNAscan2 on). The --storage-opt size=Ngigas option defines the maximum size of the container, setting it to N gigabytes. This number would depend on you plans. If you want to perform annotation, the databases used by DNAscan (clinvar,CADD,etc) are about 350G. We recommend N = 500 if you want to install the whole pipeline (including annotation). To this number you should add what you need for your analysis, e.g. if you are planning to download data, the size of your data to analyse etc. A way to workaround this is to use the mirrored host folder as outdir for your analysis and as annovar folder. This folder does not have a size limit.

IMPORTANT: To detach from the container without stopping it, use Ctrl+p, Ctrl+q. IMPORTANT: When running DNAscan2 inside a docker container, if you want to use the iobio services (and for example upload your results into the gene.iobio platform), these would not be visible by your browser unless they are in the folder which is mirrored on the host system. Considering the previous command to run an ubuntu image, the easiest way to do this would be to use the folder where you imported the data inside the container (/container/path/where/you/want/your/data) as an outdir when running DNAscan. In this way the DNAscan2 results can be found in /path/to/your/data_folder on the host system.

If you want to add data to the container while this is already running you can use the docker cp command. First detach from the container without stopping it using Ctrl+p, Ctrl+q, then cp your data inside the container:

docker cp [OPTIONS] /path/to/your/data CONTAINER_ID:/container/path/where/you/want/your/data

and execute a bash shell inside your container:

docker exec -it CONTAINER_ID /bin/bash

The container ID can be found using the ps command:

docker ps 

Building a Docker container from scratch

You can easily set up your running deployment of DNAscan2 using docker. For instructions about how to install docker see following.

After installing docker run an Ubuntu image:

docker run -v /path/to/your/data_folder:/container/path/where/you/want/your/data  -it [--storage-opt size=500G] ubuntu /bin/bash 

Before you do this, make sure that you have registered and downloaded Annovar and MELTv2.2.2 if you want to run mobile element insertion detection and perform annotation.

Then install git, download this repository and run the install_dependencies_(hg19/hg38).sh script, which downloads the selected reference genome build and BWA/Hisat2 indexes as well as the Annovar databases:

sudo apt-get update

sudo apt-get install git

git clone https://github.com/hevmarriott/DNAscanv2.git

cd DNAscanv2

#By default install_dependencies.sh downloads the following Annovar databases: Exac, Refgene, Dbnsfp, Clinvar, Avsnp, Intervar, 1000g and gnomad211. If you wish to download the CADD database (about 350G) please uncomment the appropiate line. If you are not interested in performing annotation, please # the annovar lines in the install_dependencies.sh script. 

bash scripts/install_dependencies_(hg19/hg38).sh /path/to/set_up/directory/ /path/to/DNAscanv2/directory/ /path/to/annovar.tar.gz /path/to/MELTv2.2.2.tar.gz $num_threads

source ~/.bashrc

Usage

IMPORTANT: DNAscan.py is the main script performing the analyses. It must be in the same folder as paths_configs.py. Before running DNAscan2 please modify paths_configs.py to match your dependencies deployment. IMPORTANT2: All paths in DNAscan2 end with "/"

Its basic use requires the following options:

  -format FORMAT        options are bam, sam, cram, fastq, vcf [string] 
  -reference REFERENCE  options are hg19, hg38, grch37 and grch38 [string]
  -in INPUT_FILE        input file [string]
  -in2 INPUT_FILE       second input file (for paired end reads in fastq format only) [string]
  -out OUT              path to the output folder. It has to end in /" e.g. /home/user/local/test_folder/

The desired pipeline stages are performed according to the optional arguments selected:

 -filter_string	bcftools filter string for strelka, eg GQ>20 & DP>10 (Default = ""FORMAT/FT == "PASS" && FORMAT/DP > 10 && MQ > 40 && GQ > 20 && ID/SB < 2 && ADF > 0 && ADR > 0")
 -iobio                if this flag is set the iobio service links will be provided at the end of the analysis (Default = "False")
 -alignment            if this flag is set the alignment stage will be performed (Default = "False")
 -expansion            if this flag is set DNAscan2 will look for the expansions described in the json folder described in paths_configs.py with ExpansionHunter and non-reference/novel tandem repeats with ExpansionHunter Denovo (Default = "False"). 
 -STR			if this flag is set DNAscan2 will look for genome-wide novel/non-reference tandem repeats with ExpansionHunter Denovo  (Default = "False").		
 -genotypeSTR		if this flag is set in addition to STR, DNAscan will genotype the identified STR loci with ExpansionHunter (Default = "False").
 -SV                   if this flag is set the structural variant calling stage will be performed with Manta and Delly (Default = "False") 
 -MEI			if this flag is set, mobile element insertion/transposable element calling will be performed with MELT
 -virus                if this flag is set DNAscan2 will perform viral scanning (Default = "False")  
 -bacteria             if this flag is set DNAscan2 will perform bacteria scanning (Default = "False") 
 -custom_microbes      if this flag is set DNAscan2 will perform a customized microbe scanning according to the provided microbe data base in paths_configs.py (Default = "False")  
 -variantcalling       if this flag is set DNAscan2 will perform snv and indel calling (Default = "False")  
 -annotation           if this flag is set DNAscan2 will annotate the found SNV and indel variants with Annovar and AnnotSV for SV and MEI variants (Default = "False")  
 -results_report       if this flag is set DNAscan2 will generate a results report describing the annotated variants (Default = "False")  
 -alignment_report     if this flag is set DNAscan2 will generate an alignment report (Default = "False") 
 -sequencing_report    if this flag is set DNAscan2 will generate a report describing the input sequencing data (Default = "False") 
 -calls_report         if this flag is set DNAscan2 will generate a report describing the found snvs and indels (Default = "False")
 -rm_dup               if this flag is set DNAscan2 will remove duplicates from the sample bam file (Default = "False")
 -fast_mode            if this flag is set DNAscan2 will run without SV calling with Delly and the genotyping of STR loci identified with ExpansionHunter Denovo if those flags are set (Default = "True")

If only the --variantcalling flag is selected, Hisat2 and Strelka is used to quickly align and call variants, as it is ideal for detection of single nucleotide variants. If your analysis is focused on structural variants and transposable elements, BWA is addiitonally utilised to refine Hisat2 alignment on selected reads. This step improves the alignment of soft-clipped reads and reads containing small insertion and deletion variants. Structural variant calling with Delly and mobile element inserion with MELT can now be performed. Additionally, ExpansionHunter Denovo can be used to identify non-catalogue genome-wide short tandem repeat loci, with the added option to genotype these loci using ExpansionHunter (recommended only for higher spec computers/high performance computing systems if analysing whole genome sequencing data). For users with limited RAM and/or time constraints, we have introduced fast mode to allow structural variant and genome-wide short tandem repeat loci detection to still take place, without Delly (which is very time intensive (~24h per genome)) and genotyping of ExpansionHunter Denovo identified loci (requires a lot of RAM if performed on a genome-wide basis). This option is set to true by default.

Finally, a set of optional arguments can be used to customise the analysis:

-RG                   if this flag is set the alignment stage will add the provided in paths_configs.py read group (Default = "False")
-paired               options are 1 for paired end reads and 0 for single end reads (Default = "1")
-vcf	              complementary vcf file
-in2	    input file 2, for paired end reads only (usually fastq file)
-BED                  restrict the analysis to the regions in the bed file (Default = "False") 
-sample_name	      specify sample name [string] (default = "sample")
-debug		      if this flag is set DNAscan2 will not delete intermediate and temporary files (Default = "False")

How to make the read mappers and variant callers/annotators use custom options

The file paths_configs.py in DNAscanv2/scripts can be used for this purpose. In paths_configs.py there is a section where a string of options can be passed to the read mappers (Hisat2 and BWA) variant callers/annotators (MELT and AnnotSV):

#custom tool options

hisat_custom_options = ""

bwa_custom_options = ""

melt_custom_options = ""

annotsv_custom_options = ""

For example, if we know the coverage and read length of our samples (40x 150bp), we can tell MELT to skip coverage calculation step with:

melt_custom_options = "-cov 40 -r 150"

For a complete list of options that can be used with these tools please see the following links:

Hisat2 BWA MELT AnnotSV

Usage example

Let's assume we have human paired end whole exome sequencing data in two fastq files and want to perform snv/indels calling vs hg19, annotation and explore the results using the iobio services. The DNAscan command line would be:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -annotation -iobio -out /path/to/outdir/ 

If we perform the same analysis but on custom regions of the genome using the bed file provided in the data folder (after adding "data/test_data.bed") to the paths_configs file:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data/test_data.1.fq.gz -in2 data/test_data.2.fq.gz -reference hg19 -alignment -variantcalling -annotation -iobio -out outdir/ -BED

IMPORTANT: All paths in DNAscan end with "/"

Applying filters to the called variants

DNAscan uses bcftools to filter the variants. It only selects variants for which the expression provided is true. A few examples follow. Complete details about how to write an expression can be found in the bcftools manual LINK.

Filtering calls for which genotype quality is > then 30:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -annotation -iobio -out /path/to/outdir/  -filter_string "GQ>30"

Filtering calls for which genotype quality is > then 30 and depth >5:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -annotation -iobio -out /path/to/outdir/  -filter_string "GQ>30 && DP>5"

Looking for repeat expansions

Let's assume we have human paired end whole exome sequencing data in two fastq files and want to perform SNV/indel calling vs hg19 and scan for some specific and novel repeat expansions. The DNAscan2 command line would be:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -expansion -out /path/to/outdir/

Note that for ExpansionHunter, the json repeat-specification files must be specified in paths_configs.py. For a guide on how to create a json repeat-specification file see the this LINK to the ExpansionHunter instructions. Examples for 17 known repeat expansions, as well as ExpansionHunter variant catalogs for both hg19 and hg38 are in the DNAscan/repeats folder.

Running DNAscan2 on a subregion of the human region

Whole exome

You can restrict the analysis to the whole exome only. You just need to add the -exome flag to the command line:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -out /path/to/outdir/ -exome
Any genome regions

You can restrict the analysis to any specific region of the genome by providing DNAscan2 with a bed file. In this case you need to use the -BED flag when running the pipeline and specify the path to the bed file in paths_configs.py

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alignment -variantcalling -out /path/to/outdir/ -BED
List of genes

You can restrict the analysis to a list of genes. In this case you only need to generate a list of genes (see example in DNAscanv2/data) and specify the path to it in paths_configs.py. No flags need to be used in this case and DNAscan2 will read (non-case-sensitive) the list and restrict the analysis to the appropriate genome regions.

cat list_of_genes.txt

FUS
PFN1
SARM1
SCFD1
SOD1

Running DNAscan2 on a list of BAMs/FASTQs

Let's assume we have human paired end whole exome sequencing data in two fastq files per sample and want to perform snv/indel calling vs hg19. In this case we would use the analyse_list_of_samples.py script to run DNAscan2:

cd /path/to/DNAscanv2_main_dir

python3 scripts/analyse_list_of_samples.py -option_string "-format fastq -reference hg19 -alignment -variantcalling" -out_dir /path/to/where/you/want/the/analysis/to/take/place/ -sample_list extras/fastq_sample_list.txt -format fastq -paired 1 

The analyse_list_of_samples.py takes file containing the input file of one sample per line as input (tab separated in case of 2 paired end fastq files per sample), the option you want to pass to DNAscan2 between quotation marks, the input file format and whether or not (1 or 0) the data are paired-end reads. Two sample lists of input files are in DNAscanv2/extras.

ALSgeneScanner

ALSgeneScanner is a pipeline designed for the analysis of NGS data of ALS patients. It perfoms alignment, variant calling, structural variant calling and repeat expansion calling as well as variant annotation using Annovar. It restricts the analysis to a subset of genes (172) which have been shown to be associated with ALS. A complete list of the included genes is available as a Google Spreadsheet. It also prioritizes variants according to the scientific evidence of the gene association and the effect prediction of the variant. To run ALSgeneScanner the user has to use the appropriate flag as in the following example. This can be done using either the hg19 or hg38 reference genome. Further details are described in the ALSgeneScanner paper. ALSgenescanner by default performs SNV/indel, structural variant and expansion calling and annotation according to a BED file of the genomic coordinates of each gene associated with ALS.

Usage example for hg19:

python3 /path/to/DNAscanv2/scripts/DNAscan.py -format fastq -in data1.fq.gz -in2 data2.fq.gz -reference hg19 -alsgenescanner -out /path/to/outdir/

RefGene, ClinVar, dbnsfp33a and Intervar are necessary to run ALSgeneScanner.

Please use the following commands to download the appropiate Annovar databases:

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp33a /path/to/annovar/database/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene /path/to/annovar/database/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar clinvar_20210501 /path/to/annovar/database/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar intervar_20180118 /path/to/annovar/database/

Output

DNAscan2 output tree:

./$out_dir-| # this is the folder given to DNAscan using the -out flag. It will contain the aligned sequencing data ($sample_name.bam) as well as some temporary files
           |
           |-results # This will contain the output of the analyses. E.g. $sample_name_sorted.vcf.gz , $sample_name_manta_SV.vcf.gz, virus_results.txt, etc  
           |
           |-reports # If any report flags are used, this folder will contain the reports. E.g. $sample_name_vcfstats.txt if the -calls_report flag is used
           |
           |-logs # Logs files are generated in this folder
 

How to download the reference genome

DNAscan2 can be used with the following human genome reference builds: hg19, hg38, grch37, grch38. The used build can be specified when running DNAscan by using the option -reference (options are hg19, hg38, grch37 and grch38). However, DNAscan2 uses Annovar to annotate the SNV/indel variants and Annovar is only compatible with hg19 and hg38 at present. Therefore, if the user wants to use grch37 and grch38 the pipeline will skip the annotation step. We are currently working on this and it will be possible to run DNAscan on grch37 and grch38 without limitations.

hg38

mkdir /path/to/wherever/hg38

cd /path/to/wherever/hg38

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

gzip -d hg38.fa.gz

samtools faidx hg38.fa

hg19

mkdir /path/to/wherever/hg19

cd /path/to/wherever/hg19

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

tar -zxvf chromFa.tar.gz

for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrY.fa chrX.fa chrM.fa; do cat $i >> hg19.fa ; rm $i ; done

rm chr*

samtools faidx hg19.fa

grch37

mkdir /path/to/wherever/grch37

cd /path/to/wherever/grch37

for i in {1,2,3,4,5,6,7,8,9}; do wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.3/CHR_0$i/hs_ref_GRCh37.p5_chr$i.fa.gz; zcat hs_ref_GRCh37.p5_chr$i.fa.gz >> grch37.fa ; rm hs_ref_GRCh37.p5_chr$i.fa.gz; done


for i in {10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT}; do wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.3/CHR_$i/hs_ref_GRCh37.p5_chr$i.fa.gz; zcat hs_ref_GRCh37.p5_chr$i.fa.gz >> grch37.fa ; rm hs_ref_GRCh37.p5_chr$i.fa.gz; done

samtools faidx grch37.fa

grch38

mkdir /path/to/wherever/grch38

cd /path/to/wherever/grch38

for i in {1,2,3,4,5,6,7,8,9}; do wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_0$i/hs_ref_GRCh38.p12_chr$i.fa.gz; zcat hs_ref_GRCh38.p12_chr$i.fa.gz >> grch38.fa ; rm hs_ref_GRCh38.p12_chr$i.fa.gz; done


for i in {10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT}; do wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_$i/hs_ref_GRCh38.p12_chr$i.fa.gz; zcat hs_ref_GRCh38.p12_chr$i.fa.gz >> grch38.fa ; rm hs_ref_GRCh38.p12_chr$i.fa.gz; done

samtools faidx grch38.fa

How to download the NCBI microbe DBs

Virus

Copy and paste the following commands in your command line to download the whole NCBI database of complete viral genomes.

mkdir /path/to/wherever/virus_db

cd /path/to/wherever/virus_db

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.1.genomic.fna.gz

gzip -d viral.1.1.genomic.fna.gz

gzip -d viral.2.1.genomic.fna.gz

cat viral.1.1.genomic.fna >> virus_db.fa

cat viral.2.1.genomic.fna >> virus_db.fa

rm viral.1.1.genomic.fna viral.2.1.genomic.fna 

Bacteria

Copy and paste the following commands in your command line to download the whole NCBI database of complete bacterial genomes (this might take a long time).

mkdir /path/to/wherever/bacteria_db

cd /path/to/wherever/bacteria_db

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria.*.1.genomic.fna.gz

for i in $(ls | grep -e genomic.fna.gz -e bacteria); do gzip -d $i ; rm $i ; done 

for i in $(ls | grep -e genomic.fna -e bacteria); do cat $i >> bacteria_db.fa ; rm $i ; done  

Fungi

Copy and paste the following commands in your command line to download the whole NCBI database of complete fungi genomes.

mkdir /path/to/wherever/fungi_db

cd /path/to/wherever/fungi_db

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/fungi.*.1.genomic.fna.gz

for i in $(ls | grep -e genomic.fna.gz -e fungi); do gzip -d $i ; done 

for i in $(ls | grep -e genomic.fna -e fungi); do cat $i >> fungi_db.fa ; rm $i ; done  

How to index the reference genome or a microbe DB

HISAT2

/path/to/hisat/hisat2-build /path/to/reference/file/reference_genome.fa index_base

E.g. If the reference genome is the file hg19.fa, located in /home/dataset/ and the hisat2-build binary is located in /home/bin/, the command line would be:

/home/bin/hisat2-build /home/dataset/hg19.fa hg19

BWA

/path/to/bwa/bwa index /path/to/reference/file/reference_genome.fa

E.g. If the reference genome is the file hg19.fa, located in /home/dataset/ and the bwa binary is located in /home/bin/, the command line would be:

/home/bin/bwa index /home/dataset/hg19.fa

Graphical user interface

We made a graphical user interface (GUI) for DNAscan2 to enable a wider audience to use it (e.g. non-bioinformaticians, users unfamiliar with the command line) using PySimpleGUI.You can install dependencies, enter all required and customisation options, and view the constructed command line input before running DNAscan2. If you are happy with the options chosen, DNAscan2 will run in the window, and all the output files will be on your system when complete. There is a dedicated pdf manual for the DNAscan2 GUI which you can access from the DNAscanv2 directory if you want to run DNAscanv2 from scratch using this method. After downloading the repository on your system, execute the following command:

python3 DNAscan_gui.py

alt text

Figure 2. Overview of the DNAscan2 graphical user interface (GUI). A: Main window which first appears when running the GUI. The 'Dependency Installation' tab is displayed first to allow first time users to obtain all necessary resources. The dropdown menu, on the top left of the interface, provides the user with usage documentation, access to the DNAscan and ALSgeneScanner publications, and key information about the developers of DNAscan2. B: The 'Basic Options' tab of DNAscan2, populated with default settings. C: The 'Customisation' tab of DNAscan2 - the 'Alignment', 'Variant Calling', 'Reports' and 'Regions' sections allow the user to tailor DNAscan2 to their specifications. Depending on these options, the number of advanced options that show up in the respective tab will differ. Before the 'Run DNAscan' button is pressed, if there are any advanced options or read group information (used with alignment + add read group options), these need to be added first with the 'Add Advanced Options' and 'Add Read Group Info' buttons.

Dependencies

List of dependencies

  • Samtools >= 1.9
  • HISAT2 = 2.2.1
  • BWA = 0.7.17
  • Strelka = 2.9.10
  • Python >= 3 (if using Expansion Hunter 3.2.2, only versions 3.5.x, 3.6.x, >=3.7.0, >= 3.8.0 are supported - tested with 3.8.0)
  • Pysam = 0.16.0.1
  • Perl (tested with 5.16.3 x86_64-Linux)
  • Vcftools = 0.1.16
  • Bedtools = 2.25
  • Samblaster >= 0.1.24
  • Sambamba >= 0.6.6
  • Manta 1.6.0 (optional, needed only if interested in structural variants)
  • Delly v0.8.7 (optional, needed only if interested in structural variants)
  • MELT v2.2.2 (optional, needed only if interested in mobile insertion elements)
  • ExpansionHunter = 3.2.2 (optional, needed only if interested in known motif expansions)
  • ExpansionHunter Denovo = 0.9.0 (optional, needed only if interested in non-reference/novel hort tandem repeat motif discovery)
  • Bcftools >= 1.9 (optional, needed only if interested in performing custom variant filtering and calls report)
  • Annovar "Version >= $Date: 2016-02-01 00:11:18 -0800 (Mon, 1 Feb 2016)" (optional, needed only if interested in performing variant annotation)
  • AnnotSV = 3.0.8 (optional, needed only if interested in performing variant annotation)

Tools needed for generating graphical reports

  • RTG Tools >= 3.6.2
  • Multiqc >= 1.2
  • Fastqc >=0.11.9
  • knotAnnotSV = 1.0.0 (optional, needed only if interested in structural elements

Tools needed for a container based deployment

  • Docker >= 1.7.1
  • Singularity >= 2.2

Tools needed to run the graphical user interface

  • PySimpleGUI = 4.60.4

An environment YAML file (DNAscan2.yaml) is provided for dependencies which can be installed via Conda (all dependencies apart from Strelka, Manta, MELT, ExpansionHunterDenovo, Annovar, AnnotSV, knotAnnotSV) to prevent software version clashes. It is recommended to create a new conda environment to store these dependencies:

conda env create -f DNAscan2.yaml

This will create an environment entitled DNAscan2. To activate the environment:

conda activate DNAscan2

How to install dependencies

Using Bioconda

For a fast and easy deployment of most dependencies we recommend the use of the Miniconda3 package. To download and install the latest Miniconda3 package, which contains the conda package manager:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p /path/to/Miniconda3/installation/directory

In the file /path_to_your_home_dir/.bashrc add the following line:

export PATH=/path/to/Miniconda3/installation/directory/Miniconda3/bin:$PATH 

Bioconda is a repository of binary bioinformatics tools which makes it very easy to install many open source softwares. To install the needed dependencies with Bioconda:

conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda

And then, if you want to install samtools for example:

conda install samtools

And add samtools to paths_configs.py:

path_samtools = "path/to/miniconda3/samtools/bin/

NOTE: if using Miniconda3 to install dependencies, paths to all binary dependencies end in bin/ as shown above.

Download Annovar

A more complete documentation about how to set up and use Annovar can be found here. However, in the following, there are some brief instructions on how to get Annovar and the necessary databases to use DNAscan annotation.

The latest version of Annovar can be downloaded here (registration required).

After you have downloaded Annovar:

tar xzvf annovar.latest.tar.gz

Now let's download some data bases for the DNAscan annotation step (assuming you want to work with hg19):

cd annovar

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp33a humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar clinvar_20210501 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar cadd humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar intervar_20180118 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad211_genome humandb/

Finally, we have to update paths_configs.py appropriately. Details about how to download and use the Annovar databases can be found here. For the above databases:

annovar_protocols = "refGene,dbnsfp33a,clinvar_20210501,avsnp147,cadd,intervar_20180118,ALL.sites.2015_08,gnomad211_genome"

annovar_operations = "g,f,f,f,f,f,f,f"

Docker and Singularity

Docker is an open-source project that automates the deployment of applications inside software containers. Using containers to deploy our system and creating our analysis environment would allow us to make our work independent of the machine we work on. This would improve the reproducibility of our science, the portability and reliability of our deployments and avoid any machine specific issues. For this reason working using containers isn't just recommended but also makes things easier. Since docker is widely used and maintained we recommend it as container technology to use if possible. Unfortunately Docker does require sudo privileges to run its containers making its use difficult on HPC facilities.

Singularity is also a container project similar to Docker and does not require sudo privileges to run. This can be very important if you decide to use our framework on a machine for which you do not have such privileges. E.g. your institution HPC cluster. In this case you can use Singularity to convert the docker image into a singularity image and run a bash shell in the resulting Singularity container:

$ singularity shell docker://hevmarriott/dnascan2_hg19 OR docker://hevmarriott/dnascan2_hg38

After starting the bash shell inside the singularity container you can find a working deployment of DNAscan in /DNAscan

$ Singularity.dnascan2_hg19 OR .dnascan2_hg38 > cd /DNAscan
$ Singularity.dnascan2_hg19 OR .dnascan2_hg38 > cat /DNAscan/docker/welcome_message.txt
Docker setup
ubuntu

LINK

MAC

LINK

Windows

LINK

Singularity setup
Linux

LINK

MAC

LINK

Core Contributors

Contributors

  • NAME, [COUNTRY OR AFFILIATION]

For a full list of contributors see LINK

Contributing

Here’s how we suggest you go about proposing a change to this project:

  1. Fork this project to your account.
  2. Create a branch for the change you intend to make.
  3. Make your changes to your fork.
  4. Send a pull request from your fork’s branch to our master branch.

Using the web-based interface to make changes is fine too, and will help you by automatically forking the project and prompting to send a pull request too.

Licence


Core Developers funded as part of:
MNDA
NIHR Maudsley Biomedical Research Centre (BRC), King's College London
Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London

alt text