This Snakemake pipeline downloads reads directly from NCBI's SRA database with fastq-dump.
It processes raw read records of bacterial genomes, from SRA accessions to annotated de novo assemblies.
Reads are also mapped to the reference genome provided for variant calling. Core SNPs, called from regions shared by all input sequences, are produced at the end of the pipeline.
Output files are assessed with 1) FastQC (reads) and 2) QUAST (assemblies).
- install miniconda 3
- create working directory
git clone https://github.com/rx32940/BactAsm.git
cd BactAsm
- create environment.yaml file
- create conda env for snakemake
- add dependencies for plotting
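conda env create -n BactAsm --file env/environment.yaml
The environment targets Python 3.7; pin python=3.7 inside env/environment.yaml, since conda env create does not accept package specs on the command line.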
python BactAsm.py -h
usage: BactAsm.py [-h] [-s] [-b] [-l] [-f] [-o] [-t] [-k] [-g] [-c]
Fetch SRA records from NCBI and perform de novo assemble & read alignments to reference genome
optional arguments:
-h, --help show this help message and exit
-s , --sra SRA accession ID you would like to download
-b , --sampleID sampleID of the sample (this can be same as the SRA ID)
-l , --list input list (provide each sample' SampleID and sraID in a row, separated by TAB)
-f , --ref reference genome (required)
-o , --output output directory
-t , --thread number of threads to use
-k , --kingdom which kingdom the genome is from, default is Bacteria
-g , --genus which genus the genome is from, default is Leptospira
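The list passed with -l is described as one sample per row, with the SampleID and the SRA ID separated by a TAB. A minimal Python sketch for checking such a file before submitting a run might look like the following. The function name `read_sample_list`, the column order (SampleID first), the absence of a header line, and the sample identifiers are assumptions for illustration and are not part of BactAsm itself.

```python
import csv
import io


def read_sample_list(handle):
    """Return a list of (sample_id, sra_id) tuples from a TAB-separated file.

    Assumes two columns per row, SampleID first and SRA ID second,
    with no header line (an assumption; check what BactAsm expects).
    """
    samples = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip blank lines (csv yields [] or whitespace-only fields for them).
        if not row or all(not field.strip() for field in row):
            continue
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 TAB-separated fields, got {len(row)}"
            )
        sample_id, sra_id = (field.strip() for field in row)
        if not sample_id or not sra_id:
            raise ValueError(f"line {line_no}: empty SampleID or SRA ID")
        samples.append((sample_id, sra_id))
    return samples


# Example with hypothetical identifiers.
example = io.StringIO("sampleA\tSRR0000001\nsampleB\tSRR0000002\n")
print(read_sample_list(example))
```

Rows that use spaces instead of a TAB are rejected with a `ValueError` naming the line, which is usually easier to diagnose than a failed download later in the run.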
1) Modify the config file
2) Add the bacterial genus of interest to config.yaml
3) Add the SAMN accession and SRA accession to config.yaml
4) Add the expected output directory to config.yaml
5) Add the path to the reference genome to config.yaml, if available
6) Refer to the examples in the config file for exact instructions
7) Set the maximum number of threads allowed in config.yaml
sbatch submit_sapelo2.sh
- download FASTQ files from NCBI for the samples provided in the config file
- run FastQC on all raw read files
- aggregate the FastQC reports with MultiQC
- trim raw reads with fastp
- run FastQC again on the paired trimmed reads
- aggregate the FastQC reports with MultiQC
- use SPAdes for de novo assembly
outputdir/asm
- use QUAST without a reference genome to assess the de novo assemblies
- aggregate the assessments with MultiQC
- use Prokka for genome annotation
- use Snippy to call variants against the reference genome provided (no need to index the reference genome)
- aggregate variants for core SNP detection
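The last step reports core SNPs, that is, variant positions in regions shared by all input sequences. The idea can be illustrated with a small Python sketch that works on already-aligned, equal-length sequences: a column counts as "core" when every sample has an unambiguous base there, and as a core SNP when those bases also differ. This is a conceptual illustration only. The function name `core_snp_positions` and the toy sequences are hypothetical, and this is not a description of how Snippy implements the step.

```python
def core_snp_positions(alignment):
    """Return 0-based columns that are core (A/C/G/T in every sample) and variable.

    `alignment` maps sample name -> aligned sequence; all sequences must
    have the same length. Gaps ('-') and ambiguous bases such as 'N'
    mark a column as not shared by all samples.
    """
    sequences = [seq.upper() for seq in alignment.values()]
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    positions = []
    for col in range(length):
        bases = {seq[col] for seq in sequences}
        is_core = bases <= set("ACGT")  # every sample has a called base
        is_variable = len(bases) > 1    # at least two different bases
        if is_core and is_variable:
            positions.append(col)
    return positions


# Toy alignment with hypothetical sample names.
toy = {
    "ref":     "ACGTACGT",
    "sampleA": "ACGTTCGT",  # differs from ref at column 4
    "sampleB": "ACNTACGA",  # N at column 2, differs from ref at column 7
}
print(core_snp_positions(toy))  # → [4, 7]
```

Column 2 is excluded even though it contains an `N`, because an uncalled base means the site is not shared by all samples; this is why adding a low-coverage sample can shrink the core SNP set.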