a Nextflow pipeline to simulate metagenomic samples
A metagenome is all the genetic material in a sample. For example, the metagenome of a nasal swab may contain your own DNA as well as DNA from the bacteria, viruses and fungi in the swab. Metagenomics is often considered superior to traditional microbiological techniques for detecting pathogens because many microbes cannot be cultured. However, metagenomes are complex, and it is sometimes desirable to control exactly what a sample contains, particularly when testing metagenomic analysis tools.
As an example of how metagenome_sim-nf can be used, I originally wrote the pipeline to help evaluate the accuracy of kraken2 and bowtie2 in detecting Streptococcus pneumoniae. First, I simulated metagenomic samples that contained only Streptococcus pneumoniae; running kraken2 and bowtie2 on these samples helped evaluate the sensitivity of these tools. Second, I simulated metagenomic samples that contained other Streptococcus species that are easily mistaken for Streptococcus pneumoniae; running kraken2 and bowtie2 on these samples helped evaluate the specificity of these tools. Of course, metagenome_sim-nf can simulate metagenomes of any species, not just Streptococcus, as long as you have a reference genome. Nor is it restricted to testing kraken2 and bowtie2.
The main workflow can be found in main.nf. It consists of the following steps:
- Designing the community present in the metagenomic sample. That is, which species/strains (reference fasta files) are present and how much each contributes to the metagenome. The number of genomes is sampled from a Poisson distribution, and their relative contributions are sampled from a Dirichlet distribution (statistical sampling method by Gerry Tonkin-Hill). The source code is in modules/designCommunity.nf, which in turn calls the bin/designCommunity.py python script.
- Simulating the metagenomes based on the community structure generated above. This is done using art_illumina. Currently the pipeline only supports Illumina sequencing. The source code for this step is in modules/simReads.nf.
- Normalising the fastq files output by the previous step. art_illumina writes a dash '-' where there is a deletion; modules/normReads.nf corrects for that by replacing each '-' with an N. Whether this is an appropriate fix needs more thought.
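The community-design step described above can be sketched in plain Python. This is a minimal illustration, not the code in bin/designCommunity.py: it uses only the standard library (Knuth's multiplication method for the Poisson draw, normalised Gamma draws for the Dirichlet), and the function names, the reference file names, the symmetric alpha of 1.0, and the clamping of the genome count to between 1 and the number of available references are all assumptions.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(n, rng, alpha=1.0):
    """Draw n proportions summing to 1 by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def design_community(references, mean_genomes, rng):
    """Pick a Poisson-distributed number of references and give each a Dirichlet share."""
    # Assumption of this sketch: clamp the count so that every sample has at
    # least one genome and never more genomes than are available.
    n_genomes = min(max(sample_poisson(mean_genomes, rng), 1), len(references))
    chosen = rng.sample(references, n_genomes)
    shares = sample_dirichlet(n_genomes, rng)
    return dict(zip(chosen, shares))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Hypothetical reference files, standing in for the entries of ref_ls_file.
refs = ["genome_a.fasta", "genome_b.fasta", "genome_c.fasta", "genome_d.fasta"]
community = design_community(refs, mean_genomes=2, rng=rng)
for path, share in community.items():
    print(f"{path}\t{share:.3f}")
```

Each simulated sample would get its own call to design_community, so both the number of genomes and their proportions vary from sample to sample.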
- Nextflow
- Docker if using your own machine, or Singularity if using a shared HPC
If run without a container (e.g. when -c lsf.config is not used):
- python3 packages: numpy, pandas
- art
- Clone this repo
git clone git@github.com:Phuong-Le/metagenome_sim-nf.git
nf_script=/path/to/main.nf
config_file=/path/to/custom_config/file # e.g. lsf.config
sample_size=number of metagenomes to be generated
mean_genomes=the average number of genomes (species/strains) to be included (the actual number is chosen by Poisson sampling)
depth=simulated sequencing depth, defaults to 500
outdir=/path/to/dir/containing/fastq/files
ref_ls_file=/path/to/file/containing/genomes/allowed/in/simulated/metagenome
nextflow run ${nf_script} -c ${config_file} \
--sample_size ${sample_size} --mean_genomes ${mean_genomes} --depth ${depth} --outdir ${outdir} --ref_ls_file ${ref_ls_file}
Example on an LSF system like the one at Sanger (note that you could still use a plain nextflow run as above):
module load ISG/singularity/3.10.0
module load nextflow/22.10.3-5834
bsub -cwd /path/to/working_dir -o %J.out -e %J.err -R "select[mem>1000] rusage[mem=1000]" -M1000 \
"nextflow run ${nf_script} -c ${config_file} --sample_size ${sample_size} --mean_genomes ${mean_genomes} --depth ${depth} --outdir ${outdir} --ref_ls_file ${ref_ls_file}"
A demo file for ${ref_ls_file} can be found in demo_files.
Phuong Le (email: al35@sanger.ac.uk) and Vicky Carr
Thanks to Gerry Tonkin-Hill for sharing his method to design and simulate metagenomes
Thanks to Harry Hung for the great Nextflow advice
Could review the designCommunity for more flexibility, and potentially a simulation that's closer to reality
normReads handling of the '-' character should be reviewed as well
add help message
incorporate nf-test
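To make the '-' question concrete, here is a minimal sketch of the replacement, assuming standard four-line FASTQ records. It is not the code in modules/normReads.nf, and the function name and example record are hypothetical. One point worth checking in any review: '-' is also a valid quality character in Phred+33 encoding, so the sketch only touches the sequence line of each record.

```python
def normalise_fastq(lines):
    """Replace '-' with 'N' in sequence lines only, leaving headers and qualities untouched."""
    out = []
    for index, line in enumerate(lines):
        # In a four-line FASTQ record the sequence is the second line (index % 4 == 1).
        if index % 4 == 1:
            line = line.replace("-", "N")
        out.append(line)
    return out


# Hypothetical record: the quality line also contains '-', which must be preserved.
record = ["@read1", "AC-GT-A", "+", "II-IIII"]
print(normalise_fastq(record))
# prints ['@read1', 'ACNGTNA', '+', 'II-IIII']
```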