Genomic epidemiology with mixed samples: a tutorial

This tutorial constains instructions on how to repproduce the results of the three main synthetic experiments from the preprint Genomic epidemiology with mixed samples.

The tutorial will focus on reproducing the Escherichia coli experiment but contains instructions on how to adapt the scripts to the Enterococcus faecalis and Staphylococcus aureus experiments.

For quick instructions on how to run the pipeline in a general setting, please refer to the README.md file in the root of this repository.

Note: This version of the tutorial assumes that you are using a Themisto version between v0.1.1 and v1.2.0. Check the tutorial from an updated version of mGEMS for instructions on how to use newer versions of Themisto.

Requirements

mGEMS pipeline

Themisto
- Themisto version should be between v0.1.1 and v1.2.0
- For compatibility with newer versions of Themisto, update your mGEMS installation.
mSWEEP
mGEMS
shovill

Phylogenetic analysis

snippy
RAxML-NG

Extra (macOS only)

GNU coreutils

On a macOS system, you'll also need to install GNU coreutils from homebrew and alias the macOS zcat command to the GNU zcat command for the duration of the session

brew install coreutils
alias zcat=gzcat
ulimit -n 2048

Concurrent file connections limit

macOS also limits the number of concurrent file connections, which will have to be increased to run Themisto and shovill

ulimit -n 2048

Tutorial

Select a species

Download the supplementary table from the mGEMS manucsript which contains the relevant information

wget https://zenodo.org/record/3724144/files/mGEMS_Supplementary_Table_mixed_samples.tsv

Filter the table to contain only the E. coli (ecoli) experiments

grep "ecoli" mGEMS_Supplementary_Table_mixed_samples.tsv > mixed_samples.tsv

If you want to reproduce the E. faecalis experiments, change 'ecoli' to 'efaec'. For S. aureus, change 'ecoli' to 'saur'. Running these other two experiments may require resources beyond the typical laptop or desktop computer.

Reference data

The reference data from the preprint available from zenodo

E. coli
E. faecalis
S. aureus

Construction of the reference dataset(s) is describe in more detail in the preprint.

Download and extract the E. coli dataset by running

wget https://zenodo.org/record/3724112/files/mGEMS-ecoli-reference-v1.0.0.tar.gz
tar -zxvf mGEMS-ecoli-reference-v1.0.0.tar.gz

Indexing

Create a 31-mer pseudoalignment index with Themisto using two threads and maximum 8192 megabytes of RAM.

Themisto version v2.0.0 or newer

mkdir mGEMS-ecoli-reference
mkdir mGEMS-ecoli-reference/tmp
themisto build -k 31 -i mGEMS-ecoli-reference-sequences-v1.0.0.fasta.gz -o mGEMS-ecoli-reference/index --temp-dir mGEMS-ecoli-reference/tmp --mem-megas 8192 --n-threads 2

Themisto versions v0.1.1 to v1.2.0

mkdir mGEMS-ecoli-reference
mkdir mGEMS-ecoli-reference/tmp
build_index --k 31 --input-file mGEMS-ecoli-reference-sequences-v1.0.0.fasta.gz --auto-colors --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --mem-megas 8192 --n-threads 2

change 'ecoli' to 'efaec' or 'saur' if you are trying to reproduce the other experiments.

Synthetic mixed samples

Download the isolate sequencing data and create the synthetic mixed samples by concatenating the isolate files

## Download the sequencing data and create the samples
oldid=""
while read line; do
	id=$(echo $line | cut -f3 -d' ')
	sample=$(echo $line | cut -f1 -d' ')
	scripts/get_forward.sh $sample | gunzip -c >> $id""_1.fastq
	scripts/get_reverse.sh $sample | gunzip -c >> $id""_2.fastq
	if [[ "$id" != "$oldid" ]]; then
		if [ ! -z "$oldid" -a "$oldid" != "" ]; then
			gzip $oldid""_1.fastq
			gzip $oldid""_2.fastq
		fi
	fi
	oldid=$id
done < mixed_samples.tsv
gzip $oldid""_1.fastq
gzip $oldid""_2.fastq

Pseudoalignment

Align the mixed sample files against the index using two threads

Themisto version v2.0.0 or newer

for f1 in *_1.fastq.gz; do
	f=${f1%_1.fastq.gz}
	f2=$f""_2.fastq.gz
	themisto pseudoalign -q $f1 -o $f""_1.aln -i mGEMS-ecoli-reference/index --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
	themisto pseudoalign -q $f2 -o $f""_2.aln -i mGEMS-ecoli-reference/index --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
done

Themisto versions v0.1.1 to v1.2.0

for f1 in *_1.fastq.gz; do
	f=${f1%_1.fastq.gz}
	f2=$f""_2.fastq.gz
	pseudoalign --query-file $f1 --outfile $f""_1.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
	pseudoalign --query-file $f2 --outfile $f""_2.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
done

Abundance estimation

Estimate the relative abundances with mSWEEP and write the results and posterior probabilities using two threads

for f1 in *_1.fastq.gz; do
	f=${f1%_1.fastq.gz}
	mkdir $f
	mSWEEP --themisto-1 $f""_1.aln.gz --themisto-2 $f""_2.aln.gz --themisto-index mGEMS-ecoli-reference -i mGEMS-ecoli-reference-grouping-v1.0.0.txt -o $f/$f --write-probs --gzip-probs -t 2
done

Binning

Bin the reads with mGEMS and write the binned samples to the 'ecoli-1' folder.

while read line; do
	id=$(echo $line | cut -f3 -d' ')
	cluster=$(echo $line | cut -f2 -d' ')
	mGEMS --groups $cluster -r $id""_1.fastq.gz,$id""_2.fastq.gz -i mGEMS-ecoli-reference-grouping-v1.0.0.txt --themisto-alns $id""_1.aln.gz,$id""_2.aln.gz -o $id --probs $id/$id""_probs.csv.gz -a $id/$id""_abundances.txt --index mGEMS-ecoli-reference
done < mixed_samples.tsv

Note that by default mGEMS creates bins for all reference lineages. If know which lineages the samples originate from (in our case these are supplied in the mixed_samples.tsv in the second column), the '--groups' option enables you to only create those bins. Multiple groups can be binned in the single run by supplying them as a comma-separated list.

Assembly

Assemble the sequences with shovill using 2 threads and maximum of 8192 megabytes of RAM

while read line; do
	id=$(echo $line | cut -f3 -d' ')
	cluster=$(echo $line | cut -f2 -d' ')
	shovill --outdir $id/$cluster --R1 $id/$cluster""_1.fastq.gz --R2 $id/$cluster""_2.fastq.gz --cpus 2 --ram 8
	mv $id/$cluster/contigs.fa $id/
	rm -rf $id/$cluster
	mkdir $id/$cluster
	mv $id/contigs.fa $id/$cluster/
done < mixed_samples.tsv

SNP calling

Download the reference sequence 'NCTC13441' from the ENA

wget -O NCTC13441.fasta.gz http://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/uf/UFZF01.fasta.gz

Call SNPs in the genome with snippy

mkdir snippy-tmp
gunzip NCTC13441.fasta.gz
while read line; do
	id=$(echo $line | cut -f3 -d' ')
	sample=$(echo $line | cut -f1 -d' ')
	cluster=$(echo $line | cut -f2 -d' ')
	snippy --outdir $id/$cluster/$sample --ctgs $id/$cluster/contigs.fa --ref NCTC13441.fasta --cpus 2 --ram 8 --tmpdir snippy-tmp
done < mixed_samples.tsv
gzip NCTC13441.fasta

Build the core SNP alignment with snippy

snippys=""
while read line; do
	id=$(echo $line | cut -f3 -d' ')
	sample=$(echo $line | cut -f1 -d' ')
	cluster=$(echo $line | cut -f2 -d' ')
	snippys=$snippys""$id/$cluster/$sample" "
	last=$id/$cluster/$sample
done < mixed_samples.tsv
snippy-core --ref $last/ref.fa $snippys

the alignment will be stored in the 'core.full.aln' file.

Phylogenetic inference

Use RAxML-NG to infer a maximum likelihood phylogeny from 10 starting parsimony and random trees under the GTR+G4 model

raxml-ng --search --msa core.full.aln --prefix CT --threads 2 --tree rand{10},pars{10} --model GTR+G4

Calculate bootstrap support values with 100 replicates

raxml-ng --bootstrap --msa core.full.aln --bs-trees 100 --prefix CB --threads 2 --model GTR+G4

Perform a bootstrap convergence check

raxml-ng --bsconverge --bs-trees CB.raxml.bootstraps --prefix CS --seed 2 --threads 2

Add the bootstrap support values to the maximum likelihood tree with the best likelihood

raxml-ng --support --tree CT.raxml.bestTree --bs-trees CB.raxml.bootstraps --prefix CS --threads 2

The best tree with the bootstrap values will be written in the 'CS.raxml.support' file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TUTORIAL.md

TUTORIAL.md

Genomic epidemiology with mixed samples: a tutorial

Requirements

mGEMS pipeline

Phylogenetic analysis

Extra (macOS only)

GNU coreutils

Concurrent file connections limit

Tutorial

Table of Contents

Select a species

Reference data

Indexing

Synthetic mixed samples

Pseudoalignment

Abundance estimation

Binning

Assembly

SNP calling

Phylogenetic inference

Files

TUTORIAL.md

Latest commit

History

TUTORIAL.md

File metadata and controls

Genomic epidemiology with mixed samples: a tutorial

Requirements

mGEMS pipeline

Phylogenetic analysis

Extra (macOS only)

GNU coreutils

Concurrent file connections limit

Tutorial

Table of Contents

Select a species

Reference data

Indexing

Synthetic mixed samples

Pseudoalignment

Abundance estimation

Binning

Assembly

SNP calling

Phylogenetic inference