Example of using BIB to identify Staphylococcus aureus strains. To run all the commands at once, install the required software and run the run_tutorial.sh
script.
Steps 1.-6. can take up to an hour to run. The end result ref_seqs_gapless.fasta
file is provided if you wish to skip the steps.
The example read file SRR016122.fastq.gz
can be downloaded from the European Nucleotide Archive.
-
Obtain the full alignment of the assemblies.
progressiveMauve --output=full_alignment.xmfa data/saur_assemblies.fasta
-
Extract LCBs shared by all genomes. The first number "500" is the minimum length of the LCB; the second number "4" indicates the minimum number of genomes that share an LCB.
stripSubsetLCBs full_alignment.xmfa full_alignment.xmfa.bbcols core_alignment.xmfa 500 4
-
Concatenate all the LCBs.
perl xmfa2fasta.pl --file core_alignment.xmfa > core_alignment.fasta
-
Run hierBAPS to obtain a clustering with "1" level and a maximum of "10" clusters. The clustering is stored in the
results.partition.txt
file. Sequences ">1" and ">2" belong to cluster 1; sequences ">3" and ">4" to cluster 2.hierBAPS.sh exData core_alignment.fasta fasta
hierBAPS.sh hierBAPS seqs.mat 1 10 results
-
Select sequences ">1" and ">3" to represent the clusters using
fastagrep.sh
from the BitSeq package.fastagrep.sh ">1 " core_alignment.fasta > ref_seqs.fasta
fastagrep.sh ">3 " core_alignment.fasta >> ref_seqs.fasta
-
Remove gaps from the reference sequences.
sed 's/-//g' ref_seqs.fasta > ref_seqs_gapless.fasta
-
Build the alignment index.
python BIB_prepare_index.py ref_seqs_gapless.fasta reference_alignment_index
-
Perform the read alignment and abundance estimation.
python BIB_analyse_reads.py SRR016122.fastq.gz ref_seqs_gapless.fasta reference_alignment_index SRR016122_abundances