Get the consensus sequences for short sequencing reads mapped to a reference. Version 2.0
The program takes as input a SAM file (.sam or .sam.gz) resulting from mapping short reads to a reference (the reference
sequences can correspond to separate genes for example), then it calculates the consensus sequence from the aligned
reads alone. If you have BAM files you will need to convert them first with samtools. A single or multiple consensus thresholds can be specified, the program also adds insertions, if many long
long insertions are expected, we recommend to perform indel ralignment before for optimal results. The consensus method
is the one used by Geneious and described in detail in http://assets.geneious.com/manual/8.1/GeneiousManualse41.html
Regions with no coverage are filled with -s (or a different character if specified). Input SAM files don't need to be sorted. Original reference FASTAs are not necessary since the consensus is reference-free.
It will produce a FASTA file per reference containing as many sequences as thresholds were specified.
Just type python sam2consensus.py -h to show the help of the program:
usage: sam2consensus.py [-h] -i FILENAME [-c THRESHOLDS] [-n N] [-o OUTFOLDER]
                        [-p PREFIX] [-m MIN_DEPTH] [-f FILL] [-d MAXDEL]
+------------------------------------------------------------------+
| sam2consensus.py: extract the consensus sequence from a SAM file |
+------------------------------------------------------------------+
The program takes as input a SAM file (.sam or .sam.gz) resulting from mapping
short reads to a reference (the reference sequences can correspond to separate
genes for example), then it calculates the consensus sequence from the aligned
reads alone. A single or multiple consensus thresholds can be specified, the
program also adds insertions, if many long insertions are expected, we recommend
to perform indel ralignment before for optimal results. The consensus method is
the one used by Geneious and described in detail in
http://assets.geneious.com/manual/8.1/GeneiousManualse41.html
Regions with no coverage are filled with -s (or a different character if
specified). Input SAM files don't need to be sorted. Original reference FASTAs
are not necessary since the consensus is reference-free.
It will produce a FASTA file per reference containing as many sequences as
thresholds were specified.
optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of the SAM file, SAM does not need to be sorted
                        and can be compressed with gzip
  -c THRESHOLDS, --consensus-thresholds THRESHOLDS
                        List of consensus thresold(s) separated by commas, no
                        spaces, example: -c 0.25,0.75,0.50, default=0.25
  -n N                  Split FASTA output sequences every n nucleotides,
                        default=do not split sequence
  -o OUTFOLDER, --outfolder OUTFOLDER
                        Name of output folder, default=same folder as input
  -p PREFIX, --prefix PREFIX
                        Prefix for output file name, default=input filename
                        without .sam extension
  -m MIN_DEPTH, --min-depth MIN_DEPTH
                        Minimum read depth at each site to report the
                        nucleotide in the consensus, default=1
  -f FILL, --fill FILL  Character for padding regions not covered in the
                        reference, default= - (gap)
  -d MAXDEL, --maxdel MAXDEL
                        Ignore deletions longer than this value, default=150
In the following examples, if you added the script to the PATH you can omit python from the commands.
Example 1: Using the default consensus threshold of 0.25:
python sam2consensus.py -i myfile.samExample 2: Using multiple consensus thresholds of 0.25, 0.50, and 0.75 and specifying an output folder:
python sam2consensus.py -i myfile.sam -c 0.25,0.50,0.75 -o resultsExample 3: Using a custom consensus threshold of 0.75 and changing padding character for uncovered regions to "N":
python sam2consensus.py -i myfile.sam -c 0.75 -f NExample 4: Specifying a prefix "sample1" and increasing minimum mapping depth to 10:
python sam2consensus.py -i myfile.sam -pre sample1 -m 10- Code: Edgardo M. Ortiz
- Data and testing: Deise J. P. Gonçalves