codoff: A program to measure the irregularity of the codon usage for a focal genomic region (e.g. a BGC or prophage) relative to the full genome. It was primarily designed to work with antiSMASH biosynthetic gene cluster (BGC) predictions, but it can be applied more broadly: users can instead provide a genome in FASTA format and specify the coordinates of the focal region of interest. It computes an empirical P-value for how significant it is to observe a focal-region codon usage profile this discordant with the genome-wide context.
This is useful because it could indicate that the focal gene cluster has been horizontally transferred. While this check is quick and easy to run, we encourage users to investigate further whether horizontal gene transfer (HGT) is indeed responsible for any such signal, because a discordance between the codon usage of the focal gene cluster and the background genome could have other causes, e.g. infrequent expression of the focal gene cluster.
codoff is largely adapted from the compareBGCtoGenomeCodonUsage.py script in the lsaBGC suite and is presented here separately to make it easy to install via bioconda. Therefore, please cite:
Evolutionary investigations of the biosynthetic diversity in the skin microbiome using lsaBGC. Microbial Genomics 2023. Rauf Salamzade, J.Z. Alex Cheong, Shelby Sandstrom, Mary Hannah Swaney, Reed M. Stubbendieck, Nicole Lane Starr, Cameron R. Currie, Anne Marie Singh, and Lindsay R. Kalan
If you have suggestions or feedback, please let us know in the GitHub issues.
Note: for some setups at least, it is critical to specify the conda-forge channel before the bioconda channel so that channel priority is configured properly and the installation succeeds.
Recommended: For a significantly faster installation process, use mamba in place of conda in the below commands, by installing mamba in your base conda environment.
conda create -n codoff_env -c conda-forge -c bioconda codoff
conda activate codoff_env
# in some environment with python >=3.10 available
pip install codoff
# 1. clone Git repo and change directories into it!
git clone https://github.com/raufs/codoff/
cd codoff/
# 2. create conda environment using yaml file and activate it!
conda env create -f codoff_env.yml -n codoff_env
conda activate codoff_env
# 3. complete python installation with the following command:
pip install -e .
Uncompress the example inputs (trimmed down to save space) from this git repo for Staphylococcus warneri st. 413:
# within codoff git repo
tar -zxvf Sw_LK413.tar.gz
Example 1: Providing focal region and genome-wide GenBank files as input (e.g. based on antiSMASH outputs)
(WORKS FOR BOTH EUKARYOTES & BACTERIA) Focal region and full-genome provided as GenBank files with CDS features (compatible with antiSMASH outputs). Multiple focal region GenBank files can be provided, e.g. consider a biosynthetic gene cluster split across multiple scaffolds due to assembly fragmentation.
codoff -f Sw_LK413/NZ_JALXLO020000001.1.region001.gbk -g Sw_LK413/LK413.gbk
(WORKS ONLY FOR BACTERIA) Full genome is provided as a FASTA or GenBank file. If CDS features are missing, gene calling is performed using pyrodigal. Afterwards, the focal region is determined through user-specified coordinates.
codoff -s NZ_JALXLO020000001.1 -a 341425 -b 388343 -g Sw_LK413/LK413.fna -p example_plot.svg
Here, we also supplied the -p argument to generate a plot of the simulated distribution of cosine distances for regions of similar size to the focal region, together with the actual cosine distance for the focal region/cluster (blue vertical line).
A more detailed example on how homologous instances of the same five-gene operon differ in codon usage between an instance on the plasmid and chromosome can be found on the wiki: Examples of inferring codon usage for the crt operon in a chromosomal and plasmid context
codoff uses a Monte Carlo simulation approach to calculate an empirical P-value to assess codon usage differences between the focal region(s) of interest and the background genome. The algorithm works as follows:
- Extract all CDS features from the genome (genes with lengths divisible by 3)
- Calculate codon frequency distributions for:
  - Focal region: Codons from genes in the BGC/focal region
  - Background genome: All remaining genes in the genome
- Compute cosine distance between focal and background codon frequency distributions
- Calculate Spearman correlation coefficient between the two distributions
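The counting and distance steps above can be sketched in plain Python. This is a minimal illustration rather than codoff's actual implementation: the function names `count_codons` and `cosine_distance` are hypothetical, and the toy sequences are invented for the example. Because cosine distance is unaffected by scaling, raw codon counts and normalized frequencies give the same result.

```python
from collections import Counter
from math import sqrt


def count_codons(cds_sequences):
    """Count codons across CDS sequences, skipping any whose length is not divisible by 3."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        if len(seq) % 3 != 0:
            continue  # mirrors the "length divisible by 3" filter described above
        for i in range(0, len(seq), 3):
            counts[seq[i:i + 3]] += 1
    return counts


def cosine_distance(counts_a, counts_b):
    """Return 1 - cosine similarity between two codon count profiles."""
    codons = sorted(set(counts_a) | set(counts_b))
    a = [counts_a.get(c, 0) for c in codons]
    b = [counts_b.get(c, 0) for c in codons]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty codon profile")
    return 1.0 - dot / (norm_a * norm_b)


# Toy data (hypothetical): one focal gene and two background genes,
# the second of which is skipped because its length is not divisible by 3.
focal = count_codons(["ATGAAAAAATAA"])
background = count_codons(["ATGGCTGCTTAA", "ATGGC"])
distance = cosine_distance(focal, background)
```

A Spearman correlation over the same two aligned vectors could be obtained with `scipy.stats.spearmanr`, if SciPy is available.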
For each of N simulations (default: 10,000, configurable with --num-sims):
- Shuffle the complete list of all genes in the genome
- Select genes sequentially from the shuffled list until accumulating the same number of codons as in the actual focal region
- Calculate codon frequencies for this simulated "focal" region
- Calculate background frequencies as: total_genome_counts - simulated_focal_counts
- Compute cosine distance between simulated focal and background frequencies
- Count how many simulated distances ≥ observed distance
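The simulation loop described above might look roughly like the following sketch. It is not codoff's own code: `simulate_distances` and its parameters are hypothetical names, each gene is represented as a `Counter` of its codons, and the sketch assumes whole genes are added until the codon target is reached (so a simulated region may slightly overshoot the focal size). Treating an empty profile as distance 1.0 is also an assumption made here to keep the toy case well defined.

```python
import random
from collections import Counter
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity of two codon count profiles (dict-like)."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption of this sketch: empty profile counts as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def simulate_distances(gene_codon_counts, focal_codon_total, num_sims=10000, seed=None):
    """Return one cosine distance per simulated focal region."""
    rng = random.Random(seed)
    genome_total = Counter()
    for counts in gene_codon_counts:
        genome_total.update(counts)
    genes = list(gene_codon_counts)
    distances = []
    for _ in range(num_sims):
        rng.shuffle(genes)  # shuffle the complete gene list
        sim_focal = Counter()
        collected = 0
        for counts in genes:  # take genes in shuffled order until the target is met
            if collected >= focal_codon_total:
                break
            sim_focal.update(counts)
            collected += sum(counts.values())
        sim_background = genome_total - sim_focal  # total_genome_counts - simulated_focal_counts
        distances.append(cosine_distance(sim_focal, sim_background))
    return distances


# Toy genome (hypothetical): four genes of ten codons each.
toy_genes = [
    Counter({"AAA": 10}),
    Counter({"GCT": 10}),
    Counter({"AAA": 5, "GCT": 5}),
    Counter({"TTT": 10}),
]
sim_distances = simulate_distances(toy_genes, focal_codon_total=10, num_sims=200, seed=1)
observed = 0.5  # placeholder observed distance for the toy example
count_at_least_observed = sum(1 for d in sim_distances if d >= observed)
```

Fixing `seed` makes a run reproducible, which is convenient when comparing results across different `--num-sims` settings.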
The empirical P-value is calculated as:
P-value = (count of simulations with distance ≥ observed distance + 1) / (total simulations + 1)
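As a small worked example of this formula, the sketch below uses a hypothetical helper named `empirical_p_value` and made-up distances. The added 1 in the numerator and denominator means the P-value can never be exactly zero: with the default 10,000 simulations, the smallest attainable value is 1/10001.

```python
def empirical_p_value(simulated_distances, observed_distance):
    """(number of simulations with distance >= observed + 1) / (total simulations + 1)."""
    exceed = sum(1 for d in simulated_distances if d >= observed_distance)
    return (exceed + 1) / (len(simulated_distances) + 1)


# Made-up simulated distances: two of the five are >= 0.04,
# so the P-value is (2 + 1) / (5 + 1) = 0.5.
sims = [0.01, 0.02, 0.03, 0.05, 0.08]
p_value = empirical_p_value(sims, 0.04)
```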
This approach tests whether the observed focal region's codon usage is significantly different from what would be expected if we randomly selected the same amount of coding sequence from anywhere in the genome.
Beginning in v1.2.0, codoff can also be used as a function in your Python code. Check out this wiki page for info on how to do this and details on the API.
Running version 1.2.2 of codoff!
usage: codoff [-h] -g FULL_GENOME [-s SCAFFOLD] [-a START_COORD] [-b END_COORD] [-f FOCAL_GENBANKS [FOCAL_GENBANKS ...]] [-o OUTFILE] [-p PLOT_OUTFILE] [--num-sims NUM_SIMS] [-v]
Program: codoff
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison
This program compares the codon-usage distribution of a focal-region/BGC
to the codon usage of the background genome. It will report the cosine
distance and Spearman correlation between the two profiles, as well as
an empirical P-value for whether the codon usage is siginficantly
different between the focal region and background genome. Only CDS
features which are of length divisible by 3 will be considered.
Two modes of input are supported:
1. (WORKS FOR BOTH EUKARYOTES & BACTERIA) Focal region and full-genome
provided as GenBank files with CDS features (compatible with
antiSMASH outputs). Multiple focal region GenBank files can be provided,
e.g. consider a biosynthetic gene cluster split across multiple
scaffolds due to assembly fragmentation.
Example command:
$ codoff -f Sw_LK413/NZ_JALXLO020000001.1.region001.gbk -g Sw_LK413/LK413.gbk
$ codoff -f region.gbk -g genome.gbk --num-sims 5000
2. (WORKS ONLY FOR BACTERIA) Full genome is provided as a FASTA or GenBank
file. If CDS features are missing gene calling is performed using
pyrodigal. Afterwards, the focal region is determined through user
speciefied coordinates.
Example command:
$ codoff -s NZ_JALXLO020000001.1 -a 341425 -b 388343 -g Sw_LK413/LK413.fna
$ codoff -s scaffold -a 1000 -b 5000 -g genome.fna --num-sims 20000
options:
-h, --help show this help message and exit
-g FULL_GENOME, --full-genome FULL_GENOME
Path to a full-genome in GenBank or FASTA format. If GenBank file
provided, CDS features are required.
-s SCAFFOLD, --scaffold SCAFFOLD
Scaffold identifier for focal region.
-a START_COORD, --start-coord START_COORD
Start coordinate for focal region.
-b END_COORD, --end-coord END_COORD
End coordinate for focal region.
-f FOCAL_GENBANKS [FOCAL_GENBANKS ...], --focal-genbanks FOCAL_GENBANKS [FOCAL_GENBANKS ...]
Path to focal region GenBank(s) for isolate. Locus tags must match
with tags in full-genome GenBank.
-o OUTFILE, --outfile OUTFILE
Path to output file [Default is standard output].
-p PLOT_OUTFILE, --plot-outfile PLOT_OUTFILE
Plot output file name (will be in SVG format). If not provided, no
plot will be made.
--num-sims NUM_SIMS Number of simulations to run [Default: 10000]
-v, --version Print version and exist
BSD 3-Clause License
Copyright (c) 2024, Kalan-Lab
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
codoff utilizes pyrodigal underneath to perform gene calling for bacterial genomes when requested:
Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. JOSS, 2022. Martin Larralde.
Earlier work on using codon usage discrepancies to infer genomic islands was done by Waack et al. 2006:
Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics, 2006. Stephan Waack, Oliver Keller, Roman Asper, Thomas Brodag, Carsten Damm, Wolfgang Florian Fricke, Katharina Surovcik, Peter Meinicke, and Rainer Merkl.