A Python script for the comparison of Prodigal and Pyrodigal predictions.
Prodigal is a gene prediction tool for prokaryotic genomes. It is widely used in bioinformatics; however, the latest release dates from 2016 and does not include bug fixes that were made afterwards but never released. Pyrodigal provides Cython bindings and a Python interface to Prodigal, including the unreleased bug fixes and other optimizations.
The Python script compare.py compares the predictions of Prodigal and Pyrodigal to find mismatches or missing predictions. The differences in the predictions are stored in a TSV file.
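The idea of finding mismatched or missing predictions can be sketched as a set difference over CDS coordinates. This is a minimal illustration, not the logic of compare.py itself: the `Cds` tuple and the functions `parse_gff_cds` and `diff_predictions` are hypothetical names, and the sketch assumes that both tools' predictions can be reduced to (contig, start, end, strand) records, here read from standard nine-column GFF3 text.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    """Hypothetical record: one predicted coding sequence."""
    contig: str
    start: int
    end: int
    strand: str


def parse_gff_cds(text: str) -> set[Cds]:
    """Collect CDS coordinates from tab-separated, nine-column GFF3 text."""
    found = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue  # ignore malformed rows and non-CDS features
        found.add(Cds(fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return found


def diff_predictions(prodigal: set[Cds], pyrodigal: set[Cds]):
    """Return (only_in_prodigal, only_in_pyrodigal), each as a sorted list."""
    return sorted(prodigal - pyrodigal), sorted(pyrodigal - prodigal)
```

A prediction that differs only in its start or end coordinate shows up in both lists, once per tool, while a prediction missing from one tool shows up in only one list.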
To compare Prodigal and Pyrodigal correctly, a current build of Prodigal is required. Because no newer release is available, the user must compile it from source:
git clone https://github.com/hyattpd/Prodigal.git
cd Prodigal
make install
If you want to install Prodigal in a custom directory, use:
make install INSTALLDIR=/where/i/want/prodigal/
compare.py requires the additional Python packages:
- biopython
- xopen
- pyrodigal
The packages can be installed with Pip or Conda.
python3 -m pip install --user biopython xopen pyrodigal
conda install -c conda-forge -c bioconda biopython xopen pyrodigal
usage: compare.py [-h] [--genome GENOME [GENOME ...]] [--prodigal PRODIGAL] [--closed] [--output OUTPUT]
Compare CDS predictions of Prodigal to the predictions of Pyrodigal and save the differences in a TSV file.
optional arguments:
-h, --help show this help message and exit
--genome GENOME [GENOME ...], -g GENOME [GENOME ...]
Input genomes (/some/path/*.fasta)
--prodigal PRODIGAL, -p PRODIGAL
Path to a newly compiled Prodigal binary.
--closed, -c Closed ends. Do not allow genes to run off edges.
--output OUTPUT, -o OUTPUT
Output path (default="./comparison")
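For readers who want to see how an interface like the one above maps onto Python, the following is a sketch of an equivalent `argparse` declaration. It is reconstructed from the help text only and is not taken from compare.py; `build_parser` is a hypothetical name. Note that `nargs="+"` is what lets the shell expand `/some/path/*.fasta` into several arguments for a single `--genome` option.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="compare.py",
        description=(
            "Compare CDS predictions of Prodigal to the predictions of "
            "Pyrodigal and save the differences in a TSV file."
        ),
    )
    # One or more genome paths; a shell glob expands to several values.
    parser.add_argument("--genome", "-g", nargs="+",
                        help="Input genomes (/some/path/*.fasta)")
    parser.add_argument("--prodigal", "-p",
                        help="Path to a newly compiled Prodigal binary.")
    # Flag without a value: False unless given on the command line.
    parser.add_argument("--closed", "-c", action="store_true",
                        help="Closed ends. Do not allow genes to run off edges.")
    parser.add_argument("--output", "-o", default="./comparison",
                        help='Output path (default="./comparison")')
    return parser
```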
- Input: single compressed genome
./compare.py --genome /path/to/genome.fasta.gz --prodigal /path/to/binary/prodigal
- Input: multiple genomes
./compare.py --genome /path/to/*.fasta --prodigal /path/to/binary/prodigal
- To download 50 test genomes you can use:
wget -i test_genomes.txt -P /some/path/
compare.py accepts one or more genomes in FASTA format as input; the files may be compressed.
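Reading plain and compressed FASTA files through the same code path can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's own reader: compare.py lists xopen and biopython as dependencies, which presumably cover this, whereas the hypothetical helpers `open_maybe_gzip` and `read_fasta` below only handle gzip and detect it by its magic bytes.

```python
import gzip
from pathlib import Path
from typing import Iterator, Tuple


def open_maybe_gzip(path):
    """Open a text file, decompressing it if it starts with the gzip magic bytes."""
    path = Path(path)
    with path.open("rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else path.open("rt")


def read_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA file."""
    record_id, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(chunks)
                header = line[1:].split()
                # The ID is the first whitespace-delimited token of the header.
                record_id, chunks = (header[0] if header else ""), []
            elif line and record_id is not None:
                chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)
```

Detecting compression from the file content rather than the extension means a mislabeled `.fasta` file that is actually gzipped is still read correctly.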
- A short summary of each comparison is printed to stdout:
Hits genome=GCF_000006765: prodigal=5681, pyrodigal=5681, equal=True
- The comparison output directory contains a TSV file, mismatches.tsv, with the differing predictions.
- The comparison/tmp directory contains the Prodigal training file and GFF output for each genome used.
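The outputs listed above can be post-processed with a few lines of Python. The sketch below parses a summary line in the format shown and loads a TSV file into dictionaries. The function names `parse_summary` and `read_mismatches` are hypothetical, and the sketch assumes that mismatches.tsv starts with a header row; its column names are not documented here, so none are hard-coded.

```python
import csv
import re

# Pattern for lines such as:
# Hits genome=GCF_000006765: prodigal=5681, pyrodigal=5681, equal=True
SUMMARY_RE = re.compile(
    r"^Hits genome=(?P<genome>[^:\s]+): "
    r"prodigal=(?P<prodigal>\d+), "
    r"pyrodigal=(?P<pyrodigal>\d+), "
    r"equal=(?P<equal>True|False)$"
)


def parse_summary(line: str):
    """Return the fields of one summary line as a dict, or None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "genome": match["genome"],
        "prodigal": int(match["prodigal"]),
        "pyrodigal": int(match["pyrodigal"]),
        "equal": match["equal"] == "True",
    }


def read_mismatches(path):
    """Load a tab-separated file into a list of dicts (assumes a header row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


summary = parse_summary(
    "Hits genome=GCF_000006765: prodigal=5681, pyrodigal=5681, equal=True"
)
print(summary)
# {'genome': 'GCF_000006765', 'prodigal': 5681, 'pyrodigal': 5681, 'equal': True}
```

Collecting the parsed summaries for all genomes and filtering on `equal` gives a quick list of the genomes worth inspecting in mismatches.tsv.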
As of Pyrodigal v2.0.0-rc.3, its predictions no longer show any mismatches compared to Prodigal (commit 31b300a99a39964893057128ea10338e9a26bd6c, branch GoogleImport).
GNU General Public License v3.0