A command-line tool for reconstructing pseudogenes in prokaryotic genomes.
# Clone the repository
git clone https://github.com/Floto-Lab/pseudomancer.git
cd pseudomancer
# Create conda environment with all dependencies
mamba env create -f environment.yml
mamba activate pseudomancer_env
Requirements: Mamba or Conda package manager
# Activate environment (if not already active)
mamba activate pseudomancer_env
# Run pipeline
python -m pseudomancer --taxon Mycobacterium --genome target_genome.fasta --out_dir results/
- --taxon: Taxon name for downloading reference proteins from NCBI RefSeq (e.g., 'Mycobacterium' or 'Mycobacterium tuberculosis')
- --genome: Target genome FASTA file to search for pseudogenes
- --out_dir: Output directory for results
- --evalue: E-value threshold (default: 1e-5)
- --cluster: Enable protein clustering at 99% identity (default: False)
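To make the option semantics concrete, the flags listed above could be declared with argparse roughly as follows. This is only a sketch: the function name `build_parser`, the help text wording, and the choice to mark `--taxon`, `--genome`, and `--out_dir` as required are assumptions for illustration, not taken from the pseudomancer source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="pseudomancer")
    # Assumption: the three path/name options are mandatory.
    parser.add_argument("--taxon", required=True,
                        help="Taxon name for NCBI RefSeq reference proteins")
    parser.add_argument("--genome", required=True,
                        help="Target genome FASTA file")
    parser.add_argument("--out_dir", required=True,
                        help="Output directory for results")
    # Documented default: 1e-5.
    parser.add_argument("--evalue", type=float, default=1e-5,
                        help="E-value threshold")
    # Documented default: False, so a store_true switch fits.
    # "%%" is an escaped percent sign in argparse help strings.
    parser.add_argument("--cluster", action="store_true",
                        help="Cluster proteins at 99%% identity")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--taxon", "Mycobacterium",
         "--genome", "target_genome.fasta",
         "--out_dir", "results/"]
    )
    print(args.taxon, args.genome, args.out_dir, args.evalue, args.cluster)
```

With a parser like this, omitting `--evalue` and `--cluster` yields the documented defaults, and passing `--cluster` on its own switches clustering on.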
- Identifies all open reading frames (ORFs) in the target genome using getorf
- Downloads all complete, annotated genomes for the specified taxon (can be a species or an entire genus) from NCBI RefSeq
- Extracts and merges protein sequences from all assemblies
- Optionally clusters proteins at 99% identity using mmseqs2 to create a non-redundant dataset (enabled with --cluster)
- Searches the (clustered or unclustered) proteins against your target genome using mmseqs2 (tblastn-like search)
- Outputs results in tabular format with alignment statistics
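The ORF-finding, clustering, and search steps above can be sketched as a list of external commands. This is an illustration under stated assumptions, not pseudomancer's actual implementation: the helper names, the output file names (`orfs.fasta`, `clustered`, `hits.tsv`, `tmp`), and the exact tool arguments are hypothetical choices, and the RefSeq download step is skipped by assuming the merged reference proteins already exist in a single FASTA file. The code only builds the command lines; it does not run them.

```python
from pathlib import Path


def getorf_cmd(genome: Path, orfs_out: Path) -> list[str]:
    # EMBOSS getorf: -sequence is the input, -outseq the output file.
    return ["getorf", "-sequence", str(genome), "-outseq", str(orfs_out)]


def cluster_cmd(proteins: Path, prefix: Path, tmp: Path) -> list[str]:
    # mmseqs easy-cluster writes <prefix>_rep_seq.fasta with one
    # representative sequence per cluster.
    return ["mmseqs", "easy-cluster", str(proteins), str(prefix), str(tmp),
            "--min-seq-id", "0.99"]


def search_cmd(proteins: Path, genome: Path, hits_out: Path, tmp: Path,
               evalue: float) -> list[str]:
    # Protein queries against a nucleotide target: a tblastn-like search.
    return ["mmseqs", "easy-search", str(proteins), str(genome),
            str(hits_out), str(tmp), "-e", str(evalue)]


def plan(genome: Path, proteins: Path, out_dir: Path,
         evalue: float = 1e-5, cluster: bool = False) -> list[list[str]]:
    """Return the commands in execution order (hypothetical layout)."""
    tmp = out_dir / "tmp"
    cmds = [getorf_cmd(genome, out_dir / "orfs.fasta")]
    query = proteins
    if cluster:
        prefix = out_dir / "clustered"
        cmds.append(cluster_cmd(proteins, prefix, tmp))
        # Search with the cluster representatives instead of all proteins.
        query = Path(str(prefix) + "_rep_seq.fasta")
    cmds.append(search_cmd(query, genome, out_dir / "hits.tsv", tmp, evalue))
    return cmds


if __name__ == "__main__":
    for cmd in plan(Path("target_genome.fasta"), Path("proteins.faa"),
                    Path("results"), cluster=True):
        print(" ".join(cmd))
```

Once getorf and mmseqs2 are on the PATH (for example inside the conda environment), each list could be passed to `subprocess.run(cmd, check=True)`.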