This repository contains a set of scripts and workflows designed to search for Respiratory Complex I (NADH:ubiquinone oxidoreductase) subunits in prokaryotic genomes and proteomes. The pipeline automates:
- Downloading genome and coding sequence (CDS) data.
- Pre-screening for annotated subunits.
- Merging sequences with InterPro data.
- Clustering and performing multiple sequence alignments (MSA).
- Building and utilizing HMM profiles for protein searches.
- Filtering and analyzing results, including statistical assessments of oxygen tolerance relationships.
Upon execution of 01_setup_dirs.py, the following directories will be created:
project_root/
│-- config.py # Configuration file with project paths
│-- data/
│ ├── sequence_data/ # Stores genomic, CDS, and proteome data
│ │ ├── genomes/
│ │ ├── cds/
│ │ ├── proteomes/
│ ├── genomic_metadata/ # Metadata related to genomes
│ │ ├── genome_metadata/
│ │ ├── ncbi_genome_records/
│ ├── cds_metadata/ # Metadata specific to CDS
│ ├── external_metadata/ # Additional metadata from external sources
│ ├── hmm_data/ # Contains HMM profiles and related sequence data
│ │ ├── profiles/
│ │ ├── msa/
│ │ ├── interpro_prot_seqs/
│ │ ├── cds_prot_seqs/
│ │ ├── clustered_prot_seqs/
│ │ ├── clustered_msa_seqs/
│ │ ├── combined_interpro_cds_seqs/
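The layout above could be created with a few lines of `pathlib`. The following is a minimal sketch, not the actual contents of 01_setup_dirs.py: the function name `create_layout` and the variable names it writes into config.py (for example `PROFILES_DIR`) are assumptions made for illustration.

```python
from pathlib import Path

# Subdirectories taken from the tree above, relative to data/.
SUBDIRS = [
    "sequence_data/genomes",
    "sequence_data/cds",
    "sequence_data/proteomes",
    "genomic_metadata/genome_metadata",
    "genomic_metadata/ncbi_genome_records",
    "cds_metadata",
    "external_metadata",
    "hmm_data/profiles",
    "hmm_data/msa",
    "hmm_data/interpro_prot_seqs",
    "hmm_data/cds_prot_seqs",
    "hmm_data/clustered_prot_seqs",
    "hmm_data/clustered_msa_seqs",
    "hmm_data/combined_interpro_cds_seqs",
]


def create_layout(project_root):
    """Create the data/ tree and write a config.py listing each path."""
    root = Path(project_root).resolve()
    data_dir = root / "data"
    lines = ["from pathlib import Path", "", f"DATA_DIR = Path({str(data_dir)!r})"]
    for sub in SUBDIRS:
        path = data_dir / sub
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        # Hypothetical naming scheme: hmm_data/profiles -> PROFILES_DIR
        name = Path(sub).name.upper() + "_DIR"
        lines.append(f"{name} = Path({str(path)!r})")
    config_path = root / "config.py"
    config_path.write_text("\n".join(lines) + "\n")
    return config_path
```

Calling `create_layout(".")` from the repository root would create the directories and a config.py that later scripts could import, for instance `from config import PROFILES_DIR` under the naming scheme assumed here.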
- 01_setup_dirs.py: Creates the necessary directory structure and generates a config.py file with essential paths.
- 02_fetch_taxonomy_prepare_downloads.py: Retrieves taxonomy information and generates FTP links for genome and CDS downloads.
- 03_download_genomes_cds.sh: Bash script that downloads genome and CDS files using the generated FTP links.
- 04_prescreen_cds.py: Screens CDS files for annotated Complex I subunits.
- 05_extract_genome_metadata.py: Extracts genome metadata for further processing.
- 06_extract_seqs_cds.py: Extracts sequences from CDS files for later HMM profiling.
- 07_fetch_interpro_seqs.py: Retrieves InterPro sequences and integrates them with the extracted CDS sequences.
- 08_hmm_pipeline.py: Constructs HMM profiles from MSAs of clustered sequences.
- 09_hmmer_search.py: Uses HMMER to search proteomes for Complex I subunits.
- 10_process_hmmer_results.py: Processes HMMER search results, combines them into a dataframe, and saves it in CSV format.
- post_search_01.ipynb: Same as 10_process_hmmer_results.py.
- post_search_02.ipynb: Intergenic distance and hit cluster analysis.
- post_search_03.ipynb: Kernel density estimation (KDE) of E-values and hit cluster analysis.
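To make the align, build, and search stages more concrete, here is a hedged sketch of how MAFFT and HMMER could be driven from Python. It is not the code in 08_hmm_pipeline.py or 09_hmmer_search.py: the function names, the file arguments, and the `1e-5` E-value cutoff are assumptions, and the MMseqs2 clustering step is left out. The `run` parameter exists only so the functions can be exercised without the tools installed.

```python
import subprocess


def align_and_build(clustered_fasta, msa_fasta, profile_hmm, run=subprocess.run):
    """Align sequences with MAFFT, then build an HMM profile with hmmbuild."""
    # MAFFT writes the alignment to stdout, so redirect it into the MSA file.
    with open(msa_fasta, "w") as out:
        run(["mafft", "--auto", str(clustered_fasta)], stdout=out, check=True)
    # hmmbuild takes the output profile first, then the input alignment.
    run(["hmmbuild", str(profile_hmm), str(msa_fasta)], check=True)


def search_proteome(profile_hmm, proteome_fasta, tblout, evalue=1e-5,
                    run=subprocess.run):
    """Search a proteome with hmmsearch and save a per-sequence hit table."""
    cmd = [
        "hmmsearch",
        "--tblout", str(tblout),   # tabular per-target output
        "-E", str(evalue),         # report threshold (assumed value)
        str(profile_hmm),
        str(proteome_fasta),
    ]
    # The human-readable report on stdout is discarded; the table is kept.
    run(cmd, stdout=subprocess.DEVNULL, check=True)
```

With `check=True`, a non-zero exit status from any tool raises `subprocess.CalledProcessError`, so a failed alignment does not silently produce an empty profile.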
- Set Up Project Directories:
python 01_setup_dirs.py
- Fetch Taxonomy and Prepare FTP Links:
python 02_fetch_taxonomy_prepare_downloads.py
- Download Genome and CDS Files:
bash 03_download_genomes_cds.sh
- Pre-screen CDS and Extract Metadata:
python 04_prescreen_cds.py
python 05_extract_genome_metadata.py
python 06_extract_seqs_cds.py
- Integrate InterPro Data and Build HMM Models:
python 07_fetch_interpro_seqs.py
python 08_hmm_pipeline.py
- Run HMMER Search and Process Results:
python 09_hmmer_search.py
python 10_process_hmmer_results.py
- Perform Post-Processing Analysis:
Open and run the Jupyter notebooks post_search_01.ipynb and post_search_02.ipynb for visualization and statistical assessments.
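The scripted steps above can also be chained in a small driver so that a failure stops the run. This is a sketch only: the repository does not ship such a driver, and `run_pipeline` and its `start` parameter are hypothetical names. It assumes the scripts are run from the repository root.

```python
import subprocess
import sys

# Script order taken from the steps above; notebooks are run manually.
STEPS = [
    [sys.executable, "01_setup_dirs.py"],
    [sys.executable, "02_fetch_taxonomy_prepare_downloads.py"],
    ["bash", "03_download_genomes_cds.sh"],
    [sys.executable, "04_prescreen_cds.py"],
    [sys.executable, "05_extract_genome_metadata.py"],
    [sys.executable, "06_extract_seqs_cds.py"],
    [sys.executable, "07_fetch_interpro_seqs.py"],
    [sys.executable, "08_hmm_pipeline.py"],
    [sys.executable, "09_hmmer_search.py"],
    [sys.executable, "10_process_hmmer_results.py"],
]


def run_pipeline(steps=STEPS, start=1, run=subprocess.run):
    """Run steps in order from 1-based step `start`; stop at the first failure."""
    for number, cmd in enumerate(steps, start=1):
        if number < start:
            continue  # skip steps that were already completed
        print(f"[{number}/{len(steps)}] {' '.join(cmd)}")
        run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

For example, `run_pipeline(start=9)` would rerun only the HMMER search and result processing, which is convenient once the downloads and profiles already exist.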
- Python (≥3.8)
- Biopython
- HMMER
- MAFFT
- MMseqs2
- FastTree
- IQ-TREE
- wget (for bash scripts)
- Jupyter Notebook (for post-processing analysis)
- Users can modify scripts to search for different protein families by adjusting HMM profiles and filtering criteria.
- The config.py file centralizes all path references, ensuring easy modification if the project structure changes.
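As an illustration of what "adjusting filtering criteria" could look like, the sketch below parses the table written by `hmmsearch --tblout` and keeps hits under an E-value cutoff. It is not the logic in 10_process_hmmer_results.py: the function names and the default thresholds are assumptions, and plain dictionaries are used instead of a dataframe to keep the example dependency-free.

```python
def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dictionaries."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-separated columns, then a free-text description.
        fields = line.split(maxsplit=18)
        hits.append({
            "target": fields[0],         # target sequence name
            "query": fields[2],          # profile (query) name
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


def filter_hits(hits, max_evalue=1e-5, min_score=0.0):
    """Keep hits at or below max_evalue and at or above min_score (assumed defaults)."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["score"] >= min_score]
```

A typical call would be `filter_hits(parse_tblout(open("hits.tbl")), max_evalue=1e-10)`, where `hits.tbl` is a placeholder file name. Searching for a different protein family would then mostly be a matter of swapping the profiles and choosing thresholds suited to that family.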
For questions or issues, please reach out via GitHub Issues.