https://link.springer.com/article/10.1186/s12864-023-09924-y
https://zenodo.org/records/10125088
We provide a conda / mamba environment that contains the necessary tools to generate the data needed to annotate a novel genome. It was tested on Ubuntu but should function in any UNIX setting.
conda env create -f environment.yaml
conda activate annotex
Please note that to use our structural annotation pipeline you first need a structure prediction for each protein.
If the proteome of interest is already on UniProt, AlphaFold predictions can be downloaded from the AlphaFold Protein Structure Database (https://github.com/google-deepmind/alphafold/blob/main/afdb/README.md). Follow the instructions from DeepMind to create and set up an account. The convenience script download_afdb.sh downloads the predictions, renames them and splits them into domains. The basic usage is:
./download_afdb.sh TAX_ID FOLDER_TO_DOWNLOAD_TO
If you need to generate structure predictions we recommend using local ColabFold (https://github.com/YoshitakaMo/localcolabfold) if you have access to a GPU workstation. After completing the installation the entire proteome can be predicted batchwise with
colabfold_batch your_proteome.fasta result_dir/
The predictions include many large JSON and image files. Save config.json and cite.bibtex for future reference. The necessary files (the rank 1 prediction for each protein) can then be moved and renamed with the prepare_cf.sh convenience script:
./prepare_cf.sh result_dir/
This will create one folder (pdbs) with the PDB / JSON files and one folder (domains) with all well-predicted domains.
Split the predicted structures into domains using pae_splitter.py. This will be done automatically if using the convenience scripts. The basic usage is
python pae_splitter.py --prediction_dir YOUR_PREDICTIONS/ --source CF
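The general idea of PAE-based domain splitting can be illustrated with a minimal sketch. This is not the logic of pae_splitter.py, whose implementation is not described here. It assumes a square PAE matrix (one row and column per residue), and the names `split_domains`, `pae_cutoff` and `min_size` are hypothetical. Residues are linked when their PAE is below the cutoff in both directions, and each connected group of linked residues is reported as one domain.

```python
def split_domains(pae, pae_cutoff=5.0, min_size=3):
    """Group residues into domains by single linkage on a PAE matrix.

    pae is a square list of lists. pae_cutoff and min_size are
    hypothetical illustration parameters, not pae_splitter.py options.
    """
    n = len(pae)
    parent = list(range(n))  # union-find forest, one node per residue

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # PAE is not symmetric, so require confidence in both directions.
            if max(pae[i][j], pae[j][i]) < pae_cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Drop small groups (e.g. isolated linker residues), order by first residue.
    kept = [g for g in groups.values() if len(g) >= min_size]
    return sorted(kept, key=lambda g: g[0])


# Toy matrix: residues 0-2 and 3-5 are each confident internally,
# but their relative placement is uncertain.
toy = [[1.0 if (i < 3) == (j < 3) else 20.0 for j in range(6)] for i in range(6)]
print(split_domains(toy))
```

Single linkage is used here only because it is easy to follow. A real splitter would likely also need to handle low-confidence (low pLDDT) regions and discontinuous domains.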
Prepare the Foldseek databases
foldseek databases PDB pdb tmp
foldseek databases Alphafold/UniProt50 uniprot50 tmp
Run the predicted proteome (folder with PDBs/CIFs) against the Foldseek databases
foldseek easy-search your_proteome/ pdb foldseek_pdb tmp --alignment-type 2
foldseek easy-search your_proteome/ uniprot50 foldseek_uniprot tmp --alignment-type 2
foldseek easy-search domains pdb foldseek_pdb_domains tmp --alignment-type 1
foldseek easy-search domains uniprot50 foldseek_uniprot_domains tmp --alignment-type 1
Download the nr from NCBI using your favourite tool and build the Diamond DB
aria2c -x 10 https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip -c nr.gz > nr.fasta
diamond makedb --in nr.fasta -d nr
Run Diamond
diamond blastp -d nr -q your_proteome.fasta -o diamond.tsv --ultra-sensitive
# Other sensitivities than --ultra-sensitive are viable if time is an issue
# See https://github.com/bbuchfink/diamond/wiki for other options
The easiest approach is to use the EggNOG webserver (http://eggnog-mapper.embl.de) for generating the EggNOG annotations. In the original manuscript the following non-default parameters were used: Percentage identity: 15%, Minimum hit bit-score: 40. Save the file in the annotation folder as eggnog.tsv
DeepTMHMM is also most easily run on its webserver (https://dtu.biolib.com/DeepTMHMM), but the webserver is limited to ~500 predictions per batch, which might not be viable for large proteomes. It can also be installed and run locally with Docker but requires some tinkering. ANNOTEX expects the resulting .3line file as DeepTMHMM.3line in the annotation data folder. This step is not strictly required but might provide extra clues for transmembrane / exported proteins
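For proteomes that exceed the webserver batch limit, the FASTA file can be cut into smaller files before upload. The following is a minimal sketch, not part of the ANNOTEX scripts. The function names and the output file prefix are hypothetical, and the batch size of 500 is taken from the approximate limit mentioned above.

```python
def fasta_batches(lines, batch_size=500):
    """Yield lists of FASTA lines, each holding at most batch_size records."""
    batch, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts: flush the batch first if it is already full.
            if n_records == batch_size:
                yield batch
                batch, n_records = [], 0
            n_records += 1
        batch.append(line)
    if batch:
        yield batch


def write_batches(fasta_path, out_prefix, batch_size=500):
    """Write out_prefix_001.fasta, out_prefix_002.fasta, ... and return the paths."""
    paths = []
    with open(fasta_path) as handle:
        for index, batch in enumerate(fasta_batches(handle, batch_size), start=1):
            path = f"{out_prefix}_{index:03d}.fasta"
            with open(path, "w") as out:
                out.writelines(batch)
            paths.append(path)
    return paths
```

A call such as `write_batches("your_proteome.fasta", "deeptmhmm_batch")` would produce one file per batch. The per-batch .3line results would then presumably need to be concatenated into a single DeepTMHMM.3line file. This is an assumption and has not been checked against ANNOTEX.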
TODO
ANNOTEX is installed by running “devel build PATH_TO_ANNOTATER_FOLDER; devel install PATH_TO_ANNOTATER_FOLDER; devel clean PATH_TO_ANNOTATER_FOLDER” in the ChimeraX command line. After successful installation, the user can load the previously created proteome predictions by selecting the “Data folder” path and pressing the “Search” button (Supplementary Figure 3b).
The structure prediction for the first protein-coding gene will be displayed in the ChimeraX viewer. The tool window will show the gene ID, a potential eggNOG annotation with a description of the protein function and E-value, if applicable, and a DeepTMHMM prediction that can be a globular protein (GLOB), a transmembrane domain (TMD), a signal peptide (SP), or both TMD and SP. Below these descriptions, the structural matches from PDB (blue), from the AlphaFold microsporidian proteomes and the 20 available model organisms (red), and from AlphaFold Swissprot (yellow) are listed, followed by the Diamond sequence-based blast hits (Supplementary Figure 3b). All entries are sorted according to their assigned E-value, which can help to assess the certainty of a structural or sequence-based match.
The protein of interest is displayed as a complete model #1 in rainbow colors according to the AlphaFold confidence (Supplementary Figure 3a, d left panel). If the domains’ position relative to each other is of high uncertainty, the model can be hidden in the Models window (Supplementary Figure 3a, lower panel) and the individual, well-predicted domains become visible. Further, single domains in the list can have a subcategory that indicates a “missing structure” due to the exclusion/removal of highly disordered regions or flexible linkers. To visually inspect the structural matches, the user can click on an entry and subsequently, the corresponding protein structure will be superimposed (automatically applying the matchmaker in ChimeraX) to the investigated protein in the model viewer (Supplementary Figure 3a, d right panel). The superimposed structure can either be a single protein or a protein complex. In the latter case, the protein of the complex with structural similarity to the protein of interest will be highlighted in green.
If the structural match is from a PDB entry, the protein “chain” of the complex matching the protein of interest is added as a suffix to the PDB ID as shown in Supplementary Figure 3a (PDB 4pz6_A) and d (PDB 3j96_G). Further, for any selected structural hit an info panel will appear in the console log listing “ID”, “Title”, “Chain information” and “Parameters” (Supplementary Figure 3c). Once the user has found an adequate functional annotation, enabling the checkbox “Overwrite annotation on click” allows the protein being investigated to be named after the matching protein by clicking on the corresponding entry (Supplementary Figure 3b). The complete genome annotation can be saved in tsv-format facilitating accessibility of the file and further editing in a suitable program. ANNOTEX was successfully tested using ChimeraX version 1.3 and 1.4.
Next, we used each individual predicted 3D structure as input for Foldseek searches employing the alignment type 3Di+AA Gotoh-Smith-Waterman (local, default) and ran it against three different databases: 1. PDB, 2. AlphaFold database from the 20 first annotated model organisms (accession date: 07-15-2022), one representative of each microsporidian clade (Figure 1a), and 3. SwissProt AlphaFold. Additionally, for the E. cuniculi proteins, individual, well-predicted protein domains were automatically separated using the Predicted Aligned Error (PAE) (45) and subjected to the TM-align algorithm in Foldseek. As a measure of confidence, the E-value is displayed for all Diamond and eggNOG searches, while the significance of the Foldseek searches varies with the alignment type: The bit score assesses 3Di+AA Gotoh-Smith-Waterman search results and the TM score (global score) represents the confidence of TM-align searches. Further, to predict the overall 3D structure and the presence of a SP or TMD for each analyzed protein, the Deep Transmembrane Helix Hidden Markov Model (DeepTMHMM) (1.0.20) software (93) was used.
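The ranking of hits by a confidence measure can be sketched as follows. This is an illustration and not the ANNOTEX implementation. It assumes the default 12-column tab-separated output that both Diamond and Foldseek easy-search write (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, bit score). The function name `rank_hits` and the column labels are hypothetical. Which column holds the TM score in TM-align mode should be checked in the Foldseek documentation for the installed version.

```python
import csv
import io

# Labels for the assumed default 12-column tabular output.
COLUMNS = ["query", "target", "identity", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def rank_hits(tsv_text, score="evalue", higher_is_better=False):
    """Return {query: [(target, score), ...]} with the best hit first."""
    hits = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) < len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        hits.setdefault(record["query"], []).append(
            (record["target"], float(record[score])))
    for query in hits:
        # E-values: smaller is better. Bit scores: larger is better.
        hits[query].sort(key=lambda hit: hit[1], reverse=higher_is_better)
    return hits


example = (
    "p1\tA\t0.30\t100\t70\t0\t1\t100\t1\t100\t1e-5\t80\n"
    "p1\tB\t0.55\t100\t45\t0\t1\t100\t1\t100\t1e-20\t150\n"
)
by_evalue = rank_hits(example)
by_bits = rank_hits(example, score="bits", higher_is_better=True)
print(by_evalue["p1"][0][0], by_bits["p1"][0][0])
```

Keeping the score column and sort direction as arguments lets the same function serve E-value-ranked Diamond hits and bit-score-ranked Foldseek hits.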
To combine and display the generated information and similarity matches for each V. necatrix input sequence, we developed a ChimeraX annotator plugin, that we named ANNOTEX (Supplementary Figure 3). It retrieves a list of all predicted V. necatrix protein 3D structures, shows the eggNOG annotation in the user interface (Supplementary Figure 3b), and presents a list of structural matches and sequence-based hits, respectively, along with corresponding confidence values. Further, the proteins corresponding to structural hits can be superimposed with the V. necatrix protein of interest, allowing for visual inspection of the structure match. Additionally, the overview of all structural and sequence hits per protein allows for manual curation and functional annotation according to the best match.
