This tool allows the automatic identification and analysis of MYBs in a CDS or peptide sequence collection. Since the quality of structural annotation has substantially improved in recent years, the focus can now shift towards functional annotation. RNA-seq and long-read sequencing of transcripts were fundamental for this development. There are numerous publications about R2R3-MYBs in various plant species. This tool allows the automatic investigation of the MYB gene family in any new species. The provided MYB sequences should be sufficient as baits in most plant species, so no manual preparation of input files is required. The results are a FASTA file containing the bona fide MYB sequences and a summary table with additional details. MYB candidates are screened for the presence of the conserved repeats and also for many other motifs. The function of each MYB candidate is predicted based on orthology to a previously characterized sequence (often in Arabidopsis thaliana).
While some dependencies are required, this tool does not require an installation. Downloading and executing the script on a Linux system is sufficient. There is currently no support for other operating systems. Most required modules are included in the initial Python installation, but dendropy might not be available on all systems.
git clone
cd MYB_annotator
How to install conda
echo 'export PATH="~/anaconda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
conda env create -f environment.yml
conda activate mybannotator
Python3 (sudo apt-get install python3.8). It is also possible to use other Python3 versions.
dendropy (sudo apt install python3-pip && python3 -m pip install git+
MAFFT (sudo apt-get install -y mafft)
BLAST (sudo apt-get install ncbi-blast+) or HMMER (conda install -c bioconda hmmer)
FastTree (sudo apt-get install -y fasttree) and/or RAxML-NG (precompiled binaries recommended)
python3 --baits <FILE> --info <FILE> --out <DIR> [--subject <FILE> | --subjectdir <DIR>]
--baits STR MYB bait sequence file
--info STR MYB info file
--out STR Output folder
--subject STR Subject file (OR: --subjectdir STR Subject dir)
--search STR Tool for initial search (blast|hmmer)[blast]
--mode STR Tool for tree construction (fasttree|raxml)[fasttree]
--hmm STR HMM file of MYB baits
--refmybs STR Reference MYB file
--cpu INT Number of threads [4]
--cdsinput NONE Changes expected input to CDS
--keepnames NONE Prevents splitting of sequence names at first space
--ath STR FASTA file with A. thaliana MYBs
--name STR Prefix of output file names
--collapse NONE Reduces paralogs to one representative
--motifs STR File with motifs to check in candidate sequences
--cpub INT Number of threads for BLASTp[cpu]
--cpur INT Number of threads for RAxML[cpu]
--mafft STR Path to MAFFT [mafft]
--blastp STR Path to blastp [blastp]
--hmmsearch STR Path to hmmsearch [hmmsearch]
--makeblastdb STR Path to makeblastdb [makeblastdb]
--fasttree STR Path to FastTree [fasttree]
--raxml STR Path to RAxML [raxml-ng]
--simcutp FLOAT BLASTp similarity cutoff [60]
--poscutp INT Max number of BLASTp hits per bait [100]
--lencutp INT Min BLASTp alignment length [75]
--numneighbours INT Neighbours to consider for classification [10]
--neighbourdist FLOAT Cutoff in neighbour identification [3]
--minneighbours INT Minimal number of bait neighbours for classification [0]
--paralogdist FLOAT Distance cutoff in paralog masking [10]
--baits specifies a FASTA file that contains the MYB bait sequences. Some of these sequences are bona fide MYBs and others are MYB-like sequences that are used as an outgroup.
--info specifies a text file that contains a table with all MYB IDs (matching the baits FASTA file) in the first column. Columns are TAB-separated. The second column contains an assignment to the ingroup ('in') or outgroup ('out'). Ingroup and outgroup baits are used to classify MYB candidates as bona fide MYBs or MYB-likes. For example, MYB-like/CDC5 sequences of A. thaliana are marked as outgroups in the baits_MYBs_info.txt. They are considered the closest known relatives of MYB sequences.
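The table layout described above can be read with a few lines of Python. This is a minimal sketch, not the parser used by the script: the function name `load_bait_info` is hypothetical, the IDs in the example are made up, and it assumes that the file has no header line and that only the first two TAB-separated columns matter.

```python
def load_bait_info(lines):
    """Map each bait ID to 'in' or 'out' from TAB-separated info lines.

    Assumption: no header line; column 1 = ID, column 2 = 'in' or 'out'.
    """
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        seq_id, group = parts[0].strip(), parts[1].strip()
        if group not in ("in", "out"):
            raise ValueError(f"unknown group {group!r} for {seq_id!r}")
        groups[seq_id] = group
    return groups


# Illustrative IDs only; real IDs must match the baits FASTA file.
example_lines = ["bait_MYB_1\tin\n", "bait_CDC5_1\tout\n"]
bait_groups = load_bait_info(example_lines)
```

Comparing the keys of the returned dictionary with the IDs in the baits FASTA file is a quick way to spot mismatches before starting a run.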
--out specifies the output folder. This folder will be created if it does not exist already. All files generated during the analysis will be placed in this folder. There is a subfolder called 'RESULTS' that contains the important final files.
--subject specifies the FASTA file that should be analysed. The default expectation is that this file contains peptide sequences. However, it is also possible to supply a collection of coding sequences (CDS) if the --cdsinput flag is set. Providing a file with only primary transcript/peptide sequences can give different results compared to a collection of all isoforms. This is especially important with respect to the MYB domain, which might be incomplete in some isoforms. If possible, avoid any special characters in the headers of your supplied sequences. However, the MYB_annotator contains a filtering step that will remove some characters. For example, tabs will be replaced by three spaces.
--subjectdir specifies a folder with FASTA files. All files need to contain sequences of the same type (PEP or CDS). Supported file extensions are '.fasta', '.fas', '.fa', '.FASTA', '.FAS', '.FA', '.fna', '.FNA', '.cds', '.CDS', '.pep', and '.PEP'.
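To check in advance which files in a folder would be picked up, the extension list above can be applied as follows. This is a sketch under the assumption that matching is case-sensitive and based only on the final file extension; the function names are hypothetical and not part of the script.

```python
from pathlib import Path

# Extensions copied from the list above; matching is assumed to be case-sensitive.
SUPPORTED_EXTENSIONS = {
    ".fasta", ".fas", ".fa", ".FASTA", ".FAS", ".FA",
    ".fna", ".FNA", ".cds", ".CDS", ".pep", ".PEP",
}


def is_supported_fasta(filename):
    """Return True if the file name ends with one of the supported extensions."""
    return Path(filename).suffix in SUPPORTED_EXTENSIONS


def find_subject_files(folder):
    """Return a sorted list of supported FASTA files directly inside 'folder'."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and is_supported_fasta(path.name)
    )
```

Files with other spellings, such as '.Fasta' or '.faa', would be skipped under this assumption and may need to be renamed.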
--search specifies the tool for the initial identification of candidates. BLAST or HMMER can be used for this purpose. Default is BLAST.
--mode specifies the tool for the construction of phylogenetic trees. RAxML ('raxml') or FastTree2 ('fasttree') can be specified. Default is 'fasttree'. Running the tree construction with RAxML is expected to give slightly more accurate results, but increases the run time from minutes to days in many cases.
--hmm specifies an HMM file that represents the bait MYBs. This file is required when using hmmsearch for the identification of initial candidates.
--refmybs specifies a text file with one MYB ID per line. All IDs listed in this file need to be present in the baits FASTA file and the info file. These IDs can be used for the functional annotation of the newly identified MYBs. This assignment is performed in two ways, leading to two different output tables. The functional annotation is lifted from previously characterized sequences to newly identified orthologs.
--cpu specifies the number of threads to use for the BLASTp search and also for the RAxML tree construction. Default: 4.
--cpub specifies the number of threads to use specifically for the BLASTp search. This option can be used if the number of cores to use for BLASTp and RAxML is different. Default is the value of --cpu.
--cpur specifies the number of threads to use specifically for the RAxML analysis. This option is only relevant if RAxML is used for the tree construction. This option can be used if the number of cores to use for BLASTp and RAxML is different. Default is the value of --cpu.
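The fallback of the two tool-specific thread options to the general thread count can be illustrated with argparse. This is only a sketch of the described behaviour; how the script itself parses its arguments is not shown here, and the function name `parse_threads` is hypothetical.

```python
import argparse


def parse_threads(argv):
    """Return (cpu, cpub, cpur); cpub and cpur fall back to cpu if not given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu", type=int, default=4)
    parser.add_argument("--cpub", type=int, default=None)
    parser.add_argument("--cpur", type=int, default=None)
    args = parser.parse_args(argv)
    cpub = args.cpub if args.cpub is not None else args.cpu
    cpur = args.cpur if args.cpur is not None else args.cpu
    return args.cpu, cpub, cpur


# 8 threads in general, but only 2 for the RAxML step.
threads = parse_threads(["--cpu", "8", "--cpur", "2"])
```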
--cdsinput changes the input expectation from PEP to CDS. The file needs to be supplied through --subject. This flag can be included at any place in the command and does not need a value. The coding sequences will be translated prior to the analysis.
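For readers who want to see what the translation of a CDS involves, here is a minimal sketch. It assumes the standard genetic code, translation in frame 1, termination at the first stop codon, and 'X' for codons containing ambiguous bases; the script may handle these cases differently.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence in frame 1 up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    peptide = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


peptide = translate("ATGGCCTGGTAA")
```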
--keepnames deactivates the trimming of the names of input sequences at the first space. This flag can be included at any place in the command and does not need a value.
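The effect of the name trimming on a FASTA header, together with the replacement of tabs by three spaces mentioned for the subject file, can be sketched as follows. The function name `clean_header` is hypothetical, and the order of the two steps (tab replacement first, then trimming) is an assumption rather than something taken from the script.

```python
def clean_header(header, keep_names=False):
    """Return the sequence name derived from a FASTA header line.

    Assumption: tabs are replaced before the name is cut at the first space.
    """
    name = header.lstrip(">").rstrip("\n").replace("\t", "   ")
    if not keep_names:
        name = name.split(" ")[0]  # default: keep only the part before the first space
    return name


short_name = clean_header(">gene1 putative MYB transcription factor\n")
full_name = clean_header(">gene1 putative MYB transcription factor\n", keep_names=True)
```

With the default behaviour, two sequences whose headers differ only after the first space would end up with the same name, which is one reason to check headers before a run.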
--ath specifies a FASTA file with Arabidopsis thaliana MYB sequences that are used to construct a final tree with the candidate sequences and these reference MYB sequences.
--name specifies a string that will be attached as prefix to all file names in the result folder. This option is useful if the result files of different analyses will be combined in one folder in a later step of the project. This name should NOT contain any special characters. Using only the digits 0-9 and the letters a-z or A-Z should be safe.
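A prefix can be checked against this recommendation before starting a run. The following is a small sketch with a hypothetical helper name; it only encodes the recommendation above and is not a check performed by the script.

```python
import re


def is_safe_prefix(name):
    """Return True if the prefix consists only of 0-9, a-z, and A-Z."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


ok = is_safe_prefix("Ctiglium2024")
```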
--collapse triggers an additional analysis step. Groups of paralogs will be represented by a single (the longest) sequence. This is helpful if a transcriptome assembly is analyzed, which might have many isoforms.
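The selection of the longest sequence per group can be sketched as below. The sketch assumes that the groups of paralogs are already known; how the script derives these groups from the tree is not covered, and the function name and the handling of ties (the first member of the group wins) are assumptions.

```python
def pick_representatives(paralog_groups, sequences):
    """Return the ID of the longest sequence of each group of paralogs.

    paralog_groups: list of lists of sequence IDs
    sequences: dict mapping sequence ID to peptide sequence
    """
    return [
        max(group, key=lambda seq_id: len(sequences[seq_id]))
        for group in paralog_groups
    ]


toy_sequences = {"t1": "MGRAPCC", "t2": "MGRAPCCEKVG", "t3": "MVRTP"}
representatives = pick_representatives([["t1", "t2"], ["t3"]], toy_sequences)
```

Singletons are simply groups with one member and are therefore kept unchanged.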
--motifs specifies a file with motif sequences and names. All candidate sequences will be screened for these motifs. Many subgroups are characterized by conserved motifs outside the R2R3 domain. This option allows an automatic analysis of all candidates and summarizes the results in a table.
--mafft specifies the full path to MAFFT including the binary.
--blastp specifies the full path to BLASTp including the binary.
--hmmsearch specifies the full path to hmmsearch including the binary.
--fasttree specifies the full path to FastTree2. This is required if fasttree is not included in the PATH variable. If in doubt, please specify the full path.
--raxml specifies the full path to RAxML-NG.
--simcutp specifies the minimal similarity of BLASTp hits to be considered in the initial identification of MYB candidates. Default: 60.0 (60% similarity).
--poscutp specifies the maximal number of hits (possibilities) per bait sequence in the BLASTp analysis. Default: 100.
--lencutp specifies the minimal alignment length of BLASTp hits to be considered in the initial identification of MYB candidates. Default: 75.
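How the three BLASTp cutoffs interact can be sketched on tabular BLAST output (the standard 12-column `-outfmt 6` layout). Several points are assumptions and not taken from the script: that the baits are the query, that 'similarity' corresponds to the percent identity column, that the cutoffs are inclusive, and that hits per bait are ranked by bit score. The function name is hypothetical.

```python
from collections import defaultdict


def filter_blast_hits(lines, min_similarity=60.0, min_length=75, max_hits=100):
    """Return the set of subject IDs that pass the cutoffs.

    Expected columns (BLAST -outfmt 6): qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    hits_per_bait = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        bait, subject = fields[0], fields[1]
        identity, length, bitscore = float(fields[2]), int(fields[3]), float(fields[11])
        if identity >= min_similarity and length >= min_length:
            hits_per_bait[bait].append((bitscore, subject))
    candidates = set()
    for hits in hits_per_bait.values():
        hits.sort(key=lambda hit: hit[0], reverse=True)  # best bit score first
        candidates.update(subject for _, subject in hits[:max_hits])
    return candidates
```

Lowering the similarity or length cutoff enlarges the initial candidate set; the later phylogenetic step is what removes MYB-like sequences.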
--numneighbours specifies the number of neighbouring sequences in a phylogenetic tree that will be considered for the classification as MYB or MYB-like, respectively. Default: 10.
--neighbourdist specifies the factor by which the average branch length is multiplied to obtain the distance cutoff for neighbouring sequences; sequences sitting on very long branches are ignored in this step. Increasing this value will enhance the sensitivity of sequence detection at the cost of specificity. Larger values should be tested when looking for completely new MYB clades in phylogenetic lineages that are not well covered by the collection of bait sequences. Default: 3. When searching for novel MYB clades that might not contain a single reference MYB sequence, increasing this value to 10 is recommended.
--minneighbours specifies the minimal number of neighbours that is required for the classification of sequences. If there are no close neighbours, a sequence might be sitting on an extremely long branch. Therefore, it can be appropriate to exclude sequences without any neighbouring sequences from the set of clean MYBs. Default: 0.
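The interplay of the three neighbour options can be sketched as follows. This is an illustration of the description above and not the script's implementation: the function name is hypothetical, the distances between a candidate and the baits are assumed to be precomputed from the tree, and the cutoff is assumed to be inclusive.

```python
def select_bait_neighbours(distances, mean_branch_length,
                           num_neighbours=10, neighbour_dist=3.0, min_neighbours=0):
    """Return the closest baits of one candidate, or None if there are too few.

    distances: dict mapping bait ID to its tree distance from the candidate
    """
    cutoff = neighbour_dist * mean_branch_length  # factor times average branch length
    close = sorted((dist, bait) for bait, dist in distances.items() if dist <= cutoff)
    neighbours = [bait for _, bait in close[:num_neighbours]]
    if len(neighbours) < min_neighbours:
        return None  # not enough close baits to classify this candidate
    return neighbours


toy_distances = {"baitA": 0.5, "baitB": 1.0, "baitC": 5.0, "baitD": 0.2}
neighbours = select_bait_neighbours(toy_distances, mean_branch_length=1.0)
```

In this toy case the cutoff is 3 x 1.0, so baitC at distance 5.0 is ignored; raising the factor to 10 would include it.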
--paralogdist specifies the maximal distance of paralogs in one clade during the masking/collapsing step. A clade of paralogs can be reduced to one representative sequence. This is particularly helpful when working with transcriptome assemblies that contain an inflated number of alternative transcripts. Default: 10.
WARNING: If errors occur or if parameters are changed, it is necessary to delete all output files to get clean files based on the new parameters. Existing files are usually not overwritten.
This example should help to understand how the script is used. The required MYB baits FASTA file, the MYB info, and the optional reference MYB file (Arabidopsis thaliana MYBs) are included in this repository.
python3 /my_folder/ \
--baits /my_folder/data/bait_MYBs.fasta \
--info /my_folder/data/bait_MYBs.txt \
--out /my_folder/test_analysis/ \
--subject /my_folder/data/Croton_tiglium_peptides.fasta \
--mode raxml \
--refmybs /my_folder/data/AthRefMYBs.txt \
--raxml /my_folder/bin/RAxML9/raxml-ng \
While running the script, several temporary files are generated, but these are not important. Please only look at the files located in the 'RESULTS' folder. Here is a list of the possible result files. The generation of some files depends on setting specific options. Therefore, it is possible that you do not see all these files in your result folder. If you specified a prefix (--name), all file names will start with that prefix and thus differ in this respect from the file names listed here.
00_documentation.txt this file stores the command used to run the analysis and also the versions of the tools used for the individual steps. Information contained in this file should be included in the method section when writing a publication about the results of this analysis.
01_initial_candidates.fasta this FASTA file contains all sequences in your subject set that have some similarity to the bait MYB sequences. This selection depends on the setting that you used for the BLASTp search. Usually, this is a very large collection of sequences that have only small similarity to MYBs. Many of these sequences will be discarded in the next step, but this collection is important to achieve high sensitivity. Truncated MYBs could be of high interest and should not be discarded due to short alignments.
02a_clean_MYBs.fasta this FASTA file contains all MYB candidates that fall into the clade of the bait MYB sequences, i.e. all MYB-like sequences should be removed at this stage. The accuracy might be low at the base of the MYB clade, but this approach should allow the inclusion of truncated MYBs that are sitting deep in one of the MYB subgroups.
02b_in_out_MYB_analysis_results.txt this file provides detailed information about the evidence for inclusion or exclusion, respectively, of each initial MYB candidate. The decision is based on phylogenetic distances between the sequences in your subject set and the bait MYBs supplied. If a candidate is close to many bait sequences that are considered bona fide MYBs, the sequence will be included. If a candidate is located in a clade with MYB-like bait sequences, it will NOT be considered a MYB.
- Score = Ratio of ingroup matches to outgroup matches.
- IngroupMatches = Number of nearest neighbours (only baits considered) in the tree that are ingroup members.
- OutgroupMatches = Number of nearest neighbours (only baits considered) in the tree that are outgroup members.
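The three columns can be reproduced for a single candidate as sketched below, taking the description of Score literally as ingroup matches divided by outgroup matches. The function name is hypothetical, and the handling of zero outgroup matches (infinity if there is at least one ingroup match, otherwise 0) is an assumption; the values written by the script may be defined differently.

```python
def ingroup_score(neighbour_baits, bait_groups):
    """Return (score, ingroup_matches, outgroup_matches) for one candidate.

    neighbour_baits: bait IDs that are the nearest neighbours of the candidate
    bait_groups: dict mapping bait ID to 'in' or 'out' (as in the info file)
    """
    ingroup = sum(1 for bait in neighbour_baits if bait_groups.get(bait) == "in")
    outgroup = sum(1 for bait in neighbour_baits if bait_groups.get(bait) == "out")
    if outgroup == 0:
        # Assumption: avoid division by zero when no outgroup bait is nearby.
        score = float("inf") if ingroup > 0 else 0.0
    else:
        score = ingroup / outgroup
    return score, ingroup, outgroup


result = ingroup_score(["b1", "b2", "b3"], {"b1": "in", "b2": "in", "b3": "out"})
```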
03a_group_around_ref_MYBs.txt this file is a table which assigns all MYB candidates in the subject species to the provided reference sequences. This file is only generated if such reference sequences are provided (--refmybs).
03b_new_2_ref_myb_mapping_file.txt this file is a table which assigns the best fitting reference MYB (hopefully the ortholog) to each of the candidate MYBs in your subject data set. This file is only generated if such reference sequences are provided (--refmybs).
- EdgeDistance = Number of edges between two leaves (sequences) on the tree.
- PatristicDistance = Cumulative length of all branches (edges) between two leaves (sequences) on the tree.
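The difference between the two distance measures is easiest to see on a toy tree. The sketch below uses plain Python with a hypothetical child-to-parent mapping and does not use dendropy or the script's own code; the edge distance counts the edges on the path between two leaves, and the patristic distance sums the branch lengths along the same path.

```python
def path_to_root(node, parents):
    """Return the list of nodes from 'node' up to the root."""
    path = [node]
    while node in parents:
        node = parents[node][0]
        path.append(node)
    return path


def leaf_distances(leaf_a, leaf_b, parents):
    """Return (edge_distance, patristic_distance) between two leaves.

    parents: dict mapping each non-root node to (parent, branch_length)
    """
    ancestors_a = set(path_to_root(leaf_a, parents))
    # The first ancestor of leaf_b that is also above leaf_a is their common ancestor.
    common = next(node for node in path_to_root(leaf_b, parents) if node in ancestors_a)
    edges, length = 0, 0.0
    for node in (leaf_a, leaf_b):
        while node != common:
            parent, branch_length = parents[node]
            edges += 1
            length += branch_length
            node = parent
    return edges, length


# Toy tree in Newick notation: ((A:0.1,B:0.2)n1:0.3,C:0.5)root;
toy_tree = {"A": ("n1", 0.1), "B": ("n1", 0.2), "n1": ("root", 0.3), "C": ("root", 0.5)}
edges_ac, patristic_ac = leaf_distances("A", "C", toy_tree)
```

For A and C the path runs over three edges (A to n1, n1 to root, root to C), and the branch lengths along this path add up to about 0.9.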
04a_MYB_domain_check.txt this table contains the results of a search for the MYB domains (1R, R2R3, 3R, and others) in all of the clean MYB candidates. Regular expressions are used to identify 3R, R2R3, or 1R motifs. If none of them is detected, the candidate is classified as other/pseudo-MYB. Here are the regular expressions used for the three repeats:
R1 = '\w{3,4}W\w{17,21}W\w{17,21}W\w{5,8}'
R2 = '\w{5}[WF]{1}\w{18,21}W\w{15,27}[WY]{1}\w{4}' #F at pos1 and Y at pos3 are very rare
R3 = "\w{5}[WLIMF]{1}\w{14,21}W\w{17,21}[WYF]{1}\w{4}" #diversity at pos1 is high, but pos2 and pos3 are conserved
These MYB domain patterns are based on Feng et al., 2017, Du et al., 2015, and Pucker et al., 2020. The category 'others' comprises all sequences that lack one of the aforementioned domains. Genes can be included in this category if the analyzed transcript/peptide isoform lacks a part of the domain. This can happen if only the 'primary' transcript/peptide sequence is provided.
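The three patterns can be tried on individual peptide sequences with Python's `re` module. This sketch only reports which repeat patterns match; how the script combines the matches into the categories 1R, R2R3, 3R, and others is not reproduced here. The function name is hypothetical, and the toy sequence is artificial (tryptophans spaced by alanines), not a real MYB.

```python
import re

# Patterns copied from the list above.
REPEAT_PATTERNS = {
    "R1": r"\w{3,4}W\w{17,21}W\w{17,21}W\w{5,8}",
    "R2": r"\w{5}[WF]{1}\w{18,21}W\w{15,27}[WY]{1}\w{4}",
    "R3": r"\w{5}[WLIMF]{1}\w{14,21}W\w{17,21}[WYF]{1}\w{4}",
}


def find_repeats(peptide):
    """Return the first match of each repeat pattern, or None if it is absent."""
    found = {}
    for name, pattern in REPEAT_PATTERNS.items():
        match = re.search(pattern, peptide)
        found[name] = match.group(0) if match else None
    return found


# Artificial sequence with three W residues separated by 19 residues each.
toy_peptide = "A" * 5 + "W" + "A" * 19 + "W" + "A" * 19 + "W" + "A" * 4
repeats = find_repeats(toy_peptide)
```

The toy sequence satisfies the R2 and R3 patterns but not R1, because R1 requires at least five residues after the third tryptophan. This also shows that a single stretch can match more than one pattern, so the matches need to be interpreted together.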
04b_motif_check.txt this table contains the results of a search for a set of provided motifs in all of the clean MYB candidates. This file is only generated if a file with motifs was supplied (--motifs).
04c_MYB_domain_check.fasta this file contains all MYB domains identified in the clean MYB candidates.
04d_MYB_domain_check.doc.txt this file quantifies the number of clean MYB candidates containing a certain domain. Statistics collected in this file can be combined into a summary table if more than one subject file is analyzed.
05 this tree is generated based on all MYB bait sequences and the identified clean MYB candidates of your subject file. The tool used for the tree construction depends on your settings.
06 this tree is generated based on all Arabidopsis thaliana MYBs and the identified clean MYB candidates of your subject file. The tool used for the tree construction depends on your settings.
07a_repr_MYBs.txt this file is generated if you choose to represent groups of paralogs by only one representative sequence (--collapse).
07b_repr_MYBs.fasta this file is generated if you choose to represent groups of paralogs by only one representative sequence (--collapse). The sequences contained in this FASTA file represent one group of paralogs. All singletons are included as well.
08a_repr_ath_MYBs.fasta this file contains all representative sequences (see #7) and also the Arabidopsis thaliana MYB sequences. Therefore, this file is only generated if groups of paralogs are collapsed to one representative sequence (--collapse) and the necessary Arabidopsis thaliana MYB sequences are provided (--ath).
08b this tree is generated based on all Arabidopsis thaliana MYBs and the representative MYB candidate sequences. The tool used for the tree construction depends on your settings. This tree file is only generated if the file '08a' was generated.
08d_MYB_domain_check.txt please see the full description at #04a for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
08e_MYB_domain_check.fasta please see the full description at #04c for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
08f_MYB_domain_check.doc.txt please see the full description at #04d for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
4_domain_detection_summary.txt this file is a summary of the MYB domain check results described in #04d across all analyzed species. This file is only generated if more than one species is investigated.
8_domain_detection_summary.txt this file is a summary of the MYB domain check results described in #08f across all analyzed species. It is similar to the one described above, but restricted to the representative sequences of paralog groups. This file is only generated if more than one species is investigated.
The MYB bait sequences are intended to cover a broad phylogenetic range. The bait sequence set comprises the MYB domains of:
Arabidopsis thaliana (Stracke et al., 2001)
Beta vulgaris (Stracke et al., 2014)
Musa acuminata (Pucker et al., 2020)
and a collection by Du et al., 2015: Medicago truncatula, Populus trichocarpa, Citrus sinensis, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, Aquilegia coerulea, Oryza sativa, Zea mays, Amborella trichopoda, Picea abies, Selaginella moellendorffii, Physcomitrella patens, Chlamydomonas reinhardtii, Volvox carteri, Micromonas pusilla, Ostreococcus lucimarinus, and Cyanidioschyzon merolae
