CDDLeiden/PyTEA-O

PyTEA-O

This repository contains a Python implementation of the Two-Entropies Analysis as described in Ye et al., 2008 for protein sequence variation analysis.

Introduction

About protein sequence variation analysis based on an MSA of homologous sequences

Protein sequence variation analysis based on a multiple sequence alignment (MSA) provides evolutionary information that helps explain how sequence changes relate to protein function. By aligning homologous sequences from different species (or individuals), conserved residues that are critical for structural integrity or catalytic activity can be distinguished from variable regions. Highly conserved positions often indicate essential functional or structural roles. Protein sequence variation analysis can therefore be used to identify functionally relevant residues from an MSA and to prioritize protein variants for experimental study.

About Two-Entropies Analysis

Two-Entropies Analysis (TEA) was first introduced by Ye et al. in 2008. TEA is a method used to identify residues that account for functional differences among protein subfamilies. It applies Shannon entropy as a measure of sequence conservation, calculating it both across all sequences at a given alignment position (global entropy) and separately within each subfamily, where the values are then averaged. By comparing these two measures, TEA can distinguish residues that are universally conserved from those that are variable yet still functionally important within specific subfamilies.
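The two measures described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this repository: the function names, the toy alignment, and the subfamily grouping are made up for the example, gaps are treated like any other symbol, and the entropy is reported in unnormalized bits (PyTEA-O may scale or handle gaps differently).

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    # p * log2(1 / p) is the usual -p * log2(p), written to avoid a -0.0 result.
    return sum((n / total) * math.log2(total / n) for n in Counter(column).values())


def two_entropies(alignment, subfamilies):
    """Return (global_entropy, average_entropy), one value per alignment position.

    alignment:   dict mapping sequence ID -> aligned sequence (all equal length)
    subfamilies: list of lists of sequence IDs (a hypothetical grouping)
    """
    length = len(next(iter(alignment.values())))
    global_entropy, average_entropy = [], []
    for pos in range(length):
        # Global entropy: all sequences at this position.
        global_entropy.append(shannon_entropy([seq[pos] for seq in alignment.values()]))
        # Average entropy: entropy within each subfamily, then the mean.
        per_family = [
            shannon_entropy([alignment[sid][pos] for sid in family])
            for family in subfamilies
        ]
        average_entropy.append(sum(per_family) / len(per_family))
    return global_entropy, average_entropy


# Toy alignment (illustrative only): two subfamilies of two sequences each.
toy_alignment = {"a1": "MKA", "a2": "MKC", "b1": "MRD", "b2": "MRE"}
toy_global, toy_average = two_entropies(toy_alignment, [["a1", "a2"], ["b1", "b2"]])
```

In the toy alignment, the second position is conserved within each subfamily but differs between them, so its average entropy is zero while its global entropy is not. This is the pattern TEA looks for when flagging candidate specificity-determining positions.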

An example figure in which the average entropy is plotted against the global entropy is shown below. The data shown is a synthetic dataset described in the original publication. The figure was adapted from the original TEA publication.

[Figure: example_SE, average entropy versus global entropy for the synthetic dataset]

Each point in the figure represents a position in the MSA, and its location in the plot provides hints into its functional role:

Lower left corner: Positions here are highly conserved (low global entropy). These residues are usually critical for core functions of the protein, such as stability or catalysis.

Upper left corner: Positions are conserved within subfamilies (low average entropy) but differ between subfamilies (high global entropy). These are often specificity-determining positions.

Upper right corner: Positions show little conservation (both entropies high). These residues are typically less important for protein function.

TEA versus TEA-O

In the original Two-Entropies Analysis (TEA), users must manually define protein subfamilies, which can lead to inconsistent or non-reproducible results. To address this limitation, Ye et al. later introduced an objective version of the algorithm (TEA-O), which automatically partitions the sequences into subgroups and then averages the Shannon entropy at each alignment position across these objectively defined groups. This automated procedure removes subjectivity from subgroup selection and results in more reproducible identification of residues linked to functional divergence.

Why this Python implementation?

The original source code is no longer available, so we re-implemented the method in Python.
This ensures the approach remains accessible, reproducible, and easy to integrate into modern bioinformatics pipelines.

Setup

Prerequisites

Creating a Conda Environment

  1. Clone the repository

    git clone https://github.com/CDDLeiden/PyTEA-O.git
    cd PyTEA-O
  2. Create a new Conda environment

    conda env create --file environment.yaml -n TEAO
  3. Activate the Conda environment:

    conda activate TEAO
  4. (Optional) Deactivate the Conda environment:

    conda deactivate

Usage

To perform a TEA, the only input needed is an MSA in FASTA or CLUSTAL format. To help you get started quickly, we provide an example MSA of homologous ubiquitin sequences in the /example directory. Running the TEA calculations on this MSA lets you verify that everything is working as expected.
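Before running the pipeline, it can be useful to check that an aligned FASTA file is well formed. The sketch below is a standalone helper and not part of PyTEA-O; the function name and the toy records are hypothetical. It reads FASTA records from any iterable of lines and verifies that all aligned sequences have the same length.

```python
def parse_fasta_alignment(lines):
    """Parse aligned FASTA records into a dict of sequence ID -> aligned sequence.

    The ID is taken as the first whitespace-delimited token of the header line.
    Raises ValueError if the sequences do not all have the same length.
    """
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequences may be wrapped over several lines; collect the pieces.
            records[current_id].append(line)
    alignment = {seq_id: "".join(parts) for seq_id, parts in records.items()}
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("sequences have different lengths; is the input aligned?")
    return alignment


# Toy input (illustrative only); an open file handle can be passed the same way.
toy_lines = [">seq1 example record", "MK-", "A", ">seq2", "MRCA"]
toy_records = parse_fasta_alignment(toy_lines)
```

Because the function accepts any iterable of lines, the example alignment could be checked by passing an open file handle, for instance `parse_fasta_alignment(open("example/ubiquitin.fasta.aligned"))`.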

Running the TEA(O) calculations

First, create subgroups using the UPGMA algorithm:

python src/upgma_subfam_grouping.py \
    -m example/ubiquitin.fasta.aligned \
    -o example/results/subfamilies

Next, start the entropy calculations using:

python src/run_TEA.py \
    -m example/ubiquitin.fasta.aligned \
    -o example/results/SE \
    -r "KAF8060592" \
    -t 12 \
    -e "TEAO" \
    -f example/results/subfamilies/upgma.subfamilies \
    -c config/graph_results.cfg

Here, -r is the ID of the reference sequence, -t is the number of threads to use, and -f is the subfamily file created by upgma_subfam_grouping.py.

Interpreting the results

When the code is executed, several output files are generated. The main visualization can be found in shannon_entropy.png, which contains a multi-panel figure summarizing entropy calculations and related characteristics for each alignment position. The example figure below shows the output for our sample MSA.

[Figure: TEA_ubiquitin, multi-panel output for the example ubiquitin MSA]

Panel 1: Shannon entropy values

This panel compares global entropy with average entropy across all alignment positions.
Examining global vs. average entropy can help identify residues that are potentially functionally important.

If you want to plot entropy in other ways, the raw values are available in:

  • TEA.superfamily_entropy (global entropy values per residue)
  • TEA.average_entropy (average entropy values per residue)

Panel 2: Specificity/conservation scoring

This panel shows conservation and specificity scores for each alignment position. These are calculated as:

residue_conservation_scores = -sqrt((ase_values)^2 + (se_values)^2) + 0.5
residue_specificity_scores = -sqrt((ase_values)^2 + (se_values - 1)^2) + 0.5

  • A higher conservation score indicates a position that is strongly conserved across the alignment.
  • A higher specificity score suggests that a residue is differentially conserved between protein families, potentially pointing to specificity-determining positions.
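The two formulas above can be written as plain Python functions. This sketch assumes that `ase` is the average entropy and `se` the global entropy of a single position, both scaled to the range 0 to 1; the README does not state the scaling explicitly, so treat that as an assumption. The function names are illustrative and not taken from the repository.

```python
import math


def conservation_score(ase, se):
    """0.5 minus the distance to the point (ase=0, se=0): high when both entropies are low."""
    return -math.sqrt(ase ** 2 + se ** 2) + 0.5


def specificity_score(ase, se):
    """0.5 minus the distance to the point (ase=0, se=1): high when the average
    entropy is low and the global entropy is high."""
    return -math.sqrt(ase ** 2 + (se - 1) ** 2) + 0.5


# A fully conserved position (assumed scaled entropies 0, 0).
conserved = (conservation_score(0.0, 0.0), specificity_score(0.0, 0.0))
# A position conserved within subfamilies but different between them (0, 1).
specific = (conservation_score(0.0, 1.0), specificity_score(0.0, 1.0))
```

Under these assumptions both scores peak at 0.5: the conservation score at the fully conserved corner of the entropy plot, and the specificity score at the corner with low average entropy and high global entropy.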

Panel 3: Residue presence matrix

Each amino acid is represented by a distinct color, while white indicates absence at a given position.
This panel makes it easy to spot which residues are present at each alignment position. The data for this plot is stored in consensus.logo.

Panel 4: Residue variants

This panel reports the total number of different amino acids (out of the 20 natural residues) present at each alignment position.
High variability suggests low conservation, while low variability indicates strongly conserved residues.
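Counting the variants per position is straightforward to reproduce. The sketch below is an independent illustration with hypothetical names and toy sequences; it counts only the 20 standard amino acids, so gaps and ambiguity codes such as X are ignored, which may differ from how PyTEA-O treats them.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def residue_variants(sequences):
    """Number of distinct standard amino acids at each alignment position."""
    length = len(sequences[0])
    return [
        len({seq[pos].upper() for seq in sequences} & STANDARD_RESIDUES)
        for pos in range(length)
    ]


# Toy alignment (illustrative only); the last column holds only gaps and an X.
toy_sequences = ["MKA-", "MKC-", "MRDX", "MRE-"]
toy_variants = residue_variants(toy_sequences)
```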

Panel 5: Residue frequency distribution

Using the same color scheme as panel 3, this plot shows the frequency of each residue at every alignment position.
This provides a more detailed view of amino acid distributions.
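As a rough sketch of the underlying calculation, the per-position frequencies can be computed as shown below. The names and toy sequences are hypothetical, and the choice to exclude gaps from the denominator is an assumption; the repository may normalize differently.

```python
from collections import Counter


def residue_frequencies(sequences, gap_chars="-."):
    """Relative frequency of each residue at every alignment position.

    Gap characters are excluded before normalizing (an assumption of this
    sketch). Returns one dict per position; all-gap columns yield an empty dict.
    """
    length = len(sequences[0])
    frequencies = []
    for pos in range(length):
        column = [seq[pos] for seq in sequences if seq[pos] not in gap_chars]
        counts = Counter(column)
        frequencies.append({res: n / len(column) for res, n in counts.items()})
    return frequencies


# Toy alignment (illustrative only).
toy_freqs = residue_frequencies(["MKA", "MKC", "MRD", "MR-"])
```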

Panel 6: Z-scale heatmap

This panel visualizes the standard deviations for Z-scale 1–3, which correspond to:

  1. Lipophilicity
  2. Steric bulk/polarizability
  3. Polarity/charge

Darker colors indicate greater variability in these properties at a given position.
More information about Z-scales can be found in Sandberg et al., 1998.

The individual Z-scale values are available in prot_descriptors_ZscaleSandberg.tsv.

License

The software is licensed under CC-BY-NC, which means it is free to use, remix, and build upon for non-commercial purposes, as long as the terms of the license are preserved. Commercial use is not permitted. See the LICENSE file for the full terms. If you have questions about the license or about using the software in your organization, please contact Gerard J.P. van Westen at gerard@lacdr.leidenuniv.nl.
