GO2Sum 2.0 is a deep learning framework for generating concise, human-readable summaries of protein functions using Gene Ontology (GO) terms. This version builds upon the original GO2Sum by integrating transformer models like T5, TinyLlama, and a custom T5-MoE (Mixture of Experts) model to improve generalization and interpretability.
- Summarization of GO document descriptions for protein function prediction.
- Multi-model support: T5, TinyLlama, and MoE-enhanced variants.
- Clean, modular training scripts for each architecture.
- Preprocessing utilities for GO term documents.
- Support for ROUGE-based evaluation and dataset filtering.
GO2Sum_2.0/
├── data/ # Training, validation data
├── preprocessing/ # Scripts for preprocessing inputs
├── dataset.py # GODocDataset class for input handling
├── train_t5.py # Standard T5 training
├── train_t5_moe.py # T5 with Mixture of Experts
├── train_tinyllama.py # TinyLlama training script
├── generate_go_doc.py # Generate GO document from GO term list
├── requirements.txt # Python dependencies
├── check_conda_versions.sh # Utility to verify conda versions
├── README.md # This file
└── ...
- Clone the repository:
git clone https://github.com/SwagarikaGiri/GO2Sum_2.0.git
cd GO2Sum_2.0
- Create and activate a Conda environment:
conda create -n go2sum_env python=3.10
conda activate go2sum_env
- Install the required Python packages:
pip install -r requirements.txt
Required files under data/:
- train_dataset_cleaned.tsv
- val_dataset_cleaned.tsv
- gene_ontology.obo
Each row in the TSV files must contain:
- input: GO term list
- target: Human-readable functional summary
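As a quick sanity check before training, the column layout described above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: it assumes the TSV files start with a header row naming the `input` and `target` columns, and the helper name `check_tsv` is hypothetical.

```python
import csv

REQUIRED_COLUMNS = ("input", "target")  # column names as listed above


def check_tsv(path):
    """Return the number of data rows; raise ValueError on a malformed file.

    Assumption (not confirmed by the repository): the first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"{path}: missing column(s) {missing}")
        count = 0
        # Row numbers start at 2 because row 1 is the assumed header.
        for row_no, row in enumerate(reader, start=2):
            for column in REQUIRED_COLUMNS:
                # DictReader fills absent trailing fields with None.
                if not (row[column] or "").strip():
                    raise ValueError(f"{path}: row {row_no}: empty '{column}' field")
            count += 1
    return count
```

Under those assumptions, `check_tsv("data/train_dataset_cleaned.tsv")` would return the number of training examples or point at the first offending row.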
- T5 model:
python train_t5.py
- T5 with Mixture of Experts:
python train_t5_moe.py
- TinyLlama model:
python train_tinyllama.py
Generate a textual description (GO document) from a list of GO terms using:
python generate_go_doc.py
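For orientation, the idea of turning a GO term list into a document can be sketched as follows. This is an illustration only and not the logic of `generate_go_doc.py`: it assumes the standard OBO stanza layout (`[Term]` blocks with `id:`, `name:`, and `def:` tags) in `gene_ontology.obo`, and the names `parse_obo` and `build_go_doc` are hypothetical.

```python
import re

# The def: tag holds a quoted string followed by source references,
# e.g.  def: "Some definition text." [GOC:xyz]
DEF_TEXT = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_obo(lines):
    """Map each GO id to {'name': ..., 'def': ...} from an iterable of OBO lines."""
    terms = {}
    current = None  # entry for the [Term] stanza being read, else None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": "", "def": ""} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "def":
                match = DEF_TEXT.match(value)
                current["def"] = match.group(1) if match else value
    return terms


def build_go_doc(go_ids, terms):
    """Join 'name: definition' strings for the known ids, in input order."""
    parts = []
    for go_id in go_ids:
        entry = terms.get(go_id)
        if entry is None:
            continue  # unknown id: silently skipped in this sketch
        parts.append(f"{entry['name']}: {entry['def']}")
    return " ".join(parts)
```

A file handle can be passed directly, for example `parse_obo(open("data/gene_ontology.obo", encoding="utf-8"))`. How the actual script orders, filters, or formats terms may differ.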
You can compute ROUGE scores using:
python train_casual_rouge.py
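To make the metric concrete, ROUGE-1 measures unigram overlap between a generated summary and its reference. The following is a minimal pure-Python sketch of ROUGE-1 precision, recall, and F1, assuming lowercase whitespace tokenization. It is not the repository's evaluation code, which may use a different tokenizer, stemming, or other ROUGE variants; `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 for two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For instance, comparing the candidate "binds ATP" against the reference "binds ATP and DNA" matches both candidate tokens, giving precision 1.0 and recall 0.5.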
Verify conda environment versions:
bash check_conda_versions.sh
Contributions and feature suggestions are welcome!
- Fork the repository
- Create a new branch:
git checkout -b feature/my-feature
- Commit your changes
- Push to the branch
- Open a pull request
This project is licensed under the MIT License.
Swagarika Giri
PhD Candidate, Purdue University