Interpretable prediction of nucleic acid-binding proteins using a protein language model
Hanjin Kim, Sung-Gwon Lee, Jooseong Oh, Kee K. Kim, Eun-Mi Kim, and Chungoo Park
Accurate identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) is critical for elucidating transcriptional and post-transcriptional regulatory mechanisms. However, existing computational approaches often rely on inferred labels or domain-specific annotations, which limits their generalizability. This study therefore introduces transformer-based classifiers for human DBPs and RBPs that rely solely on protein sequence information, without engineered features or domain constraints. The models were implemented using ESM-2 with low-rank adaptation (LoRA) fine-tuning and trained on experimentally validated datasets, including chromatin immunoprecipitation sequencing (ChIP-seq) annotations for DBPs and eCLIP annotations for RBPs. To evaluate biological relevance, we computed value-aware attention (VAT) scores aggregated across transformer layers to interpret model focus. In 20-fold cross-validation, the DBP model achieved an area under the receiver operating characteristic curve (AUROC) of 0.84 with a Matthews correlation coefficient (MCC) of 0.40, while the RBP model achieved an AUROC of 0.92 with an MCC of 0.46. Proteins predicted as nucleic acid-binding were enriched for known binding domains, and inspection of attention distributions revealed preferential focus on annotated functional regions rather than non-binding segments. These results demonstrate that attention-based protein language models can accurately identify nucleic acid-binding proteins directly from sequence data. Moreover, these models reveal biologically meaningful sequence determinants of binding, establishing an interpretable and scalable framework for proteome-wide characterization of protein–nucleic acid interactions.
DNA-binding proteins; RNA-binding proteins; protein language model; ESM-2; deep learning; attention mechanism
This project classifies protein sequences as DNA-binding proteins (DBPs) or RNA-binding proteins (RBPs) using a transformer-based sequence model with lightweight fine-tuning and interpretable attention analysis.
- Backbone model: `facebook/esm2_t33_650M_UR50D`
- Fine-tuning: LoRA on attention projections (`query`, `key`, `value`)
- Tasks: binary classification for DBP and RBP
- Interpretability: value-aware attention (VAT) token scoring
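The repository applies LoRA via the `peft` library (see the dependency list below). The mechanism it relies on can be sketched in plain PyTorch: the pretrained projection weight is frozen and a trainable low-rank update is added on top. The rank and alpha values here are illustrative, not the project's actual hyperparameters.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B is zero-initialized, so the wrapped layer starts out identical
        # to the pretrained one and only drifts as LoRA parameters train.
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

In the actual pipeline, `peft` attaches equivalent adapters to the ESM-2 attention projections by targeting the module names `query`, `key`, and `value`, so only the adapter weights (a small fraction of the 650M parameters) are updated during fine-tuning.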
Human protein sequences (18,221 protein-coding genes; GRCh38.p12) with experimental annotations:
- DBP positives: 1,447 (ChIP-seq, microarray) from hPDI and hTFtarget
- RBP positives: 351 (eCLIP) from ENCODE
- Dual-binding proteins: 90
- Negatives: 16,774 (DBP) and 17,870 (RBP)
Raw files are under `data/raw_data/`. Cross-validation outputs are under `data/DRBP_cv_results/`.
- 20-fold cross-validation with 95% train / 5% test per fold
- Sequence length: max 1,024 residues (truncate/pad)
- Positive oversampling: 12x (DBP) and 50x (RBP)
- Threshold selection: MCC-maximizing cutoffs (DBP 0.7, RBP 0.9)
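The MCC-maximizing cutoff in the last step can be found with a simple grid search over candidate thresholds on held-out predictions. A minimal sketch (the grid and helper name are illustrative, not taken from the repository scripts):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef


def mcc_best_threshold(y_true, y_prob, grid=np.arange(0.05, 1.0, 0.05)):
    """Return the probability cutoff (and its MCC) that maximizes the
    Matthews correlation coefficient of the binarized predictions."""
    scores = [matthews_corrcoef(y_true, (y_prob >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Toy example: a cutoff between 0.6 and 0.75 separates the classes perfectly.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.75, 0.6])
threshold, mcc = mcc_best_threshold(y_true, y_prob)
```

MCC is a sensible criterion here because the negative sets are more than ten times larger than the positive sets, and accuracy-style metrics would favor trivially high cutoffs.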
Value-aware attention (VAT) scores localize to known nucleic acid-binding domains, including zinc finger and homeodomain regions for DBPs and RNA recognition motifs (RRMs) for RBPs. Domain-level enrichment and residue-level profiles show consistent attention concentration in annotated binding regions.
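A common way to make attention "value-aware" is to scale each attention weight by the norm of the attended token's value vector before aggregating. The sketch below assumes that formulation, with scores averaged over heads, query positions, and layers; the exact aggregation used in `DRBP_att_score.py` may differ.

```python
import torch


def vat_scores(attentions, values):
    """Per-token value-aware attention scores.

    attentions: list (one per layer) of [heads, seq, seq] attention maps
    values:     list (one per layer) of [heads, seq, dim] value vectors
    Returns a [seq] tensor of token scores.
    """
    per_layer = []
    for attn, v in zip(attentions, values):
        v_norm = v.norm(dim=-1)                      # [heads, seq]: ||v_j||
        weighted = attn * v_norm.unsqueeze(1)        # scale key j by ||v_j||
        per_layer.append(weighted.mean(dim=(0, 1)))  # average heads and queries
    return torch.stack(per_layer).mean(dim=0)        # average layers -> [seq]
```

With a Hugging Face ESM-2 model, the per-layer attention maps are available via `output_attentions=True`; extracting the value vectors requires hooking the attention modules, which is omitted here for brevity.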
- `src/scripts/DRBP_train.py`: train LoRA models for the DBP and RBP binary tasks (CV-based)
- `src/scripts/DRBP_predicton.py`: evaluate models and write prediction CSVs
- `src/scripts/DRBP_att_score.py`: compute VAT token scores for validation sequences
- `src/notebooks/`: data EDA and result analysis notebooks
- `data/`: raw FASTA files and cross-validation results
Training (single CV fold):

```bash
python src/scripts/DRBP_train.py
```

Evaluation (single CV fold):

```bash
python src/scripts/DRBP_predicton.py --kind D --cv 1
```

Evaluation (CV range):

```bash
python src/scripts/DRBP_predicton.py --kind R --cv-start 1 --cv-end 20
```

VAT scoring:

```bash
python src/scripts/DRBP_att_score.py --kind D --cv 1 --pad-len 40 --prefix date
```

- Predictions: `New_new_DRBP/RBP_result/*.csv` (see `DRBP_predicton.py`)
- VAT scores: `New_DRBP/*VATP_result*/`
Please document exact versions and hardware in the final submission:
- OS, CUDA, GPU model and driver
- Python version
- Package versions: `torch`, `transformers`, `peft`, `scikit-learn`, `pandas`, `tqdm`
All scripts for preprocessing, training, and evaluation are provided in this repository. Trained model weights are included here. For paper submission, add the public repository link and DOI as needed.
CP conceived, designed, and led the study and wrote the manuscript. HK and SGL contributed equally to this work, including the development of the computational framework, data analysis, and preparation of the initial draft. JO assisted with data curation. KKK and EMK provided expert consultation. All authors reviewed and approved the final manuscript.