An AI-powered system for discovering novel drug-disease relationships using Relational Graph Convolutional Networks (R-GCN) and Retrieval-Augmented Generation (RAG). This project integrates heterogeneous biomedical knowledge graphs with natural language processing to predict and explain potential drug repurposing candidates.
The system addresses the high cost and long timelines of traditional drug discovery by leveraging existing biomedical data to find "new tricks for old drugs." It predicts which drugs may treat specific diseases and provides human-interpretable explanations by citing mechanisms of action and literature.
- Graph Machine Learning: 3-layer R-GCN encoder and DistMult decoder on Hetionet (47,031 nodes, 2,250,197 edges).
- Explainable AI (XAI): RAG with LLMs to generate natural language explanations citing mechanisms of action and literature.
- Multi-Source Integration: Data from Hetionet, DrugBank, PubMed, KEGG, and Reactome.
- Scientific Paper Analysis: Specialized module for extracting chemical entities and summarizing PDF publications.
The system operates as a modular pipeline that connects deep learning on graphs with semantic search and large language models.
- Link Prediction (Deep Learning):
- The R-GCN model acts as the discovery engine. It analyzes the topology of the Hetionet graph (including gene associations and molecular functions).
- It calculates the probability of a "treats" relationship between a Compound and a Disease node.
- RAG Pipeline (Explainability):
- Once a high-probability prediction is made, the system queries a ChromaDB vector database containing DrugBank pharmacology and PubMed abstracts.
- An LLM synthesizes this evidence to provide a biological rationale.
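The scoring step above can be sketched in isolation: the R-GCN encoder produces one embedding per node, and the DistMult decoder scores a (Compound, treats, Disease) triple as a trilinear product of head, relation, and tail vectors, with a sigmoid turning the raw score into a probability. A minimal pure-Python sketch (the 4-dimensional embeddings are illustrative, not taken from a trained model):

```python
import math

def distmult_score(h, r, t):
    """DistMult: score(h, r, t) = sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def treat_probability(h, r, t):
    """Squash the raw DistMult score into a (0, 1) 'treats' probability."""
    return 1.0 / (1.0 + math.exp(-distmult_score(h, r, t)))

# Toy 4-dimensional embeddings (illustrative values only)
compound = [0.8, -0.1, 0.5, 0.2]   # Compound node embedding from the encoder
treats   = [1.2,  0.3, 0.9, 0.4]   # learned diagonal "treats" relation vector
disease  = [0.7, -0.2, 0.6, 0.1]   # Disease node embedding from the encoder

print(round(treat_probability(compound, treats, disease), 3))  # → 0.722
```

In the real model the embeddings come from the 3-layer R-GCN encoder; DistMult's diagonal relation matrix is what keeps scoring cheap enough to rank every Compound-Disease pair.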
```
Input: Drug + Disease Query
                │
                ▼
┌────────────────────────────────┐
│          R-GCN Model           │
│ (Heterogeneous Link Prediction)│──▶ Prediction Score (0.0 - 1.0)
│ - 3-Layer R-GCN Encoder        │    Confidence Assessment
│ - DistMult Scoring Decoder     │
└───────────────┬────────────────┘
                │
                ▼
┌────────────────────────────────┐
│          RAG Pipeline          │
│    (ChromaDB + OpenAI/LLM)     │
├────────────────────────────────┤
│ 1. Vector Search (DrugBank)    │──▶ Biological Context:
│ 2. Lit. Search (PubMed)        │    - Mechanisms of Action
│ 3. Pathway Mapping (KEGG)      │    - Pathway Intersections
└───────────────┬────────────────┘    - Literature Citations
                │
                ▼
       Human-Interpretable
      Discovery Explanation
```
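The vector-search step of the RAG pipeline can be sketched with a toy retriever. A real deployment embeds DrugBank and PubMed text with a sentence encoder and queries ChromaDB; here, to stay self-contained, bag-of-words cosine similarity stands in for learned embeddings, and the document snippets are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a sentence encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy evidence store standing in for the ChromaDB index (invented snippets)
documents = {
    "drugbank:metformin": "metformin activates AMPK and reduces hepatic glucose production",
    "pubmed:12345":       "AMPK activation suppresses tumor growth in preclinical models",
    "drugbank:aspirin":   "aspirin irreversibly inhibits COX-1 and COX-2",
}

def retrieve(query: str, k: int = 2):
    """Return the IDs of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(documents[d])), reverse=True)
    return ranked[:k]

print(retrieve("AMPK activation in cancer"))
```

The retrieved snippets are what the LLM sees when it writes the biological rationale, so retrieval quality directly bounds explanation quality.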
The "Paper Analysis" feature is a specialized sub-system designed to ingest unstructured research papers and output structured chemical intelligence.
- Extraction: Uses `pdfplumber` and `PyPDF2` to extract text from multi-column scientific layouts.
- Entity Recognition: Implements a regex-based Chemical Entity Recognition (CER) algorithm to find drug names and IUPAC nomenclature (e.g., words ending in `-ine`, `-ide`, `-ate`).
- Frequency Mapping: Scores importance based on mention frequency throughout the document.
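The suffix-based CER pass plus frequency mapping can be sketched in a few lines. This is a simplified stand-in for the fuller rule set (real IUPAC handling needs many more patterns); the suffix list and sample sentence are illustrative:

```python
import re
from collections import Counter

# Heuristic chemical-name suffixes (simplified; assumption for illustration)
CHEM_PATTERN = r"\b[A-Za-z]{3,}(?:ine|ide|ate)\b"

def extract_chemicals(text: str) -> Counter:
    """Find candidate chemical mentions and rank them by mention frequency."""
    mentions = re.findall(CHEM_PATTERN, text, flags=re.IGNORECASE)
    return Counter(m.lower() for m in mentions)

sample = ("Caffeine and theobromine were compared; caffeine showed "
          "stronger binding than the chloride control.")
print(extract_chemicals(sample).most_common(2))  # → [('caffeine', 2), ('theobromine', 1)]
```

Frequency acts as a cheap salience signal: a compound mentioned throughout the paper is more likely to be its subject than one cited once in passing.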
For every extracted chemical, the system runs the following:
- Identity Validation: Queries the PubChem API via `pubchempy` to verify the entity and fetch its SMILES string.
- 2D Rendering: Uses RDKit to generate high-resolution molecular structure diagrams.
- Descriptor Calculation: Computes ADME/Tox-relevant properties:
- Molecular Weight (MW)
- Lipophilicity (LogP)
- Polar Surface Area (TPSA)
- H-Bond Donors/Acceptors
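A common downstream use of these descriptors is a Lipinski rule-of-five screen. In the pipeline the values would come from RDKit (e.g., `Descriptors.MolWt`, `Descriptors.MolLogP`, `Descriptors.TPSA`, `Lipinski.NumHDonors`/`NumHAcceptors`); the sketch below applies the screen to already-computed numbers, with the aspirin-like values shown for illustration:

```python
def lipinski_violations(desc: dict) -> int:
    """Count rule-of-five violations: MW > 500, LogP > 5, HBD > 5, HBA > 10."""
    return sum([
        desc["mw"] > 500,     # molecular weight (Da)
        desc["logp"] > 5,     # lipophilicity
        desc["hbd"] > 5,      # hydrogen-bond donors
        desc["hba"] > 10,     # hydrogen-bond acceptors
    ])

# Illustrative descriptor values for an aspirin-like small molecule
aspirin_like = {"mw": 180.16, "logp": 1.19, "tpsa": 63.6, "hbd": 1, "hba": 4}
print(lipinski_violations(aspirin_like))  # → 0, i.e., orally drug-like
```

Zero or one violation is the conventional threshold for oral drug-likeness, which makes the count a quick triage signal for repurposing candidates.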
- Contextual Slicing: The system extracts the first and last 4,000 characters (typically the Abstract/Introduction and the Results/Conclusion) to stay within LLM token limits while preserving context.
- Structured Prompting: The LLM is instructed to return a JSON-structured summary covering:
- Objective: The core research question.
- Methods: Experimental or computational approaches used.
- Key Findings: Major results and statistical significance.
- Clinical Relevance: Potential for drug repurposing or therapeutic use.
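The slicing and prompting steps above can be sketched together. The 4,000-character limit comes from the text; the prompt template and field names are a hypothetical skeleton mirroring the summary schema, not the project's exact prompt:

```python
MAX_SLICE = 4000  # characters kept from each end of the paper

def slice_for_context(text: str) -> str:
    """Keep the head (Abstract/Intro) and tail (Results/Conclusion) only."""
    if len(text) <= 2 * MAX_SLICE:
        return text
    return text[:MAX_SLICE] + "\n[...]\n" + text[-MAX_SLICE:]

# Hypothetical prompt skeleton; keys mirror the JSON summary schema above
PROMPT_TEMPLATE = """Summarize the paper excerpt below as JSON with keys
"objective", "methods", "key_findings", "clinical_relevance".

Excerpt:
{excerpt}"""

def build_prompt(paper_text: str) -> str:
    return PROMPT_TEMPLATE.format(excerpt=slice_for_context(paper_text))

long_paper = "A" * 10_000
print(len(slice_for_context(long_paper)))  # → 8007 (4000 + separator + 4000)
```

Asking for JSON keyed by a fixed schema lets the dashboard parse the LLM's answer with `json.loads` instead of scraping free text.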
```bash
# Environment Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -e .

# Configure API Keys in .env
echo "OPENAI_API_KEY=your_key" > .env
echo "NCBI_EMAIL=your_email@example.com" >> .env
```
Full Workflow Execution: refer to the SETUP GUIDE.
- Prepare Data: `python3 scripts/download_data.py && python3 scripts/build_knowledge_graph.py`
- Train Discovery Model: `python3 scripts/train_model.py --epochs 50 --device cuda`
- Index Knowledge Base: `python3 scripts/index_documents.py --drugbank path/to/db.xml --pubmed`
- Run Dashboard: `streamlit run app/streamlit_app.py`
- `app/`: Streamlit UI and "Paper Analysis" dashboard.
- `src/models/`: R-GCN architecture and DistMult scoring implementation.
- `src/paper_analysis/`: Modules for PDF parsing, CER, and RDKit visualization.
- `src/rag/`: Vector database management and LLM prompt engineering.
- `data/`: Stores Hetionet JSON, processed `.pt` graph files, and ChromaDB indices.
- Hetionet: Himmelstein, et al. (2017) eLife.
- R-GCN: Schlichtkrull, et al. (2018) ESWC.
- DrugBank: Wishart DS, et al. Nucleic Acids Res.
- PubMed: National Center for Biotechnology Information (NCBI).
