Master's Thesis Research | Department of Computer Engineering, Aligarh Muslim University
Author: Hasan Shaikh | Supervisor: Prof. Rashid Ali
- Overview
- Key Results
- How It Works
- Dataset
- Detailed Results
- Getting Started
- Project Structure
- Future Work
- Citation
- Contact
- FAQs
The Challenge: Traditional cancer survival models struggle to integrate heterogeneous medical data (clinical records, genomics, gene expression) and often miss high-risk patients.
Our Solution: A two-stage deep learning pipeline that:
- Learns modality-specific features using Gated Attention CNNs
- Fuses information via Random Forest ensemble
- Achieves 91.2% accuracy and identifies 79.8% of high-risk patients
Why It Matters: Current methods identify only ~22-45% of high-risk patients. Our approach nearly doubles sensitivity while maintaining high precision, which is critical for clinical deployment.
| Metric | Our Method | Best Competitor | Improvement |
|---|---|---|---|
| Accuracy | 91.2% | 90.2% (Stacked RF) | +1.0% |
| AUC | 0.950 | 0.930 (Stacked RF) | +0.020 |
| Sensitivity | 79.8% | 74.7% (Stacked RF) | +5.1% |
| Precision | 84.1% | 84.1% (Stacked RF) | Tied |
Key Strength: Our method identifies 8 out of 10 high-risk patients, compared to only 2-4 out of 10 for traditional approaches.
| Method | Accuracy | Sensitivity | AUC | Clinical Utility |
|---|---|---|---|---|
| Our Method | 91.2% | 79.8% | 0.950 | Highest |
| Stacked RF [12] | 90.2% | 74.7% | 0.930 | High |
| MDNNMD [18] | 82.6% | 45.0% | 0.845 | Moderate |
| SVM [21] | 80.5% | 36.5% | 0.810 | Moderate |
| Random Forest [20] | 79.1% | 22.6% | 0.801 | Low |
| Logistic Reg. [19] | 76.0% | 18.3% | 0.663 | Low |
Impact: Traditional ML methods miss 75-80% of high-risk patients. Ours misses only 20%.
How to read: Our method (highlighted in gold) achieves top performance across all four metrics. Notice particularly the sensitivity bars (green): our method significantly outperforms all competitors in identifying high-risk patients.
How to read: Darker green = better performance. Our method (top row) shows consistently dark green across all metrics, demonstrating superior and balanced performance.
Why This Matters:
- Traditional Random Forest: Identifies only 2 out of 10 high-risk patients (22.6%)
- Our Method: Identifies 8 out of 10 high-risk patients (79.8%)
- Clinical Impact: Nearly 4× improvement in catching high-risk cases, critical for timely intervention
Ablation Study Results:
Single Modality Performance:
├── Clinical Only: 81.3% accuracy, 41.3% sensitivity
├── Gene Expression: 84.1% accuracy, 50.5% sensitivity
└── CNA Only: 89.3% accuracy, 70.2% sensitivity
Multimodal (Combined): 91.2% accuracy, 79.8% sensitivity (+9.6% sensitivity gain over the best single modality)
Insight: Each data type captures different aspects of cancer biology. Combining them provides a complete picture.
Our approach consists of two sequential stages:
Stage 1: Modality-Specific Feature Extraction
- Three independent Gated Attention CNNs process each data type
- Each CNN learns optimal representations for its modality
- Extract 50 (clinical), 525 (expression), 200 (CNA) features
Stage 2: Ensemble Learning
- Concatenate all 775 features
- Train Random Forest with 200 trees
- Predict survival outcome (high-risk vs low-risk)
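The two-stage flow above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic arrays: the feature widths (50/525/200) and the balanced 200-tree forest come from the configuration described here; the random data and sizes are placeholders, not the actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_patients = 100  # placeholder; the METABRIC cohort has 1,980 patients

# Stage 1 output: modality-specific features from the three Gated Attention CNNs
clinical_feats = rng.normal(size=(n_patients, 50))
expr_feats = rng.normal(size=(n_patients, 525))
cna_feats = rng.normal(size=(n_patients, 200))

# Stage 2: concatenate to 775 features and train the ensemble
X = np.hstack([clinical_feats, expr_feats, cna_feats])  # shape (n_patients, 775)
y = rng.integers(0, 2, size=n_patients)                 # 1 = high-risk (<5 years)

rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(X, y)
risk_scores = rf.predict_proba(X)[:, 1]  # predicted probability of high-risk
```

`class_weight="balanced"` mirrors the "200 balanced Random Forest trees" in the configuration table, re-weighting the minority high-risk class during training.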
Feature Transformation at Each Stage:
- Clinical: 25 → 50 features (2.0× expansion)
- Gene Expression: ~400 → 525 features (~1.3× expansion)
- CNA: ~200 → 200 features (1.0×, maintained)
- Combined: 775 total features for ensemble prediction
Key Observation: the Clinical CNN doubles its feature count (2.0×), the Gene Expression CNN expands it (~1.3×), and the CNA CNN maintains dimensionality (1.0×)
What makes it special?
Traditional CNNs treat all features equally. Our Gated Attention mechanism learns to:
- Focus on important features
- Suppress irrelevant information
- Adapt to each data modality
How it works:
Input Features → Conv1D → [Gate₁ ⊙ Features] → MaxPool → Learned Features
                        ↘ [Gate₂ ⊙ Features] ↗
(⊙ = element-wise multiply)
Benefits:
- ✓ Better feature selection
- ✓ Reduced overfitting
- ✓ Improved interpretability
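The gating idea above reduces to an element-wise multiply between a feature path and a learned gate. A rough numpy sketch of that one operation (the weights `w_feat`/`w_gate` are hypothetical stand-ins; in the real model these live inside Conv1D layers, and the ReLU gate follows the activation listed in the configuration table below):

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_features(x, w_feat, w_gate):
    """Element-wise gating: the gate path learns per-feature weights that
    amplify informative features and suppress irrelevant ones."""
    feats = np.tanh(x @ w_feat)         # feature path
    gate = np.maximum(0.0, x @ w_gate)  # gate path (ReLU, per the config table)
    return gate * feats                 # element-wise multiply (the ⊙ above)

x = rng.normal(size=(4, 25))            # e.g. 4 patients, 25 clinical features
w_feat = rng.normal(size=(25, 50)) * 0.1
w_gate = rng.normal(size=(25, 50)) * 0.1
out = gated_features(x, w_feat, w_gate)  # shape (4, 50)
```

Wherever the gate's ReLU outputs zero, the corresponding feature is fully suppressed regardless of the feature path, which is how irrelevant information gets filtered out.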
| Component | Configuration |
|---|---|
| Input Processing | Reshape to (N, features, 1) for Conv1D |
| Multi-Branch Conv | Parallel kernels (k=1, k=2) |
| Gating Paths | Two gates per branch (k=1, k=3) |
| Activation | ReLU for gates, tanh for dense layers |
| Regularization | L2 (0.001) + Dropout (0.25) |
| Dense Layers | 150 → 100 → 50 neurons |
| Final Ensemble | 200 balanced Random Forest trees |
| Training | 10-fold stratified CV, 25 epochs |
METABRIC (Molecular Taxonomy of Breast Cancer International Consortium)
| Attribute | Details |
|---|---|
| Size | 1,980 patients |
| Modalities | Clinical (25 features) + Gene Expression (~400) + CNA (~200) |
| Outcome | Binary: ≥5 years survival (low-risk) vs <5 years (high-risk) |
| Source | cBioPortal |
Data Types:
- Clinical: Age, tumor size, grade, ER/PR/HER2 status, lymph node involvement
- Gene Expression: mRNA levels (discretized: -1, 0, +1)
- Copy Number Alterations (CNA): Chromosomal gains/losses
Preprocessing:
- Missing values: Median imputation
- Clinical features: Z-score normalization
- Validation: 10-fold stratified cross-validation
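The preprocessing steps above (median imputation, z-scoring, binary outcome) can be sketched in plain numpy. The tiny arrays, column layout, and the 60-month cutoff for "5 years" are illustrative assumptions, not the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(8, 3))  # 8 patients, 3 clinical features
X[0, 1] = np.nan                                   # simulate one missing value

# Missing values: median imputation, per feature column
col_medians = np.nanmedian(X, axis=0)
X = np.where(np.isnan(X), col_medians, X)

# Clinical features: z-score normalization
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Outcome: binary label from survival time (assuming months; 60 months = 5 years)
survival_months = np.array([72.0, 30.0, 12.0, 95.0, 61.0, 48.0, 150.0, 20.0])
high_risk = (survival_months < 60).astype(int)  # 1 = <5 years = high-risk
```

In the actual pipeline these statistics would be computed on the training folds only and applied to the held-out fold, so the 10-fold cross-validation stays leak-free.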
Four-Panel Analysis:
- Top-Left (Accuracy): CNA provides best single-modality accuracy (89.3%), but combining all modalities reaches 91.2% (+1.9%)
- Top-Right (Precision): CNA and multimodal tied at 84.1%, showing genomic features drive precision
- Bottom-Left (Sensitivity): Largest improvement from multimodal integration (+9.6% over best single modality)
- Bottom-Right (AUC): Gene Expression has best single-modality AUC (0.923), multimodal reaches 0.950 (+0.027)
Critical Insight: Each modality excels at different aspects; combining them leverages all strengths.
Final Model (10-Fold Cross-Validation):
| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 91.2% | Correctly predicts 9 out of 10 patients |
| AUC | 0.950 | Excellent discrimination (>0.9 is outstanding) |
| Sensitivity | 79.8% | Identifies 8 out of 10 high-risk patients |
| Precision | 84.1% | 84% of predicted high-risk are truly high-risk |
Individual Modality Performance:
| Modality | Accuracy | Precision | Sensitivity | AUC | Key Strength |
|---|---|---|---|---|---|
| Clinical Only | 81.3% | 71.2% | 41.3% | 0.834 | Interpretable, readily available |
| CNA Only | 89.3% | 84.1% | 70.2% | 0.850 | Genomic instability markers |
| Gene Expression Only | 84.1% | 77.9% | 50.5% | 0.923 | Molecular pathway information |
| All Combined | 91.2% | 84.1% | 79.8% | 0.950 | Complementary information |
Key Findings:
- CNA has highest single-modality accuracy (89.3%)
- Gene Expression has best single-modality AUC (0.923)
- Clinical provides a strong baseline with easy-to-collect data
- Multimodal beats all single modalities across every metric
Improvement Over Baselines:
Absolute Improvement Over Competing Methods:
| Baseline | Accuracy Gap | Sensitivity Gap | AUC Gap |
|---|---|---|---|
| Logistic Regression | +15.2% | +61.5% | +0.287 |
| Random Forest | +12.1% | +57.2% | +0.149 |
| SVM | +10.7% | +43.3% | +0.140 |
| MDNNMD (DL) | +8.6% | +34.8% | +0.105 |
| Stacked RF (SOTA) | +1.0% | +5.1% | +0.020 |
Clinical Impact:
- Traditional Random Forest: Identifies only 23% of high-risk patients
- Our Method: Identifies 80% of high-risk patients
- Result: 3.5× improvement in sensitivity, critical for patient outcomes
Sensitivity Improvement Breakdown:
Clinical → CNA: +28.9 points (41.3% → 70.2%)
CNA → Multimodal: +9.6 points (70.2% → 79.8%)
Clinical → Multimodal: +38.5 points (41.3% → 79.8%) (largest gain)
Why This Matters:
- Every percentage point = more lives saved
- 79.8% sensitivity means 4 out of 5 high-risk patients get timely intervention
- Only 1 out of 5 slips through (vs. 3-4 out of 5 for traditional methods)
Prerequisites:
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- Optional: NVIDIA GPU for faster training
Quick Setup:
# 1. Clone repository
git clone https://github.com/hash123shaikh/Master-Thesis-Work.git
cd Master-Thesis-Work
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Verify installation
python test_installation.py
Expected Output:
✓ TensorFlow 2.8.0
✓ scikit-learn 1.2.2
✓ NumPy 1.23.5
✓ Installation successful!
Before running, update file paths in each script to match your system:
Example for code/GaAtCNN_cln.py (line ~37):
# Change from:
dataset_clinical = numpy.loadtxt("F:/Dissertations/.../METABRIC_clinical_1980.txt", delimiter="\t")
# To your path:
dataset_clinical = numpy.loadtxt("code/Data/METABRIC/METABRIC_clinical_1980.txt", delimiter="\t")
Repeat for all 4 scripts: GaAtCNN_cln.py, GaAtCNN_cnv.py, GaAtCNN_expr.py, RF.py
Complete Workflow (4 Scripts):
# Stage 1: Train individual CNNs (can run in parallel)
python code/GaAtCNN_cln.py   # ~15-20 min → results/gatedAtnClnOutput.csv
python code/GaAtCNN_cnv.py   # ~20-30 min → results/gatedAtnCnvOutput.csv
python code/GaAtCNN_expr.py  # ~30-45 min → results/gatedAtnExpOutput.csv
# Stage 2: Train ensemble
python code/RF.py            # ~5-10 min  → Final predictions + metrics
# Total time: 70-105 minutes (depending on hardware)
What Happens:
1. Stage 1 (Feature Extraction):
   - Each CNN trains on its modality using 10-fold CV
   - Extracts 50/525/200 features from the penultimate layer
   - Saves features to CSV for Stage 2
2. Stage 2 (Ensemble):
   - Loads all extracted features (775 total)
   - Trains Random Forest with 10-fold CV
   - Outputs predictions, metrics, and ROC curves
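The headline metrics that Stage 2 reports (accuracy, sensitivity, precision) all follow from the confusion matrix. A small self-contained sketch with made-up labels (not real model output) showing how each is computed:

```python
import numpy as np

# Hypothetical labels: 1 = high-risk, 0 = low-risk
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # high-risk patients caught
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # high-risk patients missed
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarms
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (tp + tn) / len(y_true)  # → 0.8
sensitivity = tp / (tp + fn)        # → 0.75: share of high-risk identified
precision = tp / (tp + fp)          # → 0.75: share of flagged that are truly high-risk
```

Sensitivity is the clinically decisive number here: it is exactly the "8 out of 10 high-risk patients identified" figure quoted throughout this README.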
Output Files:
results/
├── gatedAtnClnOutput.csv         # Clinical features (1980 × 50)
├── gatedAtnCnvOutput.csv         # CNV features (1980 × 200)
├── gatedAtnExpOutput.csv         # Expression features (1980 × 525)
├── clinical_gated_attention.png  # Model architecture diagram
├── roc_curve_clinical.png        # Clinical CNN ROC
├── roc_curve_cnv.png             # CNV CNN ROC
├── roc_curve_expression.png      # Expression CNN ROC
└── roc_curve_ensemble.png        # Final ensemble ROC (AUC: 0.950)
Master-Thesis-Work/
│
├── code/
│   ├── Data/METABRIC/
│   │   ├── METABRIC_clinical_1980.txt    # Clinical features
│   │   ├── METABRIC_cnv_1980.txt         # Copy number alterations
│   │   └── METABRIC_gene_exp_1980.txt    # Gene expression
│   │
│   ├── GaAtCNN_cln.py                    # Train clinical CNN
│   ├── GaAtCNN_cnv.py                    # Train CNV CNN
│   ├── GaAtCNN_expr.py                   # Train expression CNN
│   └── RF.py                             # Train ensemble
│
├── docs/
│   ├── Hasan_MTech_Dissertation_PPT.pdf  # Presentation slides
│   └── Hasan_Dissertation_Report.pdf     # Full thesis
│
├── results/                              # Generated outputs (not in repo)
│   ├── *.csv                             # Extracted features
│   └── figures/                          # Plots and diagrams
│
├── requirements.txt                      # Python dependencies
├── test_installation.py                  # Verify setup
├── .gitignore                            # Exclude generated files
├── LICENSE                               # MIT License
├── EXECUTIVE_SUMMARY.md                  # One-page overview
├── INSTALLATION.md                       # Detailed setup guide
├── RESULTS_SUMMARY.md                    # Complete results tables
└── README.md                             # This file
Key Scripts:
| Script | Purpose | Input | Output | Time |
|---|---|---|---|---|
| GaAtCNN_cln.py | Extract clinical features | 25 features | 50 features | ~15-20 min |
| GaAtCNN_cnv.py | Extract CNA features | ~200 features | 200 features | ~20-30 min |
| GaAtCNN_expr.py | Extract expression features | ~400 features | 525 features | ~30-45 min |
| RF.py | Final ensemble prediction | 775 features | Predictions + metrics | ~5-10 min |
Medium-Term:
- Additional cancer types (lung, prostate, colorectal)
- Survival analysis (time-to-event modeling)
- SHAP explainability analysis
- Hyperparameter optimization (Optuna)
- External validation (TCGA, independent cohorts)
Long-Term Vision:
- Clinical deployment (REST API)
- Multi-institutional validation
- Histopathology image integration
- Transfer learning from pan-cancer models
- Federated learning for privacy-preserving collaboration
If you use this work, please cite:
@mastersthesis{shaikh2023multimodal,
title={Multimodal Data Analytics for Predicting the Survival of Cancer Patients},
author={Shaikh, Hasan},
year={2023},
school={Aligarh Muslim University},
type={Master's Thesis},
department={Computer Engineering},
supervisor={Ali, Rashid}
}
Related Publications:
- Sun, D., et al. (2018). MDNNMD: Multidimensional deep neural network for survival prediction. BMC Bioinformatics, 19(1), 1-13.
- Arya, N., & Saha, S. (2022). Multi-modal classification for breast cancer prognosis. Scientific Reports, 12(1), 1-13.
- Curtis, C., et al. (2012). Genomic architecture of 2,000 breast tumours. Nature, 486(7403), 346-352.
Hasan Shaikh
M.Tech Student, Computer Engineering
Aligarh Muslim University, India
Email: hasanshaikh3198@gmail.com
LinkedIn: https://linkedin.com/in/hasann-shaikh
GitHub: @hash123shaikh
Supervisor:
Prof. Rashid Ali
Department of Computer Engineering, AMU
Email: rashidali@zhcet.ac.in
Q: "FileNotFoundError: No such file or directory"
A: Update file paths in scripts. See Usage section for details.
Q: "ValueError: could not convert string to float"
A: Check the delimiter: the raw input data files are tab-separated (\t), while the feature CSVs read by RF.py are comma-separated (,)
Q: "ModuleNotFoundError: No module named 'tensorflow'"
A: Activate venv and install: pip install tensorflow==2.8.0
Q: "Scripts run but no CSV files created"
A: Check path variable in scriptsβmust point to writable results/ directory
Q: "Results differ from paper"
A: Normal due to GPU non-determinism. Run multiple times and average.
- Check README for solutions
- Review thesis PDF (docs/Hasan_Dissertation_Report.pdf)
- Open GitHub issue with error details
- Email author for complex problems
- METABRIC Consortium for public dataset access
- cBioPortal for data hosting infrastructure
- Department of Computer Engineering, AMU for computational resources
- Prof. Rashid Ali for supervision and guidance
- Open-source community (TensorFlow, scikit-learn) for excellent tools
MIT License - Free for academic and research use.
Attribution required - Please cite the thesis when using this code.
See LICENSE file for complete terms.
- Dataset: METABRIC on cBioPortal
- TensorFlow Docs: tensorflow.org
- scikit-learn Guide: scikit-learn.org
- Full Thesis: docs/Hasan_Dissertation_Report.pdf






