A Research-Oriented Portfolio of Applied Predictive Modeling
Welcome to my centralized repository of machine learning classification projects. This collection highlights end-to-end workflows spanning:
- Exploratory Data Analysis (EDA)
- Feature engineering & preprocessing pipelines
- Model development & comparison
- Hyperparameter tuning
- Evaluation & interpretation
Each project folder is self-contained, reproducible, and written in a clear, research-oriented style suitable for academic, portfolio, and professional use.
Each subdirectory typically includes:
- Jupyter Notebooks – full workflow (EDA → preprocessing → modeling → evaluation)
- Saved Models – pipelines, tuned models (`.joblib`, `.pkl`)
- Dataset – or reference links for external datasets
- Documentation – project-specific `README.md`
| Project Directory | Description |
|---|---|
| adultCensusIncome/ | Predicting whether an individual earns >50K using the UCI Adult dataset |
| IEE-Fraud-Detection/ | Large-scale transaction fraud detection using IEEE-CIS dataset |
| breastCancerWisconsin/ | Classifying malignant vs benign tumors |
| bankDataSet/ | Predicting subscription to a bank term deposit |
| churnModel/ | Customer churn prediction |
| titanicModel/ | Titanic survival prediction |
| heartDiseaseUCIModel/ | Heart disease classification |
| wineQualityModel/ | Red wine quality prediction |
Below are 3 of the most complete, research-oriented projects in this repository. Each showcases advanced EDA, custom preprocessing, and model experimentation.
Goal: Predict whether a person earns >50K using demographic & socioeconomic features. Dataset: UCI Adult Census Income (32k rows, mixed categorical & numeric)
- Detailed EDA (imbalance, missingness patterns, heavy skew in capital gain/loss)
- Dual preprocessing tracks for tree models vs linear models
- Outlier handling using IQR & Z-score rules
- Trained 10+ models (Logistic Regression, SVC, LightGBM, CatBoost, RF, KNN, etc.)
- Saved modular pipelines for each model
- F1: 0.6253
- Accuracy: 0.8397
- Solid performance given the class imbalance (F1 is the more informative metric here; accuracy alone overstates quality)
📂 Directory: Classification/adultCensusIncome/
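The IQR and Z-score outlier rules mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the fence multiplier `k=1.5`, the z-score thresholds, and the sample `hours` column are all assumptions chosen for the example.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Flag values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


# Hypothetical skewed numeric column with one extreme value.
hours = [38, 40, 40, 40, 42, 45, 50, 99]

print(iqr_outliers(hours))                    # only the last value is flagged
print(zscore_outliers(hours))                 # nothing exceeds |z| > 3
print(zscore_outliers(hours, threshold=2.0))  # only the last value is flagged
```

On a small, skewed sample the extreme value inflates the standard deviation, so the default `|z| > 3` rule can miss a point that the IQR fences catch. That is one reason to apply both rules rather than either one alone.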
Goal: Detect fraudulent transactions using the massive IEEE-CIS dataset (~1M rows). Challenge: Heavy imbalance, high dimensionality, complex categoricals.
- Robust EDA with Spearman correlation (sampled due to dataset size)
- Feature reduction using correlation threshold (|ρ| > 0.9)
- Skew-based feature grouping → custom preprocessing pipelines:
  - Standard scaling
  - Robust scaling
  - Power transform (Yeo-Johnson)
- Advanced categorical encoders (Ordinal vs Target Encoding based on model family)
- Trained multiple models (RF, XGBoost, AdaBoost, GBoost, NB, KNN, etc.)
- Accuracy: 0.9817
- F1 Score: 0.6642
- ROC-AUC: 0.9468
Includes learning curve & feature importance analysis.
📂 Directory: Classification/IEE-Fraud-Detection/
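The correlation-threshold feature reduction described above can be sketched as follows. This is a pure-Python illustration under stated assumptions: the greedy "keep the first column seen, drop later ones that correlate with it" policy, the helper names, and the toy `features` data are hypothetical, and the sampling step used in the project is omitted.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |rho| <= threshold against every column kept so far."""
    kept = []
    for name in columns:
        if all(abs(spearman(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy features; "amount_log" is a monotone transform of "amount".
features = {
    "amount": [10.0, 25.0, 3.0, 80.0, 41.0, 7.0],
    "amount_log": [2.3, 3.2, 1.1, 4.4, 3.7, 1.9],
    "hour": [9, 9, 22, 14, 3, 18],
}

print(drop_correlated(features))  # ['amount', 'hour']
```

Because Spearman works on ranks, any monotone transform of a feature is treated as fully redundant, which makes it a reasonable choice for heavily skewed transaction data. Which member of a correlated pair to keep is a design choice; the first-seen rule here is only one option.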
Goal: Classify tumors as benign or malignant. A well-structured project with clean pipeline and strong evaluation.
- Preprocessing pipeline with StandardScaler + imputation
- Multiple models trained & compared
- Balanced dataset → focus on precision/recall tradeoffs
- Visualizations: correlation matrix, distributions, pairplots, classifier curves
- Accuracy: 0.9649
- Precision: 0.9857
- Recall: 0.9583
- F1 Score: 0.9718
📂 Directory: Classification/breastCancerWisconsin/
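For readers who want to see how the four reported metrics relate, here is a small worked example. The confusion-matrix counts are hypothetical: they were picked because they reproduce the figures above to four decimals, but the project's actual confusion matrix is not stated here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts (114 samples), NOT taken from the project notebooks.
metrics = classification_metrics(tp=69, fp=1, fn=3, tn=41)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9649
# precision: 0.9857
# recall: 0.9583
# f1: 0.9718
```

With counts like these, precision is higher than recall because there are fewer false positives than false negatives. That gap is the precision/recall tradeoff the project examines.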
These projects are smaller or exploratory, included for completeness:
- Predict term deposit subscription. Includes SMOTE-based balancing, model comparison, and SHAP interpretation.
- Classic ML task with feature engineering (family size, title extraction).
- Baseline pipeline complete → tuning pending.
- EDA done → modeling next.
- Combined pipeline for regression → classification setup optional.
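The family-size and title-extraction features mentioned for the Titanic project can be sketched like this. The column names (`Name`, `SibSp`, `Parch`) follow the public Kaggle Titanic data and are an assumption about this project's inputs; the regular expression and helper names are illustrative rather than the notebook's own.

```python
import re

# Titles sit between the comma after the surname and the next period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the honorific from a 'Surname, Title. Given names' string."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children plus the passenger."""
    return sibsp + parch + 1


# Sample rows in the Kaggle column format.
passengers = [
    {"Name": "Braund, Mr. Owen Harris", "SibSp": 1, "Parch": 0},
    {"Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "SibSp": 1, "Parch": 0},
    {"Name": "Heikkinen, Miss. Laina", "SibSp": 0, "Parch": 0},
]

for row in passengers:
    row["Title"] = extract_title(row["Name"])
    row["FamilySize"] = family_size(row["SibSp"], row["Parch"])
    print(row["Title"], row["FamilySize"])
# Mr 2
# Mrs 2
# Miss 1
```

Rare titles are often merged into a single "Other" bucket before encoding; that step is left out here.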
Languages:
- Python
Core Libraries:
- `pandas`, `numpy`, `matplotlib`, `seaborn`
- `scikit-learn`, `xgboost`, `lightgbm`, `catboost`
- `imblearn` (SMOTE)
- `scipy`, `statsmodels`
- `joblib` for model persistence
Notebooks: Jupyter / Kaggle / Colab
Each project contains:
- A research-style README
- Notebooks documenting full ML workflow
- Saved models
- Visualizations & metrics
Start by exploring any project directory listed above.