This GitHub repository contains the final project submission for the Data-driven Computing Architectures course (2025). The project implements a data pipeline to ingest, process, and visualize a student depression dataset using Snowflake, followed by training a machine learning model to predict depression. The pipeline adheres to the Medallion Architecture (Bronze, Silver, Gold layers) and provides actionable insights into student mental health.
- Name: Md Aslam Hossain
- Contribution: Sole contributor, responsible for designing, implementing, and documenting the pipeline and ML model. All work is tracked via a clear history of commits in this repository.
This project focuses on building a data pipeline to analyze student mental health data (`student_depression_dataset.csv`) through four stages:

- Ingestion: Loads raw CSV data into Snowflake's bronze layer (`BRONZE_STUDENT_DATA`) and tracks lineage in `DATA_LINEAGE` using `ingest.py` (a minimal sketch follows this list).
- Processing: Cleans and aggregates data into silver (`SILVER_STUDENT_DATA`) and gold (`GOLD_STUDENT_INSIGHTS`) layers with `process.py` (see the second sketch below).
- Visualization: Generates visual insights (e.g., depression rates by gender, CGPA vs. pressure) saved in `example/` using `visualize.py`.
- Modeling: Trains a Random Forest Classifier to predict depression, saved as `model/depression_model.joblib`, with `model.py`.
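The ingestion step might look like the minimal sketch below. It assumes the `snowflake-connector-python` package (with the pandas extras) and an illustrative `DATA_LINEAGE` schema; the actual `ingest.py` in `code/` is the authoritative implementation.

```python
# Minimal ingestion sketch (assumes snowflake-connector-python with
# pandas extras; the real ingest.py may differ).
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Read the raw CSV exactly as-is for the bronze layer.
df = pd.read_csv("data/student_depression_dataset.csv")

# Connection parameters are illustrative placeholders.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)

# Load the raw rows into the bronze table (created if missing).
write_pandas(conn, df, "BRONZE_STUDENT_DATA", auto_create_table=True)

# Record a lineage entry for this load (DATA_LINEAGE columns are assumed).
conn.cursor().execute(
    "INSERT INTO DATA_LINEAGE (source_file, target_table, loaded_at) "
    "VALUES (%s, %s, CURRENT_TIMESTAMP())",
    ("student_depression_dataset.csv", "BRONZE_STUDENT_DATA"),
)
conn.close()
```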
The pipeline leverages Snowflake for scalable data storage and Python for processing and analysis, culminating in both visual outputs and a predictive model.
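Similarly, the silver/gold transforms can be expressed as SQL executed from Python, along the lines of the sketch below. All column names beyond the table names listed above are assumptions about the dataset schema; see `process.py` for the real cleaning and aggregation logic.

```python
# Sketch of the silver/gold transforms (column names are illustrative;
# process.py is the authoritative implementation).
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>"
)
cur = conn.cursor()

# Silver: cleaned, deduplicated copy of the bronze data.
cur.execute("""
    CREATE OR REPLACE TABLE SILVER_STUDENT_DATA AS
    SELECT DISTINCT *
    FROM BRONZE_STUDENT_DATA
    WHERE GENDER IS NOT NULL
""")

# Gold: aggregated insights ready for visualization.
cur.execute("""
    CREATE OR REPLACE TABLE GOLD_STUDENT_INSIGHTS AS
    SELECT GENDER,
           AVG(CGPA)       AS AVG_CGPA,
           AVG(DEPRESSION) AS DEPRESSION_RATE,
           COUNT(*)        AS N_STUDENTS
    FROM SILVER_STUDENT_DATA
    GROUP BY GENDER
""")
conn.close()
```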
- `code/`: Core pipeline scripts and ML model training. See `code/README.md` for details.
- `data/`: Sample input data (`student_depression_dataset.csv`). See `data/README.md`.
- `docs/`: Additional scripts or notebooks (placeholder). See `docs/README.md`.
- `example/`: Output visualizations and pipeline run examples. See `example/README.md` (a plotting sketch follows this list).
- `model/`: Trained ML model file (`depression_model.joblib`) generated by `model.py`.
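For a sense of the figures `visualize.py` writes to `example/`, here is a minimal matplotlib sketch of the depression-rate-by-gender chart. The column names `Gender` and `Depression` are assumptions about the CSV schema; `visualize.py` is the authoritative implementation.

```python
# Minimal visualization sketch: depression rate by gender, saved under
# example/. Column names are assumptions about the CSV schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/student_depression_dataset.csv")

# Mean of the binary Depression flag per gender = depression rate.
rates = df.groupby("Gender")["Depression"].mean()

ax = rates.plot(kind="bar")
ax.set_ylabel("Depression rate")
ax.set_title("Depression rate by gender")
plt.tight_layout()
plt.savefig("example/depression_by_gender.png")
```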
- Clone the repository:

  ```bash
  git clone https://github.com/aa-it-vasa/ddca2025-project-group-24.git
  cd ddca2025-project-group-24
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Run the ETL pipeline and train the model (a training sketch follows the commands):

  ```bash
  # 1. Ingest raw data
  python code/ingest.py

  # 2. Process to Silver/Gold
  python code/process.py

  # 3. Generate visuals
  python code/visualize.py

  # 4. Train the prediction model
  python code/model.py
  ```
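For reference, the training step in `model.py` boils down to something like the sketch below. The target column name `Depression` and the numeric-only feature selection are assumptions; consult `model.py` for the actual preprocessing and feature handling.

```python
# Minimal training sketch for the depression model (assumes a binary
# "Depression" target column; model.py is the authoritative version).
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/student_depression_dataset.csv")

# Keep only numeric features for this sketch; the real preprocessing
# may also encode categorical columns such as gender.
X = df.drop(columns=["Depression"]).select_dtypes("number")
y = df["Depression"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")

# Persist the trained model for later inference.
joblib.dump(clf, "model/depression_model.joblib")
```

Once saved, the model can be reloaded with `joblib.load("model/depression_model.joblib")` and used to predict depression for new student records with the same feature layout.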