This repository contains the solutions for the Big Data Analytics course (Summer Semester 2025, University of Luxembourg), focusing on applying Apache Spark for various machine learning tasks.
The project tackles three distinct problems using Apache Spark's MLlib library:
- Problem 1: Predicting Heart Diseases
  - Task: Binary classification to predict the presence of heart disease based on health indicators.
  - Dataset: "Key Indicators of Heart Disease" from Kaggle.
  - Models: Decision Trees and Random Forests.
  - Techniques: Data loading and schema definition, ML Pipelines, hyperparameter tuning using `CrossValidator` and `TrainValidationSplit`, and model evaluation using `BinaryClassificationMetrics`. Also explores predicting a multi-class label (`AgeCategory`) with Random Forests and `MulticlassMetrics`.
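
The two tuning strategies named for Problem 1 differ mainly in how they partition the data: `CrossValidator` scores each parameter combination on k folds, while `TrainValidationSplit` scores it on a single hold-out split. The following is a minimal sketch of that partitioning in plain Python (no Spark required). The helper names and the example grid are hypothetical and are not taken from the course solutions.

```python
import random
from typing import List, Tuple

Split = Tuple[List[int], List[int]]  # (train row indices, validation row indices)


def k_fold_splits(n_rows: int, num_folds: int, seed: int = 42) -> List[Split]:
    """Partition row indices into k folds; each fold is the validation set once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for held_out, validation in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != held_out for row in fold]
        splits.append((train, validation))
    return splits


def single_split(n_rows: int, train_ratio: float, seed: int = 42) -> Split:
    """Shuffle once and cut: one training part, one validation part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_ratio)
    return indices[:cut], indices[cut:]


def fits_during_search(num_param_combinations: int, num_splits: int) -> int:
    """Number of model fits needed to score every parameter combination."""
    return num_param_combinations * num_splits


# Hypothetical grid: 3 values of maxDepth x 2 values of numTrees = 6 combinations.
cv_fits = fits_during_search(6, len(k_fold_splits(100, num_folds=3)))
tvs_fits = fits_during_search(6, 1)
```

With this grid, the k-fold search needs three times as many fits as the single split. In return, every row is used for validation exactly once, so the averaged metric depends less on one particular shuffle.
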
- Problem 2: Weather Prediction
  - Task: Regression to predict air temperature in Luxembourg based on historical weather data.
  - Dataset: NOAA Integrated Surface Data (ISD) for Luxembourg (station 065900).
  - Models: Linear Regression and Random Forest Regressor.
  - Techniques: Custom data parsing (NOAA format), time-based train/validation/test splitting, model training, and evaluation using Mean Squared Error (MSE).
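
The parsing, splitting, and evaluation steps for Problem 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the solutions: the character offsets follow the fixed-width layout of the ISD format documentation as commonly described (observation date and time at positions 16 to 27, signed air temperature in tenths of a degree Celsius at positions 88 to 92, `+9999` for missing values) and should be checked against that document. The helper names and cut-off dates are hypothetical.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Observation(NamedTuple):
    timestamp: datetime
    air_temp_c: float


MISSING_TEMP = "+9999"  # assumed ISD marker for a missing air temperature


def parse_isd_line(line: str) -> Optional[Observation]:
    """Extract timestamp and air temperature from one fixed-width ISD record."""
    if len(line) < 92:
        return None
    raw_temp = line[87:92]  # sign plus four digits, tenths of a degree Celsius
    if raw_temp == MISSING_TEMP:
        return None
    timestamp = datetime.strptime(line[15:27], "%Y%m%d%H%M")
    return Observation(timestamp, int(raw_temp) / 10.0)


def time_based_split(
    observations: Sequence[Observation],
    validation_start: datetime,
    test_start: datetime,
) -> Tuple[List[Observation], List[Observation], List[Observation]]:
    """Split chronologically so that no future record leaks into training."""
    train = [o for o in observations if o.timestamp < validation_start]
    validation = [o for o in observations if validation_start <= o.timestamp < test_start]
    test = [o for o in observations if o.timestamp >= test_start]
    return train, validation, test


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

A chronological split is used here instead of a random one because neighbouring weather observations are strongly correlated; shuffling them would let the model see records from the same period it is later tested on.
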
- Problem 3: Recommender Systems
  - Task: Building a music artist recommender system.
  - Dataset: AudioScrobbler dataset.
  - Model: Alternating Least Squares (ALS) Matrix Factorization.
  - Techniques: Data filtering, custom user-centric train/test splitting, hyperparameter tuning (manual loops and cross-validation), evaluation using Area Under ROC Curve (AUC) via `BinaryClassificationMetrics`, comparing against a baseline, and testing with a new user profile.
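
The splitting and evaluation ideas for Problem 3 can be illustrated in plain Python. This sketch shows one plausible reading of a "user-centric" split (hold out a fraction of each user's artists while keeping every user in the training set) and computes AUC as the fraction of positive/negative pairs that are ranked correctly. It is not the implementation used in the solutions, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Play = Tuple[int, int, int]  # (user id, artist id, play count)


def user_centric_split(
    plays: Sequence[Play], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Play], List[Play]]:
    """Hold out a fraction of each user's rows; every user stays in training."""
    by_user: Dict[int, List[Play]] = defaultdict(list)
    for play in plays:
        by_user[play[0]].append(play)
    rng = random.Random(seed)
    train: List[Play] = []
    test: List[Play] = []
    for user in sorted(by_user):
        rows = list(by_user[user])
        rng.shuffle(rows)
        # int() rounds down, so at least one row per user remains for training
        n_test = int(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


def area_under_roc(scores_and_labels: Sequence[Tuple[float, int]]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, label in scores_and_labels if label == 1]
    negatives = [s for s, label in scores_and_labels if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

For a recommender, the positives would be the held-out artists a user actually listened to and the negatives a sample of artists the user never played; an AUC of 0.5 then corresponds to random ranking, which is why a baseline comparison is informative.
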
Technologies used:
- Apache Spark: Core, SQL, and MLlib modules.
- Languages: Scala and Python (mostly Scala).
The datasets come from the following sources:
- Heart Disease: "Key Indicators of Heart Disease" dataset on Kaggle.
- Weather Data: NOAA ISD (http://www1.ncdc.noaa.gov/pub/data/noaa), specifically the files for USAF identifier 065900.
- Music Recommendations: AudioScrobbler dataset.
Note: Datasets might need to be downloaded separately and placed in appropriate directories or have their paths updated within the code.