PaulAroo/spark-ml-practice-project

This repository contains solutions for the Big Data Analytics course (Summer Semester 2025, University of Luxembourg), which focus on applying Apache Spark to a range of machine learning tasks.

Overview

The project tackles three distinct problems using Apache Spark's MLlib:

  1. Problem 1: Predicting Heart Diseases

    • Task: Binary classification to predict the presence of heart disease based on health indicators.
    • Dataset: "Key Indicators of Heart Disease" from Kaggle.
    • Models: Decision Trees and Random Forests.
    • Techniques: Data loading and schema definition, ML Pipelines, hyperparameter tuning with CrossValidator and TrainValidationSplit, and model evaluation with BinaryClassificationMetrics. The project also explores predicting a multi-class label (AgeCategory) with Random Forests, evaluated with MulticlassMetrics.
  2. Problem 2: Weather Prediction

    • Task: Regression to predict air temperature in Luxembourg based on historical weather data.
    • Dataset: NOAA Integrated Surface Data (ISD) for Luxembourg (station 065900).
    • Models: Linear Regression and Random Forest Regressor.
    • Techniques: Custom data parsing (NOAA format), time-based train/validation/test splitting, model training, and evaluation using Mean Squared Error (MSE).
  3. Problem 3: Recommender Systems

    • Task: Building a music artist recommender system.
    • Dataset: AudioScrobbler dataset.
    • Model: Alternating Least Squares (ALS) Matrix Factorization.
    • Techniques: Data filtering, custom user-centric train/test splitting, hyperparameter tuning (manual loops and cross-validation), evaluation using Area Under ROC Curve (AUC) via BinaryClassificationMetrics, comparing against a baseline, and testing with a new user profile.
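
As a rough illustration of the time-based splitting and MSE evaluation listed under Problem 2, here is a plain-Python sketch. It does not use Spark and is not taken from the repository. The `(year, temperature)` record shape, the cutoff years, the sample values, and the helper names `time_based_split` and `mean_squared_error` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical record shape: (year, air temperature in degrees Celsius).
Record = Tuple[int, float]


def time_based_split(
    records: List[Record], train_end: int, valid_end: int
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split chronologically so that no later year leaks into an earlier set.

    train:      year <= train_end
    validation: train_end < year <= valid_end
    test:       year > valid_end
    """
    train = [r for r in records if r[0] <= train_end]
    valid = [r for r in records if train_end < r[0] <= valid_end]
    test = [r for r in records if r[0] > valid_end]
    return train, valid, test


def mean_squared_error(actual: List[float], predicted: List[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up observations, used only to exercise the helpers.
records = [(2018, 9.0), (2019, 10.0), (2020, 11.0), (2021, 10.0), (2022, 12.0)]
train, valid, test = time_based_split(records, train_end=2020, valid_end=2021)

# Trivial baseline: always predict the mean training temperature.
baseline = sum(temp for _, temp in train) / len(train)
test_mse = mean_squared_error([temp for _, temp in test], [baseline] * len(test))
print(f"baseline={baseline}, test MSE={test_mse}")
```

In a Spark job the same idea would typically be expressed as filters on a date column, with `RegressionEvaluator` (metric name `"mse"`) computing the error. Whether the repository does it exactly this way is not stated here.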

Technologies Used

  • Apache Spark: Core, SQL, and MLlib modules.
  • Scala / Python: mostly Scala.

Datasets

Note: The datasets may need to be downloaded separately. Either place them in the directories the code expects or update the paths in the code.
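
One way to avoid editing paths in several places is to resolve every dataset location from a single base directory. The sketch below is only an assumption about how this could be done, not the repository's actual layout: the environment variable `SPARK_ML_DATA_DIR`, the default `data` directory, the helper `dataset_path`, and the file name `heart.csv` are all hypothetical.

```python
import os
from pathlib import Path


def dataset_path(
    file_name: str,
    env_var: str = "SPARK_ML_DATA_DIR",  # hypothetical variable name
    default_dir: str = "data",           # hypothetical default directory
) -> Path:
    """Return the path to a dataset file under a configurable base directory."""
    base = Path(os.environ.get(env_var, default_dir))
    return base / file_name


# Hypothetical file name; substitute the files you actually downloaded.
heart_csv = dataset_path("heart.csv")
print(heart_csv)
```

With this approach, pointing the code at a different download location only requires setting the environment variable before launching the job.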
