Binary classification pipeline for the Spaceship Titanic Kaggle challenge using XGBoost, AutoML, and feature engineering for high-accuracy predictions.

razzf/survival-prediction-machine-learning

Spaceship Titanic Competition: Predictive ML Modeling of a binary classification problem

Project Overview

This project predicts the fate of passengers aboard the Spaceship Titanic, a fictional interstellar vessel that encountered a spacetime anomaly. The objective is to determine whether each passenger was transported to an alternate dimension (Transported: True) or not (Transported: False). The project covers data preprocessing, feature engineering, exploratory data analysis (EDA), and the development of machine learning models for binary classification. The ultimate goal is to produce accurate predictions that can assist rescue crews in identifying and retrieving transported passengers. Automated hyperparameter tuning and AutoML were also applied during the project.

Objectives

The main objective is to build an accurate machine learning model that predicts whether a passenger was transported during the Spaceship Titanic incident. The evaluation metric is accuracy, and models must achieve a minimum accuracy score of 0.79 to be considered successful. Additional goals include:

  • Exploring and visualizing key features such as age, cabin, passenger groupings, spending behavior, and cryo-sleep status to uncover meaningful patterns.

  • Engineering new features and handling missing data effectively to improve model performance.

  • Comparing and evaluating various classification algorithms to identify the most effective approach.

  • Generating a submission file with the predictions.
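The accuracy criterion above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the helper names `accuracy` and `meets_target` are hypothetical, and the labels are toy values rather than project results.

```python
from typing import Sequence

TARGET_ACCURACY = 0.79  # minimum score stated in the objectives


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_target(score: float, target: float = TARGET_ACCURACY) -> bool:
    """Check a score against the project's minimum accuracy."""
    return score >= target


# Toy labels for illustration only (hypothetical, not project results).
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, False]
score = accuracy(y_true, y_pred)
print(f"accuracy={score:.2f}, meets target: {meets_target(score)}")
```

In practice a library routine such as scikit-learn's `accuracy_score` would likely be used instead; the hand-written version only makes the definition explicit.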

Key Insights

  • CryoSleep is the strongest predictor of being transported. Passengers in CryoSleep are far more likely to be transported than those awake.

  • HomePlanet, especially Europa, shows important interactions with CryoSleep, increasing predictive power. Other influential features include spending on Spa, RoomService, and VRDeck.

  • High onboard spending on Spa, VRDeck, and RoomService is linked to a lower chance of transport, while spending in FoodCourt and ShoppingMall correlates with higher transport rates.

  • The best result (accuracy: 0.80640) was achieved with an untuned XGBoost model on transformed features. More complex approaches (tuned models, ensembles, AutoML) did not improve performance.
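Insights such as the CryoSleep effect come from comparing transport rates across groups. The sketch below shows one way to compute such a rate per category in plain Python. The function name `transport_rate_by` is hypothetical and the rows are made-up toy data, so the printed rates say nothing about the real dataset.

```python
from collections import defaultdict


def transport_rate_by(rows, column):
    """Return {category: share of rows with Transported == True}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [transported, total]
    for row in rows:
        key = row[column]
        counts[key][0] += int(row["Transported"])
        counts[key][1] += 1
    return {key: hit / total for key, (hit, total) in counts.items()}


# Made-up rows for illustration only; not taken from the Kaggle data.
toy_rows = [
    {"CryoSleep": True, "Transported": True},
    {"CryoSleep": True, "Transported": True},
    {"CryoSleep": True, "Transported": False},
    {"CryoSleep": False, "Transported": False},
    {"CryoSleep": False, "Transported": True},
    {"CryoSleep": False, "Transported": False},
]
rates = transport_rate_by(toy_rows, "CryoSleep")
for category, rate in rates.items():
    print(f"CryoSleep={category}: transported share {rate:.2f}")
```

The same helper can be pointed at any categorical column, for example HomePlanet, to compare groups.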

Model Evaluation

Model results are evaluated on the test-set predictions submitted to the Kaggle competition.
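A submission of this kind is a two-column CSV. The following is a minimal sketch that assumes the competition's expected header is `PassengerId,Transported` with `True`/`False` values; the function name `write_submission` and the passenger IDs are hypothetical placeholders.

```python
import csv
import io


def write_submission(passenger_ids, predictions, out):
    """Write one PassengerId/Transported row per prediction to a text stream."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("passenger_ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Transported"])  # assumed header
    for passenger_id, prediction in zip(passenger_ids, predictions):
        writer.writerow([passenger_id, bool(prediction)])


# Placeholder IDs and predictions, written to memory instead of disk.
buffer = io.StringIO()
write_submission(["0013_01", "0018_01"], [True, False], buffer)
print(buffer.getvalue())
```

To write an actual file, the same function could be called with `open("submission.csv", "w", newline="")` as the output stream.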

Installation

To set up this project locally:

  1. Clone the repository:
    git clone https://github.com/razzf/survival-prediction-machine-learning.git
  2. Navigate to the project directory:
    cd survival-prediction-machine-learning
  3. Install required packages: Ensure Python is installed and use the following command:
    pip install -r requirements.txt

Usage

Open the notebooks in Jupyter or JupyterLab to explore the analysis. Run the cells in order to follow the workflow from data exploration to model building and evaluation. For more detail, refer to the notebook overview below.

Data

The dataset is located in the /data directory and originates from Kaggle. It reflects the passenger list of the fictional Spaceship Titanic at the time of the incident. It contains records for about 13,000 passengers with 12 features (e.g. age, name, HomePlanet, Destination, expenditures) and one variable indicating whether the person was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.
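Before modeling, it helps to check how many values are missing per column. The sketch below does this with the standard library only. The function name `count_missing` is hypothetical, and the inline CSV is a made-up sample with only a few of the columns, used here so the example runs without the data files.

```python
import csv
import io


def count_missing(source):
    """Return (row_count, {column: number of empty cells}) for a CSV stream."""
    reader = csv.DictReader(source)
    missing = {name: 0 for name in reader.fieldnames or []}
    total = 0
    for row in reader:
        total += 1
        for name in missing:
            value = row.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1
    return total, missing


# Made-up sample rows for illustration; not copied from the dataset.
sample_csv = io.StringIO(
    "PassengerId,HomePlanet,CryoSleep,Age,Transported\n"
    "0001_01,Europa,False,39.0,False\n"
    "0002_01,,False,24.0,True\n"
    "0003_01,Earth,,,False\n"
)
row_count, missing_counts = count_missing(sample_csv)
print(row_count, missing_counts)
```

For the real files, the function could be called on `open("data/train.csv", newline="")`, assuming the working directory is the project root.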

Directory Structure

project-root/
├── custom_modules/
│   ├── custom_transformers.py         # Module for custom pipeline transformers
│   ├── plotting.py                    # Module for plotting visualizations
│   └── stat_calculations.py           # Module for statistical calculations
├── data/
│   ├── test.csv                       # test dataset
│   └── train.csv                      # training dataset including target
├── notebooks/
│   ├── AutoML_1/                      # Results from the AutoML process 1 
│   ├── AutoML_2/                      # Results from the AutoML process 2  
│   ├── data preparation, EDA, statistical inference.ipynb   # Jupyter notebook_1 for data prep, EDA, and statistical inference
│   ├── machine learning modeling.ipynb                      # Jupyter notebook_2 for machine learning modeling and evaluation
│   └── submission.csv                 # Latest submitted test prediction 
├── requirements.txt                   # Python dependencies
└── README.md                          # Project documentation

Requirements

The requirements.txt file lists all Python dependencies. Install them using the command provided above.

Notebook Overview

The notebooks include the following sections:

Notebook 1: Data Preparation, EDA, and Statistical Inference

  1. Introduction
  2. Problem Discovery
  3. Data Acquisition
  4. Exploratory Data Analysis
  5. Statistical Inference and Evaluation

Notebook 2: Machine Learning Modeling

  1. Introduction
  2. Load data
  3. Split train data
  4. Feature Engineering
  5. Model Training, Evaluation, and Tuning
  6. AutoML
  7. Submission
  8. Suggestions for Improvement
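As an illustration of the feature engineering step, the sketch below derives a few features from a single passenger record. It is not the notebook's implementation. It assumes that PassengerId has the form `group_number`, that Cabin has the form `deck/num/side`, and that missing spending can be treated as zero; the function name `engineer_features` and the new feature names are hypothetical.

```python
SPEND_COLUMNS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def engineer_features(row):
    """Return a copy of the row with group, cabin, and spending features added."""
    features = dict(row)

    # Assumed PassengerId layout: "<group>_<number within group>".
    group, _, member = row["PassengerId"].partition("_")
    features["Group"] = group
    features["GroupMember"] = member

    # Assumed Cabin layout: "<deck>/<num>/<side>"; anything else counts as missing.
    parts = (row.get("Cabin") or "").split("/")
    if len(parts) == 3:
        features["Deck"], features["CabinNum"], features["Side"] = parts
    else:
        features["Deck"] = features["CabinNum"] = features["Side"] = None

    # Assumption: an empty spending cell is treated as zero.
    features["TotalSpend"] = sum(float(row.get(col) or 0) for col in SPEND_COLUMNS)
    return features


# Made-up record for illustration only.
toy_row = {
    "PassengerId": "0003_02", "Cabin": "A/0/S", "RoomService": "0",
    "FoodCourt": "3576", "ShoppingMall": "", "Spa": "6715", "VRDeck": "193",
}
engineered = engineer_features(toy_row)
print(engineered["Group"], engineered["Deck"], engineered["Side"], engineered["TotalSpend"])
```

Whether zero is the right fill value for missing spending is a modeling choice; other imputation strategies may suit the data better.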
