A machine learning solution for the Kaggle Santander Customer Value Prediction competition, improving predictive performance through model stacking and ensemble techniques.
This repository contains a solution for the Santander Customer Value Prediction Kaggle competition. The challenge involves predicting the value of transactions for potential customers, helping Santander provide a more personalized customer experience.
The solution implements a stacked ensemble approach combining:
- Gradient Boosting Regressor
- LightGBM Regressor
- Random Forest Regressor
These models are combined using a Lasso regression stacking technique to produce the final predictions, with k-fold cross-validation to ensure robustness.
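The stacking approach above can be sketched as follows. This is a minimal illustration on synthetic data, not the repository's actual code: it uses 5 folds instead of 20 for brevity, and substitutes a second scikit-learn model for LightGBM so the sketch only depends on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic stand-in for the Santander training data
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]  # the actual solution adds an LGBMRegressor as a third base model

# Out-of-fold predictions from each base model become the
# meta-learner's training features, avoiding leakage
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))
for i, model in enumerate(base_models):
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, i] = model.predict(X[val_idx])

# Lasso meta-learner combines the base-model predictions
meta = Lasso(alpha=0.1)
meta.fit(oof, y)
stacked_preds = meta.predict(oof)
```

Training the meta-learner on out-of-fold predictions (rather than in-sample predictions) is what keeps the stack honest: each base-model prediction fed to the Lasso was made on data that model never saw during fitting.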
- Python 3.x
- Required libraries:
  - pandas
  - numpy
  - scikit-learn
  - lightgbm
- Feature Selection: Implements a curated list of the most predictive features
- Data Preprocessing: Handles missing values and adds statistical features
- Model Ensemble: Combines multiple regression models for improved accuracy
- K-Fold Validation: Uses 20-fold cross-validation for robust performance evaluation
- Stacking Technique: Employs model stacking with Lasso regression for final predictions
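The statistical features mentioned above can be derived row-wise with pandas. The column names below are illustrative placeholders; the real `train.csv` uses anonymized feature names.

```python
import pandas as pd

# Toy stand-in for the (anonymized) Santander feature columns
df = pd.DataFrame({
    "f1": [0.0, 1.0],
    "f2": [1.0, 3.0],
    "f3": [2.0, 5.0],
    "f4": [3.0, 7.0],
    "f5": [4.0, 9.0],
})

# Capture the original feature columns before adding derived ones
feature_cols = df.columns.tolist()

# Row-wise summary statistics as additional features
df["mean"] = df[feature_cols].mean(axis=1)
df["median"] = df[feature_cols].median(axis=1)
df["sum"] = df[feature_cols].sum(axis=1)
df["std"] = df[feature_cols].std(axis=1)
df["kurtosis"] = df[feature_cols].kurtosis(axis=1)
```

Note that pandas' bias-corrected kurtosis needs at least four values per row, which the high-dimensional Santander data easily satisfies.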
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/KaggleSantanderValuePrediction.git
  cd KaggleSantanderValuePrediction
  ```

- Install required dependencies:

  ```bash
  pip install pandas numpy scikit-learn lightgbm
  ```
- Download the competition data from Kaggle and place the `train.csv` and `test.csv` files in the repository root directory.
Run the main script to train the models and generate predictions:

```bash
python valuePrediction.py
```

This will:

- Load and preprocess the training and test data
- Train multiple regression models using k-fold cross-validation
- Create an ensemble prediction using model stacking
- Generate a `test_set_prediction.csv` file with predictions in the format required for Kaggle submission
The solution follows these key steps:

- Data Preprocessing:
  - Removing columns containing only zeros
  - Imputing missing values
  - Adding statistical features (mean, median, sum, standard deviation, kurtosis)
- Feature Selection:
  - Using a predefined list of the most predictive features
- Model Training:
  - Training multiple regression models with optimized hyperparameters
  - Using 20-fold cross-validation to ensure robustness
- Ensemble Creation:
  - Combining model predictions using stacking
  - Using Lasso regression as a meta-learner
  - Applying the geometric mean for final prediction aggregation
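The final geometric-mean aggregation can be sketched as below. This assumes all model outputs are positive (transaction values are), which is required for the log transform; the model labels and numbers are illustrative.

```python
import numpy as np

# Predictions from three models for the same four test rows (illustrative values)
preds = np.array([
    [100.0, 200.0, 50.0, 10.0],   # e.g. gradient boosting
    [110.0, 180.0, 55.0, 12.0],   # e.g. LightGBM
    [ 90.0, 210.0, 45.0,  9.0],   # e.g. random forest
])

# Geometric mean = exp(mean(log(x))), taken across models for each row;
# it is less sensitive to a single large outlier than the arithmetic mean
final = np.exp(np.mean(np.log(preds), axis=0))
```

For right-skewed targets like transaction values, averaging in log space tends to be more robust than a plain arithmetic mean of the model outputs.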
This project is licensed under the MIT License - see the LICENSE file for details.