A machine learning solution for the Kaggle Santander Customer Value Prediction competition, improving predictive performance through model stacking and ensemble techniques.
This repository contains a solution for the Santander Customer Value Prediction Kaggle competition. The challenge involves predicting the value of transactions for potential customers, helping Santander provide a more personalized customer experience.
The solution implements a stacked ensemble approach combining:
- Gradient Boosting Regressor
- LightGBM Regressor
- Random Forest Regressor
These models are combined using a Lasso regression stacking technique to produce the final predictions, with k-fold cross-validation to ensure robustness.
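The stacking approach above can be sketched as follows. This is a minimal illustration on synthetic data, not the repository's actual code: it uses 5 folds instead of 20 for brevity, and substitutes a second scikit-learn model for LightGBM so the sketch only depends on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic stand-in for the Santander training data
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]  # the actual solution adds an LGBMRegressor as a third base model

# Out-of-fold predictions from each base model become the
# meta-learner's training features, avoiding leakage
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))
for i, model in enumerate(base_models):
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, i] = model.predict(X[val_idx])

# Lasso meta-learner combines the base-model predictions
meta = Lasso(alpha=0.1)
meta.fit(oof, y)
stacked_preds = meta.predict(oof)
```

Training the meta-learner on out-of-fold predictions (rather than in-sample predictions) is what keeps the stack honest: each base-model prediction fed to the Lasso was made on data that model never saw during fitting.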
- Python 3.x
- Required libraries:
  - pandas
  - numpy
  - scikit-learn
  - lightgbm
- Feature Selection: Implements a curated list of the most predictive features
- Data Preprocessing: Handles missing values and adds statistical features
- Model Ensemble: Combines multiple regression models for improved accuracy
- K-Fold Validation: Uses 20-fold cross-validation for robust performance evaluation
- Stacking Technique: Employs model stacking with Lasso regression for final predictions
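The statistical features mentioned above can be derived row-wise with pandas. The column names below are illustrative placeholders; the real `train.csv` uses anonymized feature names.

```python
import pandas as pd

# Toy stand-in for the (anonymized) Santander feature columns
df = pd.DataFrame({
    "f1": [0.0, 1.0],
    "f2": [1.0, 3.0],
    "f3": [2.0, 5.0],
    "f4": [3.0, 7.0],
    "f5": [4.0, 9.0],
})

# Capture the original feature columns before adding derived ones
feature_cols = df.columns.tolist()

# Row-wise summary statistics as additional features
df["mean"] = df[feature_cols].mean(axis=1)
df["median"] = df[feature_cols].median(axis=1)
df["sum"] = df[feature_cols].sum(axis=1)
df["std"] = df[feature_cols].std(axis=1)
df["kurtosis"] = df[feature_cols].kurtosis(axis=1)
```

Note that pandas' bias-corrected kurtosis needs at least four values per row, which the high-dimensional Santander data easily satisfies.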
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/KaggleSantanderValuePrediction.git
  cd KaggleSantanderValuePrediction
  ```

- Install required dependencies:

  ```bash
  pip install pandas numpy scikit-learn lightgbm
  ```
- Download the competition data from Kaggle and place the `train.csv` and `test.csv` files in the repository root directory.
Run the main script to train the models and generate predictions:

```bash
python valuePrediction.py
```

This will:

- Load and preprocess the training and test data
- Train multiple regression models using k-fold cross-validation
- Create an ensemble prediction using model stacking
- Generate a `test_set_prediction.csv` file with predictions in the format required for Kaggle submission
The solution follows these key steps:

- Data Preprocessing:
  - Removing columns containing only zeros
  - Imputing missing values
  - Adding statistical features (mean, median, sum, standard deviation, kurtosis)
- Feature Selection:
  - Using a predefined list of the most predictive features
- Model Training:
  - Training multiple regression models with optimized hyperparameters
  - Using 20-fold cross-validation to ensure robustness
- Ensemble Creation:
  - Combining model predictions using stacking
  - Using Lasso regression as a meta-learner
  - Applying the geometric mean for final prediction aggregation
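The final geometric-mean aggregation can be sketched as below. This assumes all model outputs are positive (transaction values are), which is required for the log transform; the model labels and numbers are illustrative.

```python
import numpy as np

# Predictions from three models for the same four test rows (illustrative values)
preds = np.array([
    [100.0, 200.0, 50.0, 10.0],   # e.g. gradient boosting
    [110.0, 180.0, 55.0, 12.0],   # e.g. LightGBM
    [ 90.0, 210.0, 45.0,  9.0],   # e.g. random forest
])

# Geometric mean = exp(mean(log(x))), taken across models for each row;
# it is less sensitive to a single large outlier than the arithmetic mean
final = np.exp(np.mean(np.log(preds), axis=0))
```

For right-skewed targets like transaction values, averaging in log space tends to be more robust than a plain arithmetic mean of the model outputs.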
This project is licensed under the MIT License - see the LICENSE file for details.