This repository contains a general-purpose machine learning pipeline, demonstrated on a housing price prediction task. Although the current implementation targets housing prices, the structure is designed to adapt to other prediction tasks with minor modifications.
The pipeline follows these main steps:
- Data Loading: Loads the dataset. In the example, housing datasets such as California and Ames are used.
- Data Exploration: Explores the dataset to understand its characteristics.
- Data Preprocessing: Processes the data to ensure it is suitable for modeling.
- Train-Test Split: Divides the dataset into training and testing subsets.
- Feature Selection: Uses RandomForestRegressor to identify significant features. This step can be adapted for other feature selection methods.
- Model Building: Constructs a predictive model. The example uses the LightGBM algorithm, but other algorithms can be substituted.
- Hyperparameter Tuning: Optimizes model parameters. GridSearchCV is employed in the example.
- Model Evaluation: Assesses the model's performance using various metrics.
- Model Saving: Serializes the trained model for deployment or future use.
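The steps from the train-test split through model saving can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `pipeline.py`: the function name `run_pipeline`, the parameter grid, the "importance at or above the mean" selection rule, and the choice of RMSE, MAE, and R² as metrics are all hypothetical. Synthetic data from `make_regression` stands in for a housing dataset, and scikit-learn's `GradientBoostingRegressor` stands in for LightGBM so the sketch runs with scikit-learn alone.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def run_pipeline(X, y, estimator, param_grid, model_path):
    """Hypothetical helper: split, select features, tune, evaluate, save."""
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Feature selection: keep features whose random-forest importance is
    # at or above the mean importance (an assumed selection rule).
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(X_train, y_train)
    importances = pd.Series(forest.feature_importances_, index=X_train.columns)
    selected = importances[importances >= importances.mean()].index.tolist()

    # Hyperparameter tuning with cross-validated grid search
    search = GridSearchCV(
        estimator, param_grid, cv=3, scoring="neg_mean_squared_error"
    )
    search.fit(X_train[selected], y_train)

    # Model evaluation on the held-out test set
    pred = search.best_estimator_.predict(X_test[selected])
    metrics = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

    # Model saving
    joblib.dump(search.best_estimator_, model_path)
    return metrics, selected, search.best_params_


# Synthetic stand-in for a housing dataset (assumption for illustration).
features, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])

model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
metrics, selected, best_params = run_pipeline(
    X,
    y,
    GradientBoostingRegressor(random_state=0),  # stand-in for LightGBM
    {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    model_path,
)
print(metrics, selected, best_params)
```

Because the estimator is passed in as an argument, `lightgbm.LGBMRegressor`, which follows the scikit-learn estimator interface, could be supplied in place of the stand-in, with a parameter grid suited to it.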
The pipeline depends on the following Python packages:
- pandas
- scikit-learn
- LightGBM
- joblib
To execute this pipeline, run:
    python pipeline.py
To adapt this pipeline to another task, adjust the data loading, the preprocessing, and the choice of machine learning algorithm to fit that task's requirements.
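One way to picture such an adaptation is to isolate the task-specific parts (data loading, preprocessing, and the model) behind a small configuration object. The sketch below is only an assumption about how that could look; `TaskConfig`, `run_task`, and the toy classification task are hypothetical and do not appear in this repository.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


@dataclass
class TaskConfig:
    """Hypothetical container for the parts that change between tasks."""

    load_data: Callable[[], Tuple[pd.DataFrame, pd.Series]]
    build_model: Callable[[], BaseEstimator]


def load_toy_classification():
    # Synthetic data standing in for a real, task-specific loader.
    X, y = make_classification(
        n_samples=200, n_features=6, n_informative=3, random_state=0
    )
    columns = [f"f{i}" for i in range(6)]
    return pd.DataFrame(X, columns=columns), pd.Series(y, name="label")


def build_classifier():
    # Preprocessing and algorithm are bundled so both can be swapped together.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def run_task(config, test_size=0.2, random_state=0):
    """Load data, split, fit, and return the model's test-set score."""
    X, y = config.load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = config.build_model().fit(X_train, y_train)
    return model.score(X_test, y_test)  # accuracy for classifiers


toy_task = TaskConfig(load_data=load_toy_classification, build_model=build_classifier)
accuracy = run_task(toy_task)
print(f"test accuracy: {accuracy:.3f}")
```

With this layout, moving to a new task means writing a new loader and model builder while the split, fit, and scoring logic stays unchanged.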