Skip to content

jasonjiang8866/tabularML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Generic Machine Learning Pipeline

This repository contains a versatile machine learning pipeline, exemplified with a housing price prediction task. While the current implementation is tailored to housing price prediction, the structure is designed to be adaptable for various other prediction tasks with minor modifications.

Overview

The pipeline follows these main steps:

  1. Data Loading: Loads the dataset. In the example, housing datasets such as California and Ames are used.
  2. Data Exploration: Explores the dataset to understand its characteristics.
  3. Data Preprocessing: Processes the data to ensure it is suitable for modeling.
  4. Train-Test Split: Divides the dataset into training and testing subsets.
  5. Feature Selection: Uses RandomForestRegressor to identify significant features. This step can be adapted for other feature selection methods.
  6. Model Building: Constructs a predictive model. The example uses the LightGBM algorithm, but other algorithms can be substituted.
  7. Hyperparameter Tuning: Optimizes model parameters. GridSearchCV is employed in the example.
  8. Model Evaluation: Assesses the model's performance using various metrics.
  9. Model Saving: Serializes the trained model for deployment or future use.

Libraries Utilized

  • pandas
  • scikit-learn
  • LightGBM
  • joblib

Usage

To execute this pipeline, run:

python pipeline.py

For adapting this pipeline to other tasks, users may need to adjust data loading, preprocessing, and the choice of machine learning algorithm as per the specific requirements.

About

a simple pipeline for tabular data

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages