This is an end-to-end ML project for tabular data that incorporates software engineering principles in machine learning. It spans the whole lifecycle of an ML model, from data exploration, preprocessing, feature engineering, model selection, training, and evaluation to deployment.
The project leverages the Diabetes Health Indicators public dataset from UCI, which was originally sourced from the CDC. The dataset contains patient information such as demographics, lab results, and self-reported health history. The goal is to develop a classifier that can discern whether a patient has diabetes, is pre-diabetic, or is healthy.
The project adheres to best practices of machine learning engineering, such as modular code, documentation, testing, logging, configuration, and version control. It also demonstrates how to use various tools and frameworks, such as pandas, scikit-learn, Feast, Optuna, Comet (for experiment tracking), and Docker, to facilitate the ML workflow and improve model performance.
Some of the notable features of the project are:
- The repo is configurable through config files, which allow the user to change the dataset, hyperparameters, and other settings without modifying the code (a minimal loading sketch appears after this list).
- Hyperparameters are optimized with Optuna, a hyperparameter optimization framework that offers efficient search algorithms as well as parallel and distributed optimization.
- The optimization metric is the F-beta score, a generalization of the F1 score (beta = 1) that can be weighted toward precision or recall. This suits many practical use cases, since in reality precision and recall are rarely equally important (see the worked sketch after this list).
- Reproducibility is ensured with Docker, a tool that creates isolated environments for running applications. The project containerizes the champion model with its dependencies and provides a devcontainer configuration that lets the user recreate the dev environment in VS Code.
- Experiments are tracked and logged with Comet, a platform for managing and comparing ML experiments. Model performance, hyperparameters, and artifacts are logged to Comet and can be accessed, visualized, and compared through its web dashboard.
- A Makefile provides convenient CLI commands that improve the efficiency, reliability, and quality of development, testing, and deployment. For instance, running `make prep_data` transforms the raw data into features and stores them in a local file to be ingested by the feature store, whereas running `make setup_feast` sets up the feature store and applies its feature definitions.
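For example, a script can read its settings from one of these config files at startup. The sketch below is a minimal illustration using PyYAML; the keys shown are hypothetical and do not reflect the actual schema of `training-config.yml`:

```python
import yaml

# Minimal illustration of config-driven behavior; the keys below are
# hypothetical, not the real schema of training-config.yml.
with open("src/config/training-config.yml") as f:
    config = yaml.safe_load(f)

n_trials = config["optuna"]["n_trials"]       # hypothetical key
fbeta_beta = config["metrics"]["fbeta_beta"]  # hypothetical key
```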
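To make the F-beta trade-off concrete, here is a small sketch using scikit-learn's `fbeta_score`; the labels and the choice of beta = 2 are purely illustrative, not the project's actual settings:

```python
from sklearn.metrics import fbeta_score

# F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
# beta > 1 weights recall more heavily; beta < 1 favors precision.
# Illustrative 3-class labels (0 = healthy, 1 = pre-diabetic, 2 = diabetic).
y_true = [0, 2, 1, 0, 2, 2, 1, 0]
y_pred = [0, 2, 0, 0, 2, 1, 1, 0]

# With beta = 2, missing a (pre-)diabetic patient hurts the score more
# than raising a false alarm does.
print(fbeta_score(y_true, y_pred, beta=2.0, average="macro"))
```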
The project consists of the following folders and files:
- `notebooks`: contains a notebook (`eda.ipynb`) that conducts exploratory data analysis (EDA) on the dataset, including descriptive statistics, data visualization, and correlation analysis. It also establishes a scikit-learn baseline model (logistic regression) that achieves precision and recall scores above 0.80 on the test set; pushing accuracy higher is beyond the scope of this project. Other notebooks can be added to this folder as needed.
- `src/config`: contains configuration files with parameters for data preprocessing, data transformation, and model training.
- `src/feature_store`: contains the scripts for data ingestion and transformation. `generate_initial_data.py` imports the original dataset from UCI and carves out an inference set (5% of the original dataset), which is used to simulate production data scored by the deployed model via a REST API call. `prep_data.py` preprocesses and transforms the raw dataset before it is ingested into the feature store. For more details about the feature store setup, see the `README.md` in the `feature_store` folder.
- `src/training`: contains the scripts for data splitting, model training, evaluation, and selection. Training flow:
  - `split_data.py`: imports preprocessed features from the feature store using Feast's `get_historical_features()`, then splits them into train/validation/test sets saved as parquet files in `src/feature_store/feature_repo/data/`. The class labels at this stage are the raw values stored in the feature store.
  - `train.py`: loads the parquet files created by `split_data.py`, then calls `prepare_data()`, which:
    - encodes class labels using `LabelEncoder` (handles both string and integer class labels in config),
    - creates the data transformation pipeline (imputation, scaling, encoding, feature selection),
    - returns preprocessed features and encoded class labels for model training.
  - Hyperparameter optimization: uses Optuna to optimize the candidate models (Logistic Regression, Random Forest, LightGBM, XGBoost).
  - Model tracking: all experiments are tracked with Comet.ml, covering metrics, parameters, and artifacts.
  - `evaluate.py`: selects the best model, calibrates it, and registers it as the champion model in the Comet workspace if its test score meets the threshold.

  Minimal sketches of several of these steps appear after this folder list.
- `src/inference`: contains the script for scoring new data via a REST API using the containerized model, which is deployed through a GitHub Actions CI/CD pipeline (an example request appears at the end of this README).
- `src/utils`: contains utility modules used throughout the project, including a logger class that redirects printed messages (plus select events that need logging) to logger objects, and a constants module that defines paths to the data and training-artifacts directories.
- `tests`: contains test scripts that validate the functionality of the project's ML components. A dedicated tests folder promotes code quality and reliability and helps identify and fix issues early in the development process.
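As a rough illustration of the `split_data.py` step, the following sketch shows how historical features are typically pulled from a Feast repo; the entity columns and feature view names are assumptions, not the project's actual definitions:

```python
import pandas as pd
from feast import FeatureStore

# Point Feast at the feature repo defined in the tree below.
store = FeatureStore(repo_path="src/feature_store/feature_repo")

# Entity dataframe: which rows/timestamps to join features for.
# The column names here are hypothetical.
entity_df = pd.read_parquet(
    "src/feature_store/feature_repo/data/raw_dataset.parquet"
)[["patient_id", "event_timestamp"]]

historical = store.get_historical_features(
    entity_df=entity_df,
    features=["health_stats:bmi", "health_stats:high_bp"],  # hypothetical views
).to_df()
```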
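The kind of preparation `prepare_data()` performs can be sketched as follows; the target column name and the exact pipeline steps are assumptions, not the repo's code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

train = pd.read_parquet("src/feature_store/feature_repo/data/train.parquet")
X = train.drop(columns=["target"])  # "target" is a hypothetical column name
y = LabelEncoder().fit_transform(train["target"])  # str or int labels -> 0..k-1

numeric_cols = X.select_dtypes("number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Imputation + scaling for numeric columns, one-hot encoding for the rest;
# feature selection is omitted here for brevity.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_prepared = preprocessor.fit_transform(X)
```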
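The hyperparameter search can be pictured as below, using LightGBM as one of the candidates; the search space, trial count, and synthetic data are illustrative only:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
f2 = make_scorer(fbeta_score, beta=2.0, average="macro")

def objective(trial):
    # Illustrative search space, not the project's actual ranges.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    model = LGBMClassifier(**params, verbose=-1)
    return cross_val_score(model, X, y, scoring=f2, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```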
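Experiment tracking with Comet follows the usual pattern; the project name and logged values below are placeholders:

```python
from comet_ml import Experiment

# API key is typically read from the COMET_API_KEY environment variable.
experiment = Experiment(project_name="end-to-end-ml")  # placeholder name

# Log what a run produced; the names and values here are placeholders.
experiment.log_parameters({"n_estimators": 300, "learning_rate": 0.05})
experiment.log_metric("f2_macro", 0.84)
experiment.log_model("lightgbm", "src/training/artifacts/lightgbm.pkl")
experiment.end()
```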
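Finally, the calibration step in `evaluate.py` can be sketched with scikit-learn's `CalibratedClassifierCV`; the estimator, calibration method, and synthetic data are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the selected model so its predicted probabilities are calibrated
# via cross-validation before it is registered as the champion.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)
```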
Below is the repo structure.
```
end-to-end-ml
├── LICENSE
├── Makefile
├── README.md
├── __init__.py
├── dist
│ ├── end_to_end_ml-0.1.0-py3-none-any.whl
│ └── end_to_end_ml-0.1.0.tar.gz
├── img
│ └── feast_workflow.png
├── notebooks
│ ├── eda.ipynb
│ └── utils.py
├── pyproject.toml
├── pytest.ini
├── src
│ ├── __init__.py
│ ├── config
│ │ ├── __init__.py
│ │ ├── feature-store-config.yml
│ │ ├── logging.conf
│ │ └── training-config.yml
│ ├── end_to_end_ml.egg-info
│ │ ├── PKG-INFO
│ │ ├── SOURCES.txt
│ │ ├── dependency_links.txt
│ │ ├── requires.txt
│ │ └── top_level.txt
│ ├── feature_store
│ │ ├── README.md
│ │ ├── feature_repo
│ │ │ ├── data
│ │ │ │ ├── historical_data.parquet
│ │ │ │ ├── inference.parquet
│ │ │ │ ├── online_store.db
│ │ │ │ ├── preprocessed_dataset_features.parquet
│ │ │ │ ├── preprocessed_dataset_target.parquet
│ │ │ │ ├── raw_dataset.parquet
│ │ │ │ ├── registry.db
│ │ │ │ ├── test.parquet
│ │ │ │ ├── train.parquet
│ │ │ │ └── validation.parquet
│ │ │ ├── define_feature.py
│ │ │ └── feature_store.yaml
│ │ ├── generate_initial_data.py
│ │ ├── prep_data.py
│ │ └── utils
│ │ ├── __init__.py
│ │ ├── config.py
│ │ └── prep.py
│ ├── inference
│ │ ├── Dockerfile
│ │ ├── main.py
│ │ ├── predict.py
│ │ └── utils
│ │ ├── __init__.py
│ │ └── model.py
│ ├── training
│ │ ├── artifacts
│ │ │ ├── champion_model.pkl
│ │ │ ├── experiment_keys.csv
│ │ │ ├── lightgbm.pkl
│ │ │ ├── logistic-regression.pkl
│ │ │ ├── random-forest.pkl
│ │ │ ├── study_LGBMClassifier.csv
│ │ │ ├── study_LogisticRegression.csv
│ │ │ ├── study_RandomForestClassifier.csv
│ │ │ ├── study_XGBClassifier.csv
│ │ │ ├── voting-ensemble.pkl
│ │ │ └── xgboost.pkl
│ │ ├── evaluate.py
│ │ ├── split_data.py
│ │ ├── train.py
│ │ └── utils
│ │ ├── __init__.py
│ │ ├── config.py
│ │ ├── data.py
│ │ ├── job.py
│ │ └── model.py
│ └── utils
│ ├── __init__.py
│ ├── config_loader.py
│ ├── logger.py
│ └── path.py
├── tests
│ ├── __init__.py
│ ├── test_feature_store
│ │ ├── test_data_preprocessor.py
│ │ ├── test_data_splitter.py
│ │ ├── test_data_transformer.py
│ │ └── test_feature_store_config.py
│ ├── test_inference
│ │ └── test_inference_model.py
│ ├── test_training
│ │ ├── test_data_utils.py
│ │ ├── test_job.py
│ │ ├── test_training_config.py
│ │ └── test_training_model.py
│ └── test_utils.py
└── uv.lock
```
This project uses uv to manage dependencies, which are defined in the pyproject.toml file. A devcontainer is configured to set up a full-featured development environment, install the required dependencies, and add some useful VS Code extensions. It isolates the tools, libraries, and runtimes needed for working with this codebase and makes VS Code's full feature set available inside the container. The devcontainer requires Docker to be up and running. Using it is recommended to ensure consistency and reproducibility across machines and platforms, but if you prefer not to, you can create a virtual environment and install the Python dependencies by running the following commands from the project root:
```bash
python3.10 -m venv .venv
source .venv/bin/activate
make install
```
The training and deployment pipelines can be run in GitHub Actions. You can also run the following commands in the CLI to execute every step, from generating the raw dataset to pulling the packaged model:
- Import the raw dataset from UCI: `make get_init_data`
- Preprocess data before it is ingested into the feature store: `make prep_data`
- Set up the feature store: `make setup_feast`
- Split the dataset extracted from the feature store: `make split_data`
- Submit the training job: `make train`
- Evaluate models: `make evaluate`
- Test the champion model: `make test_model`
- Pull the containerized model: `docker pull ghcr.io/adeemy/end-to-end-ml:c35fb9610651e155d7a3799644e6ff64c1a5a2db`
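Once the image is pulled and running, new records can be scored with a plain HTTP call. The sketch below assumes the container is serving locally on port 8000 with a `/predict` endpoint; the port, path, and payload schema are assumptions, not the documented contract:

```python
import requests

# Hypothetical feature payload; field names do not reflect the real schema.
record = {"BMI": 28.4, "HighBP": 1, "PhysActivity": 0}

response = requests.post("http://localhost:8000/predict", json=record, timeout=10)
print(response.status_code, response.json())
```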