The purpose of the project is to create a bikeshare rental prediction engine that helps Washington DC's government understand bikeshare rental patterns and decide which regions to focus spending on to improve bikeshare rental infrastructure.
The analytical team wants to assist the government by building a rental prediction engine (a batch service application) that scores past and future bike rental data in Washington. The engine will help the government identify past and emerging trends as the ways in which people use the service evolve over time. The insights will enable the government to provide better infrastructure for the bike paths in Washington DC. The engine delivers on this promise by learning from rides that happened in the recent past and then predicting the estimated duration, in minutes, of each new ride. Highly disparate predictions can be used to surface infrastructure problems and help hone in on specific regions requiring attention.
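To make the idea concrete, here is a minimal sketch of batch duration prediction; the model choice, column names and values are illustrative assumptions, not the project's actual code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical sketch: learn from last month's rides, score this month's.
# Column names are illustrative, not the project's actual schema.
past = pd.DataFrame({
    "distance_km": [1.2, 3.4, 0.8, 5.1],
    "hour_of_day": [8, 17, 12, 9],
    "duration_min": [6.0, 18.5, 4.2, 27.3],  # target: ride time in minutes
})
new_rides = pd.DataFrame({"distance_km": [2.0, 4.5], "hour_of_day": [8, 18]})

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(past[["distance_km", "hour_of_day"]], past["duration_min"])

# Predicted durations in minutes; large gaps between predicted and actual
# durations can flag regions with possible infrastructure problems.
print(model.predict(new_rides))
```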
- Data Company - The data comes from the Capital Bikeshare company
- Datasets - The data is collected and made available for consumption on a monthly basis
This section covers the tools used to run the project:
- MLflow for experiment tracking and model versioning
- MLflow for model registry and management (see the tracking and registry sketch after this list)
- Prefect for workflow orchestration of the model training, monitoring and ride scoring pipelines
- Prefect for scheduling pipelines
- Evidently for monitoring model, data and feature drift
- Terraform for provisioning the infrastructure that stores MLflow artifacts and Prefect block storage, plus the EC2 infrastructure for model training, monitoring and scoring bike rides
- GitHub Actions for continuous integration and deployment workflows
- AWS as the cloud provider
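For context, the MLflow tracking and registry interaction referenced above looks roughly like this; the tracking URI, experiment name and registered model name are assumptions for illustration:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Assumed local tracking server address; adjust to your setup.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("bikeshare-duration")  # hypothetical experiment name

# Toy model standing in for the project's trained regressor
model = LinearRegression().fit([[1.0], [2.0], [3.0]], [5.0, 9.0, 14.0])

with mlflow.start_run():
    mlflow.log_metric("rmse", 5.3)  # illustrative metric value
    # Log the fitted model and register it in the model registry in one step
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="bikeshare-duration-model",  # hypothetical name
    )
```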
- .github - Contains logic and files for GitHub Actions
- deploy - Contains logic and files for infrastructure as code
- model-deployment - Contains logic and files for scoring bikeshare rentals in batch mode, plus the unit and integration tests associated with the scoring logic
- training-and-monitoring - Contains logic for model training and retraining on a schedule, as well as logic for monitoring model performance and drift on a schedule, with added logic to conditionally retrain the model when drift is detected
- Prerequisites:
  - Create an AWS IAM role with permissions to create buckets and EC2 instances
  - Add programmatic access to the role and store the access key and secret access key securely
  - Initialise Terraform with the IAM role
  - Have Anaconda and Python 3.9+ installed
  - Have git installed for cloning the repo
For provisioning with Terraform, please check the Terraform provisioning link below.
The steps to run locally or on an EC2 instance are the same, but the infrastructure must be provisioned first:
- Run `make apply_stage_local` from the root directory to provision buckets, IAM and EC2
- Run `make create_monitoring_stage` from the root directory to create Prefect flows, deployments and block storage for training and monitoring
- (Optional) If you want to run things on an EC2 instance, please follow this link: Environment setup and ssh
- Run `make setup` for the preliminary setup of the local or EC2 environment
- Run `make start_mlflow_stage` in the root directory to start the MLflow server
- In a separate terminal, ensure your AWS profile is activated (not required if you are on EC2)
To train the model:

- Run `cd training-and-monitoring && pipenv install --dev` and then run `pipenv shell`
- Download the data by running `python get_data.py`
- Run `cd prefect_training_monitoring` and then run `python model_training.py`
- To deploy training and run it on a schedule, run `bash run_training.sh` (a sketch of the flow shape follows this list)
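As referenced above, a Prefect-orchestrated training flow has roughly this shape (a minimal sketch using Prefect 2's `flow`/`task` decorators; task bodies and names are placeholders, not the repository's actual pipeline):

```python
from prefect import flow, task

@task
def load_data():
    # Placeholder: the real pipeline reads the downloaded Capital Bikeshare files
    return [[1.0], [2.0], [3.0]], [5.0, 9.0, 14.0]

@task
def train(X, y):
    from sklearn.linear_model import LinearRegression
    return LinearRegression().fit(X, y)

@flow
def model_training():
    X, y = load_data()
    model = train(X, y)
    # The real flow would log and register the model with MLflow here
    return model

if __name__ == "__main__":
    model_training()
```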
To score rides in batch mode:

- In the root directory run `make create_scoring_stage`
- In the root directory start MLflow by running `make start_mlflow_stage`
- Run `cd model-deployment && pipenv install --dev` and then run `pipenv shell`
- Run `cd prefect_deployment` and then run `python score.py 2022 05` (a sketch of the implied CLI follows this list)
- To deploy scoring and keep it running on a schedule, on EC2 or locally, run `bash run.sh`
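The `python score.py 2022 05` invocation implies the script takes a year and month as positional arguments; here is a hedged sketch of that CLI shape, with illustrative internals:

```python
import sys

def score_rides(year: int, month: int) -> None:
    # Placeholder: the real script loads the registered model from MLflow
    # and scores that month's Capital Bikeshare rides in batch.
    print(f"Scoring rides for {year:04d}-{month:02d}")

if __name__ == "__main__":
    year, month = int(sys.argv[1]), int(sys.argv[2])
    score_rides(year, month)
```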
To monitor the model, data and feature drift:

- In the root directory start MLflow by running `make start_mlflow_stage`
- Run `cd training-and-monitoring && pipenv install --dev` and then run `pipenv shell`
- Run `cd prefect_training_monitoring` and then run `python monitoring.py` (a sketch of the Evidently pattern follows this list)
- To deploy monitoring and run it on a schedule, run `bash run_monitoring.sh`
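As referenced above, drift detection with Evidently typically follows this pattern (a sketch assuming a recent Evidently `Report` API; the DataFrames and column names are illustrative):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Illustrative frames: last month's rides as reference, this month's as current
reference = pd.DataFrame({"duration_min": [6.0, 18.5, 4.2], "distance_km": [1.2, 3.4, 0.8]})
current = pd.DataFrame({"duration_min": [9.0, 25.0, 31.0], "distance_km": [1.5, 3.2, 6.0]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")

# A conditional retrain (as this project does) would inspect report.as_dict()
# for the drift flag and trigger the training flow when drift is detected.
```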
To run the tests (an illustrative unit test follows this list):

- Run `cd model-deployment`
- To run unit tests, run `make quality_checks`
- To run integration tests, run `make integration_tests`
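As an illustration of the kind of check `make quality_checks` might exercise, here is a hypothetical pytest-style unit test; the helper function is a stand-in, not the project's actual scoring code:

```python
# test_scoring.py - hypothetical example of a unit test for scoring logic
def ride_duration_minutes(start_ts: float, end_ts: float) -> float:
    """Toy helper standing in for the project's feature-preparation code."""
    return (end_ts - start_ts) / 60.0

def test_ride_duration_minutes():
    assert ride_duration_minutes(0.0, 600.0) == 10.0
```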
To plan, apply or destroy infrastructure, run the following from the root directory of the repo:

- Plan - for stage: `make plan_stage`; for prod: `make plan_prod`
- Apply - for stage: `make apply_stage_local`; for prod: `make apply_prod_local`
- Destroy - for stage: `make destroy_stage_local`; for prod: `make destroy_prod_local`
Nakul Bajaj @Nakulbajaj101