This project showcases comprehensive data and modeling pipelines that incorporate key MLOps principles: data collection, model experimentation and tracking, a model registry, workflow orchestration, model deployment, and monitoring.
Many spam emails still manage to get into inboxes daily. Manually reporting each one as spam is repetitive and time-wasting. The goal is to build a spam detection system that automates this process by treating it as a classification problem.
The order of the pipeline is as follows:
- Data Collection
- Model Experimentation and Tracking, and Orchestration
- Model Deployment
- Monitoring and Orchestration
Instead of building an API to fetch emails from my email client, I decided to simulate it using Deysi/spam-detection-dataset. The dataset consists of 8,180 training samples and 2,730 test samples. Small subsets of the training and test datasets are used to train and test the model, respectively. The test subset also acts as the reference data for data drift monitoring. Whenever unseen samples are needed, they are randomly drawn from the training dataset.
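For reference, a minimal sketch of loading the dataset with the Hugging Face datasets library; the column names are assumptions based on the dataset card:

```python
from datasets import load_dataset

# Load Deysi/spam-detection-dataset from the Hugging Face Hub.
ds = load_dataset("Deysi/spam-detection-dataset")

# Small subsets simulate the limited data used for training and testing.
train_subset = ds["train"].shuffle(seed=42).select(range(1000))
test_subset = ds["test"].shuffle(seed=42).select(range(300))

# Randomly draw "unseen" samples from the training split.
unseen = ds["train"].shuffle().select(range(100))
print(train_subset[0])  # e.g. {'text': ..., 'label': ...}
```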
I first experimented with the solution in a Jupyter notebook, starter.ipynb, then refactored the notebook code into training/training.py. Amazon EC2, Amazon RDS, and Amazon S3 are set up to host the MLflow tracking server, store the MLflow metadata, and store the artifacts, respectively.
Once an Amazon EC2 instance is running with MLflow installed, run this command with all the variables substituted:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://postgres:<password>@<aws_rds_hostname>:5432/mlflow_db --default-artifact-root s3://<s3_bucket_name>
The MLflow UI will be available at http://<ec2_public_address>:5000
Once the MLflow tracking server is ready, the training code can be run. The overview of training/training.py is as follows (a condensed sketch appears after the list):
- Initialize MLFlow tracking URI and experiment name
- Get the training and test datasets
- Data preprocessing
- Model training and hyperparameter tuning
- Model registry staging
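A condensed sketch of these steps, assuming a TF-IDF plus logistic regression baseline; the actual models, features, and hyperparameter search live in training/training.py:

```python
import mlflow
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Point the client at the EC2-hosted tracking server (substitute the address).
mlflow.set_tracking_uri("http://<ec2_public_address>:5000")
mlflow.set_experiment("spam-detection")

def train(train_texts, train_labels, test_texts, test_labels, C=1.0):
    with mlflow.start_run():
        pipe = Pipeline([("tfidf", TfidfVectorizer()),
                         ("clf", LogisticRegression(C=C, max_iter=1000))])
        pipe.fit(train_texts, train_labels)
        # Label name "spam" is assumed from the dataset card.
        f1 = f1_score(test_labels, pipe.predict(test_texts), pos_label="spam")
        mlflow.log_param("C", C)
        mlflow.log_metric("f1", f1)
        # Log the model artifact and register it in one step.
        mlflow.sklearn.log_model(pipe, artifact_path="model",
                                 registered_model_name="spam-detector")
```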
To ease the modelling process, the training flow is deployed to Prefect Cloud. For Prefect version 2.19.9, these are the steps to create a deployment:
- Make sure terminal directory is at root.
- Run prefect init and follow the interactive prompts (I chose the local recipe)
- Run prefect deploy and follow the interactive prompts
- Run prefect worker start --pool 'mlops-capstone' to start a worker
Alternatively, to run the flow against a local Prefect server:
- Make sure the terminal directory is at the project root.
- Run prefect server start; the Prefect server will be available at port 4200
- Run python training/training.py to see the workflow orchestration (the flow structure is sketched below)
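For orchestration, the training steps are wrapped as Prefect tasks inside a flow. A stripped-down sketch with illustrative names and task bodies omitted:

```python
from prefect import flow, task

@task
def get_datasets():
    ...  # fetch the train/test subsets

@task
def preprocess(datasets):
    ...  # clean and vectorize the text

@task
def train_and_register(features):
    ...  # MLflow runs, hyperparameter tuning, registry staging

@flow(log_prints=True)
def spam_detector_training():
    datasets = get_datasets()
    features = preprocess(datasets)
    train_and_register(features)

if __name__ == "__main__":
    spam_detector_training()
```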
At the end of the run, a model will be staged in the MLflow Model Registry.
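Staging can be done through the MLflow client; a sketch assuming the registered model is named spam-detector:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://<ec2_public_address>:5000")

# Promote the newest registered version to the Staging stage.
latest = client.get_latest_versions("spam-detector", stages=["None"])[0]
client.transition_model_version_stage(name="spam-detector",
                                      version=latest.version,
                                      stage="Staging")
```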
In this section, I created a prediction script, predict.py, with the following steps (a sketch of the script follows the list):
- Get unseen data, simulated by getting randomly from the training dataset
- Preprocess unseen data
- Initialize MLFlow tracking URI
- Load production model from MLFlow Model Registry
- Predict on the unseen data
- Write results to parquet, to simulate passing the results to an email client API
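A minimal sketch of these steps; model name, bucket layout, and subset sizes are illustrative:

```python
from datetime import date

import mlflow
import pandas as pd
from datasets import load_dataset

mlflow.set_tracking_uri("http://<ec2_public_address>:5000")

# Load the model currently in the Production stage.
model = mlflow.pyfunc.load_model("models:/spam-detector/Production")

# Simulate unseen emails with a random draw from the training split.
unseen = (load_dataset("Deysi/spam-detection-dataset", split="train")
          .shuffle().select(range(100)))
df = pd.DataFrame({"text": unseen["text"]})
df["prediction"] = model.predict(df["text"])

# Write the results to S3 (requires s3fs); the key prefix encodes the date.
df.to_parquet(
    f"s3://<s3_bucket_name>/{date.today():%Y-%m-%d}/spam_detection.parquet")
```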
The prediction script is containerized using a Dockerfile, which can be hosted on Amazon ECS.
Steps to build and run the container:
- Make sure terminal directory is at root.
- Run docker build -t spam-detection-predict:v1 -f .\deployment\Dockerfile .
- Run docker run -it -e AWS_ACCESS_KEY_ID=<XXX> -e AWS_SECRET_ACCESS_KEY=<XXX> spam-detection-predict:v1 <MLFLOW_TRACKING_URL> with all the variables substituted.
Once the run is successful, you should see a spam_detection.parquet file in the S3 bucket, with the current date as part of the key prefix.
The monitoring section leverages the Evidently AI library. I have mainly modified evidently_metrics_calculation.py from the course to fit my use case. The steps of the script are as follows (a sketch of the table-preparation step appears after the list):
- Prepare PostgreSQL database and table
- Initialize MLFlow tracking URI
- Load production model from MLFlow Model Registry
- Get unseen data, simulated by getting randomly from the training dataset
- Preprocess unseen data
- Predict on the unseen data
- Get reference data from MLFlow production model run
- Calculate drift metrics at 5 random time intervals
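As an example of the first step, table preparation might look like this (psycopg 3; credentials and column names are illustrative, and the table name matches the embedding_drift_metrics table referenced below):

```python
import psycopg

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS embedding_drift_metrics (
    run_timestamp TIMESTAMP,
    model_drift_detected BOOLEAN,
    mmd_drift_detected BOOLEAN,
    cosine_drift_detected BOOLEAN
);
"""

# Connection settings are placeholders for the docker-compose stack below.
with psycopg.connect("host=localhost port=5432 user=postgres "
                     "password=example") as conn:
    conn.execute(CREATE_TABLE)
```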
From the root directory, run docker compose -f .\monitoring\docker-compose.yml up --build to prepare PostgreSQL, Adminer, and Grafana. This stack can also be deployed to Amazon ECS.
Since the text embeddings play a large role in the model's performance, three embedding drift metrics are used, namely with the methods: classifier model, maximum mean discrepancy, and cosine distance. This is based on a good blog writeup by Evidently AI. Drift is considered to have occurred when at least two of the three methods detect it. When a drift is detected, a flow from the model training deployment Spam Detector Capstone/mlops-capstone-spam-detector is automatically run to retrain the model, as sketched below.
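A sketch of the drift check and retrain trigger, assuming the Evidently 0.4.x embedding drift API and Prefect's run_deployment; the dataframes here are synthetic stand-ins for the reference and current embeddings:

```python
import numpy as np
import pandas as pd
from evidently import ColumnMapping
from evidently.metrics import EmbeddingsDriftMetric
from evidently.metrics.data_drift.embedding_drift_methods import (
    distance, mmd, model)
from evidently.report import Report
from prefect.deployments import run_deployment

# Synthetic embeddings standing in for the reference/current data.
emb_cols = [f"dim_{i}" for i in range(16)]
rng = np.random.default_rng(0)
reference = pd.DataFrame(rng.normal(size=(200, 16)), columns=emb_cols)
current = pd.DataFrame(rng.normal(0.5, 1.0, size=(200, 16)), columns=emb_cols)

mapping = ColumnMapping(embeddings={"email": emb_cols})
report = Report(metrics=[
    EmbeddingsDriftMetric("email", drift_method=model()),
    EmbeddingsDriftMetric("email", drift_method=mmd()),
    EmbeddingsDriftMetric("email", drift_method=distance(dist="cosine")),
])
report.run(reference_data=reference, current_data=current,
           column_mapping=mapping)

# Majority vote: retrain when at least two of the three methods flag drift.
flags = [m["result"]["drift_detected"] for m in report.as_dict()["metrics"]]
if sum(flags) >= 2:
    run_deployment(name="Spam Detector Capstone/mlops-capstone-spam-detector")
```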
The monitoring script has also been deployed to Prefect Cloud with the same steps as described earlier. A flow run can be executed by running python monitoring/evidently_metrics_calculation.py.
Once everything has run successfully, you can log in to Adminer at http://localhost:8081/ and should see some data in the embedding_drift_metrics table.
The Grafana dashboard can be accessed at http://localhost:3001/.
The following have been developed:
- Unit tests (a minimal example follows this list)
- Integration tests
- Auto code formatting using Black
- Makefile
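For illustration, a unit test in the pytest style used here might look like this; preprocess below is a self-contained stand-in for the project's real preprocessing helper:

```python
# tests/test_preprocess.py
def preprocess(text: str) -> str:
    """Stand-in for the project's text preprocessing helper."""
    return text.strip().lower()

def test_preprocess_normalizes_text():
    assert preprocess("  FREE Money!! ") == "free money!!"

def test_preprocess_handles_empty_string():
    assert preprocess("") == ""
```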