amorsi1/Embedded-spam-MLOps

Spam checking using vector embeddings

Filtering spam mail is an arms race between hackers and security teams, but most spam emails in my inbox are really obvious to spot. Using vectors to represent the text content of the emails, I found that even a really simple classifier could get an F1 score of 0.97 on a clean dataset.
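For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed from true and predicted labels. The labels below are made up for illustration and are not results from this project; the function name `f1_score` is also just a name chosen for this sketch (scikit-learn offers an equivalent under `sklearn.metrics`).

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(round(score, 3))  # 3 TP, 1 FP, 1 FN -> precision = recall = 0.75 -> 0.75
```

Because F1 combines precision and recall, a score near 1 means the classifier rarely misses spam and rarely flags legitimate mail.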

As part of the MLOps-zoomcamp course, I've developed this into an end-to-end cloud pipeline that lets you simply type in some text and classify whether or not it's spam.

Try it out!

Link (server may take some time to start up)

MLOPs-spam-clasifier-demo

Architecture

MLOps spam_ham architecture

Installation instructions

Clone the repo and install the Python packages:

git clone https://github.com/amorsi1/MLOps_spam_classifier Embedded-spam-MLOps
pip install pipenv
cd Embedded-spam-MLOps
pipenv install --dev

A .env file is used to centralize environment variables. Before running any code locally, create this file and populate it with the following variables:

MLFLOW_TRACKING_URI=http://mlflow-server:8080 
MLFLOW_EXPERIMENT_NAME=spam-classifier
MLFLOW_MODEL_NAME=lr-model
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
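As a rough illustration of how these values might be consumed, the sketch below parses `.env`-style text with the standard library and checks that all four variables are present. This is an assumption about usage, not the project's actual loading code (which may rely on docker-compose or a package such as python-dotenv); `parse_env`, `load_settings`, and `REQUIRED` are hypothetical names.

```python
import os

# The four variable names listed above.
REQUIRED = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_EXPERIMENT_NAME",
    "MLFLOW_MODEL_NAME",
    "EMBEDDING_MODEL_NAME",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip stray whitespace, e.g. a trailing space after a URL.
        values[key.strip()] = value.strip()
    return values


def load_settings(text, environ=None):
    """Merge file values with the process environment (environment wins)."""
    environ = os.environ if environ is None else environ
    file_values = parse_env(text)
    settings = {name: environ.get(name, file_values.get(name)) for name in REQUIRED}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return settings


# Example input mirroring the variables above (note the trailing space).
example = "\n".join([
    "MLFLOW_TRACKING_URI=http://mlflow-server:8080 ",
    "MLFLOW_EXPERIMENT_NAME=spam-classifier",
    "MLFLOW_MODEL_NAME=lr-model",
    "EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2",
])

settings = load_settings(example, environ={})
print(settings["MLFLOW_TRACKING_URI"])
```

Failing fast on a missing variable tends to be easier to debug than a connection error that surfaces later inside a container.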

With that set, you can download and preprocess the Kaggle data, then run the training container, MLflow server container, and webapp container using docker-compose.

NOTE: Data preprocessing will take 20-30 minutes on an average laptop because it front-loads all of the text embedding. Fortunately, you will only need to do this once.

make download_data
docker-compose up --build

Alternatively, use the Makefile to do the same (the Makefile has additional testing capabilities):

make build
make up

Dataset

Model training used a subset of the data in this Phishing email dataset from Kaggle: the two highest-quality datasets in it were combined.

Testing

Unit and integration tests are run with pytest. Note that the integration tests will fail if you don't have a Docker daemon running.
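If you would rather skip the integration tests than see them fail on a machine without Docker, one option is to probe for the daemon first. The helper below is a hypothetical sketch, not part of this repo; it assumes the `docker` CLI is on your PATH and relies on `docker info` exiting with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_daemon_available(timeout=5.0):
    """Return True if the docker CLI exists and can reach a running daemon."""
    if shutil.which("docker") is None:
        return False  # CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # CLI could not be executed, or the daemon did not answer in time
    return result.returncode == 0


print("Docker daemon available:", docker_daemon_available())
```

In a pytest suite, a check like this could back a `pytest.mark.skipif(not docker_daemon_available(), reason="Docker daemon not running")` marker on the integration tests.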
