Spam checking using vector embeddings

Filtering spam mail is an arms race between hackers and security teams, but most spam emails in my inbox are really obvious to spot. Using vectors to represent the text content of the emails, I found that even a really simple classifier could get an F1 score of 0.97 on a clean dataset.
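For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall for the spam class. The sketch below shows the arithmetic in plain Python. The `f1_score` helper and the toy labels are made up for illustration; they are not the project's evaluation code or data.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score for the `positive` class (0.0 if there are no true positives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")  # 3 TP, 1 FP, 1 FN -> precision = recall = 0.75
```

A score of 0.97 therefore means that both missed spam and false alarms are rare on the test split.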

As part of the MLOps-zoomcamp course, I've developed this into an end-to-end cloud pipeline that lets you simply type in some text and classify whether or not it's spam.

Try it out!

Link (server may take some time to start up)

Architecture

Installation instructions

Clone the repo and install the Python packages

git clone https://github.com/amorsi1/MLOps_spam_classifier Embedded-spam-MLOps
pip install pipenv
cd Embedded-spam-MLOps
pipenv install --dev

A .env file is used to centralize environment variables. Before running any code locally, make sure to create this file and populate it with the following variables:

MLFLOW_TRACKING_URI=http://mlflow-server:8080
MLFLOW_EXPERIMENT_NAME=spam-classifier
MLFLOW_MODEL_NAME=lr-model
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
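A missing variable tends to surface late, for example only when the training code first contacts MLflow. A minimal sketch of an early check is shown here. The `missing_variables` helper is hypothetical and not part of the repo; it only assumes that the four names above end up in the process environment.

```python
import os

# The four names listed in the .env example above.
REQUIRED = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_EXPERIMENT_NAME",
    "MLFLOW_MODEL_NAME",
    "EMBEDDING_MODEL_NAME",
)


def missing_variables(environ=None):
    """Return the required names that are unset or blank in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example with a deliberately incomplete environment.
example_env = {
    "MLFLOW_TRACKING_URI": "http://mlflow-server:8080",
    "MLFLOW_EXPERIMENT_NAME": "spam-classifier",
}
print(missing_variables(example_env))
```

Calling `missing_variables()` with no argument checks the real environment of the current process.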

With that set, you can download and preprocess the Kaggle data, then run the training, MLflow server, and webapp containers using docker-compose.

NOTE: data preprocessing will take 20-30 minutes on an average laptop because it front-loads all of the text embedding. Fortunately, you will only need to do this once.

make download_data
docker-compose up --build

Or use the Makefile to do the same (the Makefile also has additional testing capabilities):

make build
make up
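The preprocessing step is a one-time cost because the embeddings are computed up front and then reused. One way to get that behaviour is to cache vectors on disk keyed by a hash of the text. The following is a sketch of that idea under stated assumptions, not the repo's actual preprocessing code: `embed_with_cache`, `fake_embed`, and the JSON cache file are all hypothetical, and the embedding function is passed in so the example runs without downloading a model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def embed_with_cache(texts, embed_fn, cache_path):
    """Embed `texts`, reusing vectors already stored in the JSON file at `cache_path`."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Only texts not seen before are sent to the (slow) embedding function.
    new = {k: t for k, t in zip(keys, texts) if k not in cache}
    if new:
        vectors = embed_fn(list(new.values()))
        for key, vec in zip(new, vectors):
            cache[key] = [float(x) for x in vec]
        cache_file.write_text(json.dumps(cache))
    return [cache[k] for k in keys]


calls = []  # records the batch size of every real embedding call


def fake_embed(batch):
    """Stand-in embedder for the demo: two toy features per text."""
    calls.append(len(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    emails = ["win a prize", "see you at 5"]
    first = embed_with_cache(emails, fake_embed, path)
    second = embed_with_cache(emails, fake_embed, path)  # served from the cache

print(calls)  # [2]: the embedder ran once, for both texts
```

With the sentence-transformers library installed, the `encode` method of a `SentenceTransformer("all-MiniLM-L6-v2")` instance could be passed as `embed_fn` in place of `fake_embed`.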

Dataset

The model was trained on a subset of this Phishing email dataset from Kaggle. The two highest-quality datasets in that collection were combined and used.

Testing

Unit and integration tests are run with pytest. Note that the integration tests will fail if you don't have a Docker daemon running.
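If you would rather have the integration tests skipped than fail when Docker is unavailable, one option is to probe the daemon first. The helper below is a hypothetical sketch, not something the repo ships. It assumes the `docker` CLI is on your PATH and relies on `docker info` exiting with a non-zero status when no daemon is reachable.

```python
import shutil
import subprocess


def docker_daemon_running(timeout=10):
    """Return True if `docker info` succeeds, i.e. the CLI can reach a daemon."""
    if shutil.which("docker") is None:
        return False  # the docker CLI is not installed
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker daemon reachable:", docker_daemon_running())
```

In a test module, the result could feed `pytest.mark.skipif(not docker_daemon_running(), reason="Docker daemon not running")` on the integration tests.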
