MailSort AI — Intelligent email classification using machine learning. Automatically categorize emails into Work, Personal, Finance, or Spam with TF-IDF, embeddings, or ensemble models. Privacy-first, runs locally. Python + scikit-learn + XGBoost.

omexzihan/MailSort-AI

📬 MailSort AI

Automatically categorize your emails with intelligent AI classification.



🎯 What is MailSort AI?

MailSort AI is a smart email classifier that automatically sorts incoming messages into categories like Work, Personal, Finance, and Spam.

Privacy-first — runs completely on your machine
Fast — classifies emails in milliseconds
Accurate — uses advanced machine learning models
Easy to use — simple command-line interface
Customizable — train on your own email data


⚡ Quick Start (30 seconds)

1️⃣ Install dependencies

pip install -r requirements.txt

2️⃣ Train the model

python app.py

3️⃣ Classify an email

python app.py --predict "Hey, are you free for coffee later?"

Output:

Input: Hey, are you free for coffee later?
-> Predicted: Personal

That's it! Your classifier is ready.


📖 Full Usage Guide

Basic Classification

Classify a single email:

python app.py --predict "Limited time offer! Buy now!"

Classify multiple emails from a file (one per line):

python app.py --predict-file emails.txt

Train on your own dataset:

python app.py --csv my_emails.csv

Your CSV must have columns: text and category


🚀 Advanced Options

Model Selection

Option 1: Fast TF-IDF + Logistic Regression (default)

python app.py --tune

✅ Fast | ✅ Lightweight | ⚠️ Lower accuracy than the options below

Option 2: Embeddings + XGBoost (more accurate)

python app.py --use-embeddings --random-search

✅ High accuracy | ⚠️ Slower | ⚠️ Requires more memory

Option 3: Stacking Ensemble (best accuracy) ⭐

python app.py --ensemble --use-embeddings --calibrate --resample

✅ Highest accuracy | ✅ Confidence scores | ⚠️ Slower training

Training Enhancements

Handle imbalanced data (upsample minority classes):

python app.py --resample
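
The idea behind upsampling can be sketched in a few lines of plain Python. This is an illustration only, not the code that `--resample` runs: the function name `upsample` and the fixed seed are assumptions, and the real implementation may balance classes differently.

```python
import random
from collections import Counter, defaultdict

def upsample(texts, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(rows) for rows in by_label.values())
    out_texts, out_labels = [], []
    for label, rows in by_label.items():
        # Sample with replacement to fill the gap up to the largest class size.
        extra = rng.choices(rows, k=target - len(rows))
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

texts = ["Meeting at 3 PM", "Sprint review notes", "Project deadline moved", "Invoice #12345"]
labels = ["Work", "Work", "Work", "Finance"]
balanced_texts, balanced_labels = upsample(texts, labels)
print(Counter(balanced_labels))
```

Upsampling should be applied to the training split only. Duplicated rows that leak into the test split would inflate the reported accuracy.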

Calibrate probabilities (for reliable confidence scores):

python app.py --ensemble --calibrate

Plot learning curves (see model improvement):

python app.py --plot-learning-curve

Prediction Options

Require confidence threshold (reject uncertain predictions):

python app.py --predict "Your email here" --min-confidence 0.7
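
A confidence threshold is conceptually a check on the highest class probability. The sketch below assumes the classifier exposes per-class probabilities as a dictionary. The helper `predict_with_threshold` and the `"Uncertain"` fallback label are hypothetical, and the actual output of `--min-confidence` for rejected emails may differ.

```python
def predict_with_threshold(probabilities, min_confidence=0.7, fallback="Uncertain"):
    """Return (label, confidence); the label becomes `fallback` below the threshold."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return fallback, confidence
    return label, confidence

# Hypothetical class probabilities for one email.
probs = {"Work": 0.10, "Personal": 0.62, "Finance": 0.08, "Spam": 0.20}
print(predict_with_threshold(probs, min_confidence=0.7))
print(predict_with_threshold(probs, min_confidence=0.5))
```

A threshold like this is only as meaningful as the probabilities behind it, which is why pairing it with `--calibrate` is worth considering.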

Use custom embedding model:

python app.py --use-embeddings --embed-model "all-MiniLM-L6-v2"

💡 Usage Examples

Example 1: Classify work emails

python app.py --predict "Meeting at 3 PM tomorrow in conference room B"

Example 2: Batch classify from file

# Create emails.txt with one email per line
python app.py --predict-file emails.txt --min-confidence 0.8

Example 3: Train on custom data with best model

python app.py \
  --csv my_labeled_emails.csv \
  --ensemble \
  --use-embeddings \
  --calibrate \
  --resample \
  --save models/my_classifier.joblib

Example 4: Production-ready with high confidence

python app.py \
  --ensemble \
  --use-embeddings \
  --predict "Check out our new product!" \
  --min-confidence 0.85

🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • pip or conda

Step-by-Step

  1. Clone or download this repository

  2. Install Python dependencies:

python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt
  3. If scikit-learn fails to build (on some systems), install with conda instead:
conda install -c conda-forge scikit-learn pandas joblib sentence-transformers xgboost

📊 How It Works

Email Text
    ↓
Preprocessing (clean, lowercase, remove URLs)
    ↓
Feature Extraction (TF-IDF or Embeddings)
    ↓
ML Model (Logistic Regression / XGBoost / Ensemble)
    ↓
Category Prediction (Work / Personal / Finance / Spam)

Available Categories

  • 📧 Work — meetings, projects, deadlines
  • 👥 Personal — friends, family, social
  • 💰 Finance — invoices, receipts, payments
  • ⚠️ Spam — ads, scams, unwanted offers

📋 Command Reference

Command | Purpose
python app.py | Train model and evaluate
python app.py --predict "text" | Classify single email
python app.py --predict-file file.txt | Batch classify
python app.py --use-embeddings | Use embeddings model
python app.py --ensemble | Use stacking ensemble
python app.py --calibrate | Calibrate probabilities
python app.py --resample | Balance classes
python app.py --tune | Hyperparameter tuning
python app.py --csv data.csv | Train on custom data
python app.py --min-confidence 0.8 | Confidence threshold
python app.py --help | Show all options

🎓 Getting Started with Your Own Data

Prepare Your Dataset

Create a CSV file with two columns:

text,category
"Meeting tomorrow at 9am",Work
"Let's grab dinner!",Personal
"Invoice #12345",Finance
"YOU WON FREE MONEY!!!",Spam

Train Your Model

python app.py --csv my_emails.csv --ensemble --use-embeddings

Evaluate Results

After training, the script automatically prints the accuracy and a classification report.


⚙️ Model Comparison

Model | Speed | Accuracy | Memory | Best For
TF-IDF + LogReg | ⚡⚡⚡ | ⭐⭐ | 💾 | Quick prototyping
Embeddings + XGB | ⚡⚡ | ⭐⭐⭐ | 💾💾 | Good balance
Stacking Ensemble | ⚡ | ⭐⭐⭐⭐ | 💾💾💾 | Production use

🚨 Troubleshooting

Q: Model training is slow
A: Use the default TF-IDF model (python app.py, optionally with --tune) instead of --ensemble --use-embeddings

Q: Low accuracy on predictions
A: Train with more labeled examples, and check that the labels are correct. Accurate labels matter more than sheer volume.

Q: "AttributeError: 'list' object has no attribute 'apply'"
A: Update to the latest version of this repository

Q: sklearn/sentence-transformers won't install
A: Use conda: conda install -c conda-forge scikit-learn sentence-transformers xgboost


📝 Dataset Format

Your CSV file must have exactly two columns:

Column | Type | Example
text | string | "Meeting tomorrow"
category | string | "Work"
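
Before training, it can help to check a file against the two-column format described above. The following sketch uses only the standard library. The function `check_dataset` is hypothetical and not part of `app.py`, and treating an empty `text` or `category` cell as an error is an assumption about what the trainer expects.

```python
import csv
import io

REQUIRED_COLUMNS = ("text", "category")

def check_dataset(handle):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for row in reader:
        # reader.line_num is the source line on which the current row ended.
        if not (row["text"] or "").strip():
            problems.append(f"line {reader.line_num}: empty text")
        if not (row["category"] or "").strip():
            problems.append(f"line {reader.line_num}: empty category")
    return problems

sample = io.StringIO(
    'text,category\n'
    '"Meeting tomorrow at 9am",Work\n'
    '"Invoice #12345",Finance\n'
)
print(check_dataset(sample))
```

For a file on disk, pass `open("my_emails.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.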

💪 Features

Multiple Models

  • Traditional TF-IDF with Logistic Regression
  • Modern embeddings with XGBoost
  • Stacking ensemble (best accuracy)
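
To make the TF-IDF option more concrete, here is a dependency-free sketch of how TF-IDF weights can be computed. It uses the smoothed IDF formula and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default, but it is a teaching aid rather than the project's feature extractor. The function name `tfidf` and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["Invoice attached for payment", "Payment reminder", "Coffee later today"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents receive higher weights, so "invoice" outweighs "payment" in the first sentence. A linear model such as Logistic Regression can then learn which high-weight terms signal each category.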

🎯 Smart Preprocessing

  • Automatic URL removal
  • Email address cleaning
  • Stop-word removal
  • Punctuation normalization
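
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch under stated assumptions: the regular expressions, the tiny `STOP_WORDS` set, and the function name `preprocess` are all illustrative, and the project's actual cleaning rules may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
# Deliberately tiny, hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "for", "is", "are"}

def preprocess(text):
    """Lowercase, strip URLs and email addresses, drop punctuation and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = preprocess("Limited time OFFER!!! Visit https://example.com or mail sales@example.com")
print(cleaned)  # limited time offer visit mail
```

URLs and addresses are removed before punctuation so that their fragments (such as "https" or "com") do not survive as separate tokens.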

🔧 Advanced Tuning

  • Hyperparameter optimization (GridSearch/RandomSearch)
  • Class rebalancing (upsampling)
  • Probability calibration
  • Learning curve plotting

🛡️ Production-Ready

  • Confidence thresholds
  • Model persistence (save/load)
  • StratifiedKFold cross-validation
  • Detailed classification reports

📚 Requirements

All dependencies are listed in requirements.txt:

scikit-learn>=1.0
pandas
joblib
numpy
matplotlib
sentence-transformers
xgboost

📄 License

This project is licensed under the MIT License — see LICENSE file for details.


🤝 Contributing

Found a bug or have an idea? Feel free to open an issue or submit a pull request!


📞 Support

For questions or issues, check the troubleshooting section above or create an issue on GitHub.


Happy classifying! 🚀📬
