📬 MailSort AI

Automatically categorize your emails with intelligent AI classification.

🎯 What is MailSort AI?

MailSort AI is a smart email classifier that automatically sorts incoming messages into categories like Work, Personal, Finance, and Spam.

✅ Privacy-first — runs completely on your machine
✅ Fast — classifies emails in milliseconds
✅ Accurate — uses advanced machine learning models
✅ Easy to use — simple command-line interface
✅ Customizable — train on your own email data

⚡ Quick Start (30 seconds)

1️⃣ Install dependencies

pip install -r requirements.txt

2️⃣ Train the model

python app.py

3️⃣ Classify an email

python app.py --predict "Hey, are you free for coffee later?"

Output:

Input: Hey, are you free for coffee later?
-> Predicted: Personal

✨ That's it! Your classifier is ready.

📖 Full Usage Guide

Basic Classification

Classify a single email:

python app.py --predict "Limited time offer! Buy now!"

Classify multiple emails from a file (one per line):

python app.py --predict-file emails.txt

Train on your own dataset:

python app.py --csv my_emails.csv

Your CSV must have columns: text and category
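Checking the file before training can save a failed run. The sketch below is one possible check, not the validation app.py performs: `load_labeled_rows` and `REQUIRED_COLUMNS` are hypothetical names, and rejecting rows with an empty text or category is an assumption.

```python
import csv
import io
from typing import List, Tuple

REQUIRED_COLUMNS = ("text", "category")


def load_labeled_rows(csv_text: str) -> List[Tuple[str, str]]:
    """Parse CSV text into (text, category) pairs, rejecting malformed input."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")

    rows = []
    for record in reader:
        text = (record["text"] or "").strip()
        category = (record["category"] or "").strip()
        # Assumption: a row with an empty text or category is an error.
        if not text or not category:
            raise ValueError(f"empty text or category near line {reader.line_num}")
        rows.append((text, category))
    return rows
```

To check a file on disk, pass its contents, for example `load_labeled_rows(open("my_emails.csv", encoding="utf-8").read())`.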

🚀 Advanced Options

Model Selection

Option 1: Fast TF-IDF + Logistic Regression (default model; --tune adds hyperparameter tuning)

python app.py --tune

✅ Fast | ✅ Lightweight | ⚠️ Good accuracy
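For readers curious what a TF-IDF + Logistic Regression classifier looks like in scikit-learn, here is a minimal sketch. It is not app.py's code: the toy training data is invented for illustration, and the vectorizer and model settings are assumptions rather than the project's defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
texts = [
    "Meeting tomorrow at 9am",
    "Project deadline moved to Friday",
    "Let's grab dinner!",
    "Happy birthday, see you at the party",
    "Invoice #12345 is attached",
    "Your payment receipt",
    "YOU WON FREE MONEY!!!",
    "Limited time offer! Buy now!",
]
labels = ["Work", "Work", "Personal", "Personal",
          "Finance", "Finance", "Spam", "Spam"]

# TF-IDF turns each text into a sparse, weighted bag-of-words vector;
# logistic regression then fits a linear classifier on those vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

email = "Are you free for dinner later?"
predicted = model.predict([email])[0]
# Map each class name to its predicted probability.
probabilities = dict(zip(model.classes_, model.predict_proba([email])[0]))
print(predicted, probabilities)
```

With only eight training rows the prediction itself is not meaningful; the point is the shape of the pipeline.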

Option 2: Embeddings + XGBoost (more accurate)

python app.py --use-embeddings --random-search

✅ High accuracy | ⚠️ Slower | ⚠️ Requires more memory

Option 3: Stacking Ensemble (best accuracy) ⭐

python app.py --ensemble --use-embeddings --calibrate --resample

✅ Highest accuracy | ✅ Confidence scores | ⚠️ Slower training

Training Enhancements

Handle imbalanced data (upsample minority classes):

python app.py --resample
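Upsampling means repeating rows from the smaller classes until every class is as large as the biggest one. The sketch below shows that idea with the standard library only. `upsample` is a hypothetical helper, and sampling with replacement under a fixed seed is an assumption about how --resample behaves.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def upsample(rows: List[Tuple[str, str]], seed: int = 42) -> List[Tuple[str, str]]:
    """Repeat random minority-class rows until all classes match the largest."""
    if not rows:
        return []
    rng = random.Random(seed)

    # Group the (text, label) rows by label.
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Upsampling should be applied to the training split only, so duplicated rows do not leak into the evaluation data.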

Calibrate probabilities (for reliable confidence scores):

python app.py --ensemble --calibrate

Plot learning curves (see model improvement):

python app.py --plot-learning-curve

Prediction Options

Require confidence threshold (reject uncertain predictions):

python app.py --predict "Your email here" --min-confidence 0.7
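A confidence threshold keeps the top prediction only when its probability is high enough. The sketch below illustrates that rule. `apply_threshold` is a hypothetical helper; returning `None` for a rejected prediction and using `>=` for the comparison are assumptions, since this README does not say how app.py reports rejections.

```python
from typing import Dict, Optional


def apply_threshold(probabilities: Dict[str, float],
                    min_confidence: float) -> Optional[str]:
    """Return the most likely label, or None if it falls below the threshold."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= min_confidence else None
```

A threshold like this is only as trustworthy as the probabilities behind it, which is why it pairs well with --calibrate.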

Use custom embedding model:

python app.py --use-embeddings --embed-model "all-MiniLM-L6-v2"

💡 Usage Examples

Example 1: Classify work emails

python app.py --predict "Meeting at 3 PM tomorrow in conference room B"

Example 2: Batch classify from file

# Create emails.txt with one email per line
python app.py --predict-file emails.txt --min-confidence 0.8

Example 3: Train on custom data with best model

python app.py \
  --csv my_labeled_emails.csv \
  --ensemble \
  --use-embeddings \
  --calibrate \
  --resample \
  --save models/my_classifier.joblib

Example 4: Production-ready with high confidence

python app.py \
  --ensemble \
  --use-embeddings \
  --predict "Check out our new product!" \
  --min-confidence 0.85

🛠️ Installation

Prerequisites

Python 3.8 or higher
pip or conda

Step-by-Step

Clone or download this repository
Install Python dependencies:

python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt

If scikit-learn fails to build (on some systems):

conda install -c conda-forge scikit-learn pandas joblib sentence-transformers xgboost

📊 How It Works

Email Text
    ↓
Preprocessing (clean, lowercase, remove URLs)
    ↓
Feature Extraction (TF-IDF or Embeddings)
    ↓
ML Model (Logistic Regression / XGBoost / Ensemble)
    ↓
Category Prediction (Work / Personal / Finance / Spam)

Available Categories

📧 Work — meetings, projects, deadlines
👥 Personal — friends, family, social
💰 Finance — invoices, receipts, payments
⚠️ Spam — ads, scams, unwanted offers

📋 Command Reference

| Command | Purpose |
| --- | --- |
| `python app.py` | Train model and evaluate |
| `python app.py --predict "text"` | Classify single email |
| `python app.py --predict-file file.txt` | Batch classify |
| `python app.py --use-embeddings` | Use embeddings model |
| `python app.py --ensemble` | Use stacking ensemble |
| `python app.py --calibrate` | Calibrate probabilities |
| `python app.py --resample` | Balance classes |
| `python app.py --tune` | Hyperparameter tuning |
| `python app.py --csv data.csv` | Train on custom data |
| `python app.py --min-confidence 0.8` | Confidence threshold |
| `python app.py --help` | Show all options |

🎓 Getting Started with Your Own Data

Prepare Your Dataset

Create a CSV file with two columns:

text,category
"Meeting tomorrow at 9am",Work
"Let's grab dinner!",Personal
"Invoice #12345",Finance
"YOU WON FREE MONEY!!!",Spam

Train Your Model

python app.py --csv my_emails.csv --ensemble --use-embeddings

Evaluate Results

After training, app.py prints the accuracy and a classification report automatically.

⚙️ Model Comparison

| Model | Speed | Accuracy | Memory | Best For |
| --- | --- | --- | --- | --- |
| TF-IDF + LogReg | ⚡⚡⚡ | ⭐⭐ | 💾 | Quick prototyping |
| Embeddings + XGB | ⚡⚡ | ⭐⭐⭐ | 💾💾 | Good balance |
| Stacking Ensemble | ⚡ | ⭐⭐⭐⭐ | 💾💾💾 | Production use |

🚨 Troubleshooting

Q: Model training is slow
A: Use the lighter TF-IDF model (`python app.py --tune`) instead of `--ensemble --use-embeddings`.

Q: Low accuracy on predictions
A: Train with more labeled examples, and check that the labels are consistent. Label quality matters more than quantity.

Q: "AttributeError: 'list' object has no attribute 'apply'"
A: Update to the latest version of this repository.

Q: sklearn/sentence-transformers won't install
A: Use conda: conda install -c conda-forge scikit-learn sentence-transformers xgboost

📝 Dataset Format

Your CSV file must have exactly two columns:

| Column | Type | Example |
| --- | --- | --- |
| `text` | string | "Meeting tomorrow" |
| `category` | string | "Work" |

💪 Features

✨ Multiple Models

Traditional TF-IDF with Logistic Regression
Modern embeddings with XGBoost
Stacking ensemble (best accuracy)

🎯 Smart Preprocessing

Automatic URL removal
Email address cleaning
Stop-word removal
Punctuation normalization
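The four preprocessing steps above can be sketched with regular expressions. This is an illustration, not app.py's implementation: the patterns, the tiny stop-word list, and the choice to replace punctuation with spaces are all assumptions, and `preprocess` is a hypothetical name.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "in", "is", "of", "the", "to", "you"}


def preprocess(text: str) -> str:
    """Lowercase, strip URLs, addresses and punctuation, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs
    text = EMAIL_RE.sub(" ", text)      # remove email addresses
    text = NON_WORD_RE.sub(" ", text)   # replace punctuation with spaces
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)
```

Under these assumptions, the Quick Start message "Hey, are you free for coffee later?" becomes "hey free coffee later".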

🔧 Advanced Tuning

Hyperparameter optimization (GridSearch/RandomSearch)
Class rebalancing (upsampling)
Probability calibration
Learning curve plotting

🛡️ Production-Ready

Confidence thresholds
Model persistence (save/load)
StratifiedKFold cross-validation
Detailed classification reports
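Stratified cross-validation keeps the class mix roughly the same in every fold, which matters when some categories are rare. The sketch below is a simplified, unshuffled illustration of that idea in plain Python. It is not scikit-learn's `StratifiedKFold`, and `stratified_fold_ids` is a hypothetical helper.

```python
from collections import defaultdict
from typing import List, Sequence


def stratified_fold_ids(labels: Sequence[str], n_folds: int) -> List[int]:
    """Assign each sample a fold id so every fold gets a similar class mix."""
    # Collect the sample indices that belong to each class.
    positions = defaultdict(list)
    for index, label in enumerate(labels):
        positions[label].append(index)

    fold_ids = [0] * len(labels)
    next_fold = 0
    for indices in positions.values():
        # Deal each class out to the folds in turn, so per-class counts
        # differ by at most one between any two folds.
        for index in indices:
            fold_ids[index] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return fold_ids
```

Each fold id marks the samples held out for evaluation in that round, while the remaining samples are used for training.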

📚 Requirements

All dependencies are listed in requirements.txt:

scikit-learn>=1.0
pandas
joblib
numpy
matplotlib
sentence-transformers
xgboost

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🤝 Contributing

Found a bug or have an idea? Feel free to open an issue or submit a pull request!

📞 Support

For questions or issues, check the troubleshooting section above or create an issue on GitHub.

Happy classifying! 🚀📬
