Automatically categorize your emails with intelligent AI classification.
MailSort AI is a smart email classifier that automatically sorts incoming messages into categories like Work, Personal, Finance, and Spam.
✅ Privacy-first — runs completely on your machine
✅ Fast — classifies emails in milliseconds
✅ Accurate — uses advanced machine learning models
✅ Easy to use — simple command-line interface
✅ Customizable — train on your own email data

Install the dependencies, train the model, then classify a message:

    pip install -r requirements.txt
    python app.py
    python app.py --predict "Hey, are you free for coffee later?"

Output:

    Input: Hey, are you free for coffee later?
    -> Predicted: Personal

✨ That's it! Your classifier is ready.

Classify a single email:

    python app.py --predict "Limited time offer! Buy now!"

Classify multiple emails from a file (one per line):

    python app.py --predict-file emails.txt

Train on your own dataset:

    python app.py --csv my_emails.csv

Your CSV must have the columns `text` and `category`.

Option 1: Fast TF-IDF + Logistic Regression (default)

    python app.py --tune

✅ Fast | ✅ Lightweight

Option 2: Embeddings + XGBoost (more accurate)

    python app.py --use-embeddings --random-search

✅ High accuracy

Option 3: Stacking Ensemble (best accuracy) ⭐

    python app.py --ensemble --use-embeddings --calibrate --resample

✅ Highest accuracy | ✅ Confidence scores

Handle imbalanced data (upsample minority classes):

    python app.py --resample

Calibrate probabilities (for reliable confidence scores):

    python app.py --ensemble --calibrate

Plot learning curves (see model improvement):

    python app.py --plot-learning-curve

Require a confidence threshold (reject uncertain predictions):

    python app.py --predict "Your email here" --min-confidence 0.7

Use a custom embedding model:

    python app.py --use-embeddings --embed-model "all-MiniLM-L6-v2"

Classify a single email:

    python app.py --predict "Meeting at 3 PM tomorrow in conference room B"

Classify a file of emails with a confidence threshold:

    # Create emails.txt with one email per line
    python app.py --predict-file emails.txt --min-confidence 0.8

Train the full ensemble on your own labeled data and save the model:

    python app.py \
      --csv my_labeled_emails.csv \
      --ensemble \
      --use-embeddings \
      --calibrate \
      --resample \
      --save models/my_classifier.joblib

Classify with the ensemble and a stricter confidence threshold:

    python app.py \
      --ensemble \
      --use-embeddings \
      --predict "Check out our new product!" \
      --min-confidence 0.85

Requirements:

- Python 3.8 or higher
- pip or conda

Installation steps:

1. Clone or download this repository.
2. Install Python dependencies:

       python -m pip install --upgrade pip setuptools wheel
       python -m pip install -r requirements.txt

If scikit-learn fails to build (on some systems), use conda:

    conda install -c conda-forge scikit-learn pandas joblib sentence-transformers xgboost

How a message flows through the classifier:

    Email Text
        ↓
    Preprocessing (clean, lowercase, remove URLs)
        ↓
    Feature Extraction (TF-IDF or Embeddings)
        ↓
    ML Model (Logistic Regression / XGBoost / Ensemble)
        ↓
    Category Prediction (Work / Personal / Finance / Spam)

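
The preprocessing stage of the pipeline above can be sketched in a few lines of standard-library Python. This is an illustration only: the function name `preprocess`, the regular expressions, and the tiny stop-word set are assumptions made for the example, and the rules in `app.py` may differ.

```python
import re

# Hypothetical patterns and stop words, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "in", "on", "to", "of", "and", "for"}


def preprocess(text: str) -> str:
    """Lowercase, drop URLs and email addresses, strip punctuation and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # remove URLs before punctuation is stripped
    text = EMAIL_RE.sub(" ", text)     # remove email addresses
    text = NON_WORD_RE.sub(" ", text)  # replace remaining punctuation with spaces
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("Meeting at 3 PM! Details: https://example.com/agenda"))
```

URLs and addresses are removed before punctuation is stripped; otherwise their fragments (such as `https` or `com`) would survive as ordinary tokens.
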
- 📧 Work — meetings, projects, deadlines
- 👥 Personal — friends, family, social
- 💰 Finance — invoices, receipts, payments
- ⚠️ Spam — ads, scams, unwanted offers

| Command | Purpose |
|---|---|
| `python app.py` | Train model and evaluate |
| `python app.py --predict "text"` | Classify single email |
| `python app.py --predict-file file.txt` | Batch classify |
| `python app.py --use-embeddings` | Use embeddings model |
| `python app.py --ensemble` | Use stacking ensemble |
| `python app.py --calibrate` | Calibrate probabilities |
| `python app.py --resample` | Balance classes |
| `python app.py --tune` | Hyperparameter tuning |
| `python app.py --csv data.csv` | Train on custom data |
| `python app.py --min-confidence 0.8` | Confidence threshold |
| `python app.py --help` | Show all options |

Create a CSV file with two columns:

    text,category
    "Meeting tomorrow at 9am",Work
    "Let's grab dinner!",Personal
    "Invoice #12345",Finance
    "YOU WON FREE MONEY!!!",Spam

Then train on it:

    python app.py --csv my_emails.csv --ensemble --use-embeddings

The model prints accuracy and a classification report automatically.

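
Real emails often contain commas and double quotes, which makes hand-written CSV rows easy to get wrong. One way around that, sketched here under the assumption that your labeled examples are already available as `(text, category)` pairs, is to let Python's `csv` module handle the quoting. The helper name `write_training_csv` and the sample rows are made up for this example.

```python
import csv
import io

# Made-up examples; note the comma and the embedded quotes.
examples = [
    ("Meeting tomorrow at 9am", "Work"),
    ("Let's grab dinner, maybe at 7?", "Personal"),
    ('Invoice #12345 for "Q3 consulting"', "Finance"),
]


def write_training_csv(rows, handle):
    """Write (text, category) pairs under the header the README describes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["text", "category"])
    writer.writerows(rows)  # fields with commas or quotes are escaped automatically


if __name__ == "__main__":
    buffer = io.StringIO()
    write_training_csv(examples, buffer)
    print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("my_emails.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` documentation recommends for files handed to a writer.
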
| Model | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| TF-IDF + LogReg | ⚡⚡⚡ | ⭐⭐ | 💾 | Quick prototyping |
| Embeddings + XGB | ⚡⚡ | ⭐⭐⭐ | 💾💾 | Good balance |
| Stacking Ensemble | ⚡ | ⭐⭐⭐⭐ | 💾💾💾 | Production use |
Q: Model training is slow
A: Use `python app.py --tune` instead of `--ensemble --use-embeddings`.

Q: Low accuracy on predictions
A: Train with more labeled examples, prioritizing label quality over quantity.

Q: "AttributeError: 'list' object has no attribute 'apply'"
A: Update to the latest version.

Q: sklearn/sentence-transformers won't install
A: Use conda: `conda install -c conda-forge scikit-learn sentence-transformers xgboost`

Your CSV file must have exactly two columns:
| Column | Type | Example |
|---|---|---|
| `text` | string | "Meeting tomorrow" |
| `category` | string | "Work" |

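
Before starting a long training run, it can help to confirm that a file matches the format above. The check below is a minimal sketch using only the standard library; `validate_training_csv` is a hypothetical helper and not part of `app.py`, and it assumes column order does not matter.

```python
import csv
import io

REQUIRED_COLUMNS = ["text", "category"]


def validate_training_csv(handle) -> list:
    """Return a list of problems found; an empty list means the file looks usable."""
    reader = csv.DictReader(handle)
    # Compare sorted names so that column order is not enforced (an assumption).
    if sorted(reader.fieldnames or []) != sorted(REQUIRED_COLUMNS):
        return [f"expected columns {REQUIRED_COLUMNS}, found {reader.fieldnames}"]
    problems = []
    for row_no, row in enumerate(reader, start=1):  # data rows, header excluded
        if not (row["text"] or "").strip():
            problems.append(f"row {row_no}: empty text")
        if not (row["category"] or "").strip():
            problems.append(f"row {row_no}: empty category")
    return problems


if __name__ == "__main__":
    sample = io.StringIO('text,category\n"Meeting tomorrow",Work\n,Spam\n')
    print(validate_training_csv(sample))
```

For a file on disk, pass `open("my_emails.csv", newline="", encoding="utf-8")` as the handle.
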
✨ Multiple Models
- Traditional TF-IDF with Logistic Regression
- Modern embeddings with XGBoost
- Stacking ensemble (best accuracy)
🎯 Smart Preprocessing
- Automatic URL removal
- Email address cleaning
- Stop-word removal
- Punctuation normalization
🔧 Advanced Tuning
- Hyperparameter optimization (GridSearch/RandomSearch)
- Class rebalancing (upsampling)
- Probability calibration
- Learning curve plotting
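
To make the class rebalancing item concrete, here is a minimal sketch of random upsampling in plain Python. It illustrates the general technique only; the function name `upsample` and its `seed` parameter are assumptions, and the exact strategy behind `--resample` may differ.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one.

    `rows` is a list of (text, category) pairs.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for text, category in rows:
        by_category[category].append((text, category))
    if not by_category:
        return []
    target = max(len(group) for group in by_category.values())
    balanced = []
    for group in by_category.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    data = [("a", "Work"), ("b", "Work"), ("c", "Work"), ("d", "Spam")]
    print(Counter(category for _, category in upsample(data)))
```

Upsampling is normally applied to the training split only. Duplicating rows before the train/test split would leak copies of training examples into the evaluation set and inflate the reported accuracy.
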
🛡️ Production-Ready
- Confidence thresholds
- Model persistence (save/load)
- StratifiedKFold cross-validation
- Detailed classification reports
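
A confidence threshold such as `--min-confidence 0.7` can be pictured as a small post-processing step on the model's class probabilities. The sketch below is hypothetical: `apply_threshold` and the `"Uncertain"` fallback label are names chosen for illustration, and how `app.py` actually reports a rejected prediction is not specified here.

```python
def apply_threshold(probabilities, min_confidence=0.7, fallback="Uncertain"):
    """Return (label, confidence); the label is `fallback` when confidence is too low.

    `probabilities` maps each category name to its predicted probability.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # Pick the category with the highest predicted probability.
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return fallback, confidence
    return label, confidence


if __name__ == "__main__":
    scores = {"Work": 0.55, "Personal": 0.30, "Finance": 0.10, "Spam": 0.05}
    print(apply_threshold(scores, min_confidence=0.7))
```

A threshold is only as meaningful as the probabilities behind it, which is why the examples in this README pair confidence thresholds with `--calibrate`.
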

All dependencies are listed in `requirements.txt`:

    scikit-learn>=1.0
    pandas
    joblib
    numpy
    matplotlib
    sentence-transformers
    xgboost

This project is licensed under the MIT License — see LICENSE file for details.
Found a bug or have an idea? Feel free to open an issue or submit a pull request!
For questions or issues, check the troubleshooting section above or create an issue on GitHub.
Happy classifying! 🚀📬