Automatically categorize your emails into 6 distinct categories with 98%+ accuracy using machine learning
Try it Live! | View Project | Documentation
In today's digital world, users receive hundreds of emails daily: promotions, social media notifications, forum threads, verification codes, and general updates.
The Challenge:
- Manually sorting through emails is time-consuming
- Misclassified emails (especially spam or verification codes) can have serious consequences
- Important emails get lost in the clutter
Our Solution: An automated email classification system powered by machine learning.
- Three ML Models - Compare predictions from Naive Bayes, Logistic Regression, and Random Forest
- High Accuracy - Achieves 97-98% classification accuracy
- Confidence Scoring - See how confident each model is about its prediction
- Real-time Classification - Instant results via interactive web interface
- Automatic Text Preprocessing - Handles cleaning, tokenization, and feature extraction
- Cloud Deployed - Access anywhere via Streamlit Cloud
The system classifies emails into 6 distinct categories:
| Category | Description | Example |
|---|---|---|
| Promotions | Marketing emails, sales, offers | "50% OFF - Limited Time Sale!" |
| Spam | Unwanted/suspicious emails | "You've won $1,000,000!" |
| Social Media | Notifications from social platforms | "John liked your post" |
| Forum | Discussion boards, community updates | "New reply to your thread" |
| Verification Code | OTP, 2FA codes, account verification | "Your code is 482917" |
| Updates | General newsletters, product updates | "Weekly digest from Medium" |
```
Input Email Text
        ↓
Text Preprocessing Pipeline
  • Lowercasing
  • Punctuation Removal
  • Stopwords Removal
  • Lemmatization
        ↓
TF-IDF Vectorization
        ↓
Three ML Models (Parallel Prediction)
  • Naive Bayes
  • Logistic Regression
  • Random Forest
        ↓
Category + Confidence Score
```
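The pipeline above can be sketched end to end with scikit-learn. This is a minimal illustration on a made-up four-email corpus, not the project's trained models; the category labels mirror those used elsewhere in this README:

```python
# Toy end-to-end sketch: preprocess-free TF-IDF + Naive Bayes on illustrative data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "50% off limited time sale",
    "your verification code is 482917",
    "john liked your post",
    "new reply to your thread",
]
labels = ["promotions", "verify_code", "social_media", "forum"]

# TF-IDF Vectorization step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One of the three parallel models
model = MultinomialNB()
model.fit(X, labels)

# Category + confidence score for a new email
features = vectorizer.transform(["use code 993201 within 10 minutes"])
probs = model.predict_proba(features)[0]
best = probs.argmax()
print(model.classes_[best], round(probs[best], 3))
```

In the real app the same `transform`/`predict_proba` calls run once per model, and the highest class probability is surfaced as the confidence score.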
Evaluated on 2,696 test emails across 6 categories:
| Model | Accuracy | Precision | Recall | F1-Score | Key Strength |
|---|---|---|---|---|---|
| Logistic Regression | 98.48% | 98.49% | 98.48% | 98.48% | Best overall performer |
| Random Forest | 98.18% | 98.20% | 98.18% | 98.18% | Robust predictions |
| Naive Bayes | 97.92% | 97.94% | 97.92% | 97.92% | Fastest inference |
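The metrics in the table can be reproduced with scikit-learn's standard scoring utilities. The labels below are illustrative stand-ins, not the actual 2,696-email test set:

```python
# Computing accuracy / weighted precision / recall / F1, as in the table above.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["spam", "promotions", "verify_code", "spam", "forum"]
y_pred = ["spam", "promotions", "verify_code", "promotions", "forum"]  # one mistake

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} f1={f1:.2%}")
```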
All models achieve 96-100% precision across categories:
- Verification Code: 99-100% (most reliable)
- Promotions: 98-100% (highly accurate)
- Spam: 98-99% (excellent detection)
- Social Media: 98-99% (consistent)
- Forum: 96-98% (strong performance)
- Updates: 96-97% (slightly challenging due to varied content)
- Machine Learning: scikit-learn (Naive Bayes, Logistic Regression, Random Forest)
- NLP Processing: NLTK (stopwords, lemmatization)
- Feature Engineering: TF-IDF Vectorization
- Model Persistence: Joblib
- Web Framework: Streamlit
- Data Handling: Pandas, NumPy
- Deployment: Streamlit Cloud
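Model persistence with Joblib amounts to a `dump` at training time and a `load` at app startup. A minimal sketch, using a toy stand-in model and the file names from the project layout (written to the working directory here for simplicity):

```python
# Save trained artifacts with joblib, then reload them as app.py would.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["big sale today", "your code is 1234"]  # illustrative training data
labels = ["promotions", "verify_code"]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Training side: persist model and fitted vectorizer
joblib.dump(model, "logistic_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

# Serving side: load once at startup, reuse for every request
model = joblib.load("logistic_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")
print(model.predict(vectorizer.transform(["flash sale ends tonight"]))[0])
```

Persisting the fitted vectorizer alongside each model matters: predictions are only valid if inference uses the exact vocabulary and IDF weights learned at training time.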
Prerequisites:
- Python 3.8+
- pip or conda package manager

1. Clone the repository:
```bash
git clone https://github.com/yourusername/email-classifier.git
cd email-classifier
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Download NLTK data (first time only):
```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"
```
4. Run the app:
```bash
streamlit run app.py
```
5. Open your browser and navigate to http://localhost:8501
- Enter Email Text - Paste any email content into the text area
- Click Classify - Models automatically preprocess and predict
- View Results - See predictions from all three models with confidence scores
Input:

```
Your verification code is 482917.
Please use this code within 10 minutes.
```

Output:

```
=== Model Predictions ===

Naive Bayes:
  Category: verify_code
  Confidence: 99.2%

Logistic Regression:
  Category: verify_code
  Confidence: 99.0%

Random Forest:
  Category: verify_code
  Confidence: 86.0%
```
```
email-classifier/
├── app.py                       # Streamlit web application
├── utils.py                     # Preprocessing & prediction functions
├── requirements.txt             # Python dependencies
├── models/
│   ├── naive_bayes_model.pkl    # Trained Naive Bayes model
│   ├── logistic_model.pkl       # Trained Logistic Regression
│   ├── random_forest_model.pkl  # Trained Random Forest
│   └── tfidf_vectorizer.pkl     # Fitted TF-IDF vectorizer
├── notebooks/
│   └── training.ipynb           # Model training & evaluation
└── README.md                    # Project documentation
```
```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text_fn(text):
    text = text.lower()                                              # lowercase
    text = ''.join(c for c in text if c not in string.punctuation)   # remove punctuation
    words = [lemmatizer.lemmatize(w)                                 # lemmatize
             for w in text.split()
             if w not in stop_words]                                 # remove stopwords
    return ' '.join(words)

df['clean_text'] = df['text'].apply(clean_text_fn)
```

- TF-IDF Vectorization converts preprocessed text into numerical features
- Captures word importance across the entire corpus
- Down-weights ubiquitous words while emphasizing terms that distinguish categories
Why Logistic Regression?
- Best accuracy (98.48%) among all models
- Fast inference - suitable for real-time applications
- Balanced precision-recall across all categories
- Low memory footprint - ideal for deployment
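All three candidates train on the same TF-IDF features, which makes side-by-side comparison straightforward. A toy sketch (data and hyperparameters illustrative, not the project's training configuration):

```python
# Train the three candidate models on identical TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

texts = [
    "huge discount today",
    "you won a prize claim now",
    "your login code is 5531",
    "new comment on your thread",
]
labels = ["promotions", "spam", "verify_code", "forum"]

X = TfidfVectorizer().fit_transform(texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    clf.fit(X, labels)
    # Score on the training set here purely for illustration; the real
    # comparison in the table above uses a held-out test set.
    print(name, clf.score(X, labels))
```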
- Batch Processing - Upload CSV files with multiple emails
- Advanced Visualizations - Interactive charts for predictions
- Active Learning - User feedback loop to improve accuracy
- Multi-language Support - Classify emails in different languages
- Email API Integration - Direct Gmail/Outlook integration
- Deep Learning Models - Experiment with BERT/Transformers
- Mobile App - Native iOS/Android application
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Live Demo • Report Bug • Request Feature
Made with ❤️ and ☕