Automatically categorize your emails into 6 distinct categories with 98%+ accuracy using machine learning
Try it Live! | View Project | Documentation
In today's digital world, users receive hundreds of emails daily: promotions, social media notifications, forum threads, verification codes, and general updates.
The Challenge:
- Manually sorting through emails is time-consuming
- Misclassified emails (especially spam or verification codes) can have serious consequences
- Important emails get lost in the clutter
Our Solution: An automated email classification system powered by machine learning.
- Three ML Models - Compare predictions from Naive Bayes, Logistic Regression, and Random Forest
- High Accuracy - Achieves 97-98% classification accuracy
- Confidence Scoring - See how confident each model is about its prediction
- Real-time Classification - Instant results via interactive web interface
- Automatic Text Preprocessing - Handles cleaning, tokenization, and feature extraction
- Cloud Deployed - Access anywhere via Streamlit Cloud
The system classifies emails into 6 distinct categories:
| Category | Description | Example |
|---|---|---|
| Promotions | Marketing emails, sales, offers | "50% OFF - Limited Time Sale!" |
| Spam | Unwanted/suspicious emails | "You've won $1,000,000!" |
| Social Media | Notifications from social platforms | "John liked your post" |
| Forum | Discussion boards, community updates | "New reply to your thread" |
| Verification Code | OTP, 2FA codes, account verification | "Your code is 482917" |
| Updates | General newsletters, product updates | "Weekly digest from Medium" |
```
Input Email Text
        ↓
Text Preprocessing Pipeline
  • Lowercasing
  • Punctuation Removal
  • Stopwords Removal
  • Lemmatization
        ↓
TF-IDF Vectorization
        ↓
Three ML Models (Parallel Prediction)
  • Naive Bayes
  • Logistic Regression
  • Random Forest
        ↓
Category + Confidence Score
```
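The pipeline above can be sketched end to end with scikit-learn. This is a minimal illustration on a made-up four-email corpus, not the project's trained models; the category labels mirror those used elsewhere in this README:

```python
# Toy end-to-end sketch: preprocess-free TF-IDF + Naive Bayes on illustrative data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "50% off limited time sale",
    "your verification code is 482917",
    "john liked your post",
    "new reply to your thread",
]
labels = ["promotions", "verify_code", "social_media", "forum"]

# TF-IDF Vectorization step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One of the three parallel models
model = MultinomialNB()
model.fit(X, labels)

# Category + confidence score for a new email
features = vectorizer.transform(["use code 993201 within 10 minutes"])
probs = model.predict_proba(features)[0]
best = probs.argmax()
print(model.classes_[best], round(probs[best], 3))
```

In the real app the same `transform`/`predict_proba` calls run once per model, and the highest class probability is surfaced as the confidence score.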
Evaluated on 2,696 test emails across 6 categories:
| Model | Accuracy | Precision | Recall | F1-Score | Key Strength |
|---|---|---|---|---|---|
| Logistic Regression | 98.48% | 98.49% | 98.48% | 98.48% | Best overall performer |
| Random Forest | 98.18% | 98.20% | 98.18% | 98.18% | Robust predictions |
| Naive Bayes | 97.92% | 97.94% | 97.92% | 97.92% | Fastest inference |
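The metrics in the table can be reproduced with scikit-learn's standard scoring utilities. The labels below are illustrative stand-ins, not the actual 2,696-email test set:

```python
# Computing accuracy / weighted precision / recall / F1, as in the table above.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["spam", "promotions", "verify_code", "spam", "forum"]
y_pred = ["spam", "promotions", "verify_code", "promotions", "forum"]  # one mistake

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} f1={f1:.2%}")
```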
All models achieve 96-100% precision across categories:
- Verification Code: 99-100% (most reliable)
- Promotions: 98-100% (highly accurate)
- Spam: 98-99% (excellent detection)
- Social Media: 98-99% (consistent)
- Forum: 96-98% (strong performance)
- Updates: 96-97% (slightly challenging due to varied content)
- Machine Learning: scikit-learn (Naive Bayes, Logistic Regression, Random Forest)
- NLP Processing: NLTK (stopwords, lemmatization)
- Feature Engineering: TF-IDF Vectorization
- Model Persistence: Joblib
- Web Framework: Streamlit
- Data Handling: Pandas, NumPy
- Deployment: Streamlit Cloud
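Model persistence with Joblib amounts to a `dump` at training time and a `load` at app startup. A minimal sketch, using a toy stand-in model and the file names from the project layout (written to the working directory here for simplicity):

```python
# Save trained artifacts with joblib, then reload them as app.py would.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["big sale today", "your code is 1234"]  # illustrative training data
labels = ["promotions", "verify_code"]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Training side: persist model and fitted vectorizer
joblib.dump(model, "logistic_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

# Serving side: load once at startup, reuse for every request
model = joblib.load("logistic_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")
print(model.predict(vectorizer.transform(["flash sale ends tonight"]))[0])
```

Persisting the fitted vectorizer alongside each model matters: predictions are only valid if inference uses the exact vocabulary and IDF weights learned at training time.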
Prerequisites:
- Python 3.8+
- pip or conda package manager

1. Clone the repository:
```bash
git clone https://github.com/yourusername/email-classifier.git
cd email-classifier
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Download NLTK data (first time only):
```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"
```
4. Run the app:
```bash
streamlit run app.py
```
5. Open your browser and navigate to http://localhost:8501
- Enter Email Text - Paste any email content into the text area
- Click Classify - Models automatically preprocess and predict
- View Results - See predictions from all three models with confidence scores
Input:

```
Your verification code is 482917.
Please use this code within 10 minutes.
```

Output:

```
=== Model Predictions ===

Naive Bayes:
  Category: verify_code
  Confidence: 99.2%

Logistic Regression:
  Category: verify_code
  Confidence: 99.0%

Random Forest:
  Category: verify_code
  Confidence: 86.0%
```
```
email-classifier/
├── app.py                       # Streamlit web application
├── utils.py                     # Preprocessing & prediction functions
├── requirements.txt             # Python dependencies
├── models/
│   ├── naive_bayes_model.pkl    # Trained Naive Bayes model
│   ├── logistic_model.pkl       # Trained Logistic Regression
│   ├── random_forest_model.pkl  # Trained Random Forest
│   └── tfidf_vectorizer.pkl     # Fitted TF-IDF vectorizer
├── notebooks/
│   └── training.ipynb           # Model training & evaluation
└── README.md                    # Project documentation
```
```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text_fn(text):
    text = text.lower()                                              # lowercase
    text = ''.join(c for c in text if c not in string.punctuation)   # remove punctuation
    words = [lemmatizer.lemmatize(w)                                 # lemmatize
             for w in text.split()
             if w not in stop_words]                                 # remove stopwords
    return ' '.join(words)

df['clean_text'] = df['text'].apply(clean_text_fn)
```

- TF-IDF Vectorization converts preprocessed text into numerical features
- Captures word importance across the entire corpus
- Down-weights ubiquitous words while emphasizing terms that distinguish categories
Why Logistic Regression?
- Best accuracy (98.48%) among all models
- Fast inference - suitable for real-time applications
- Balanced precision-recall across all categories
- Low memory footprint - ideal for deployment
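All three candidates train on the same TF-IDF features, which makes side-by-side comparison straightforward. A toy sketch (data and hyperparameters illustrative, not the project's training configuration):

```python
# Train the three candidate models on identical TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

texts = [
    "huge discount today",
    "you won a prize claim now",
    "your login code is 5531",
    "new comment on your thread",
]
labels = ["promotions", "spam", "verify_code", "forum"]

X = TfidfVectorizer().fit_transform(texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    clf.fit(X, labels)
    # Score on the training set here purely for illustration; the real
    # comparison in the table above uses a held-out test set.
    print(name, clf.score(X, labels))
```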
- Batch Processing - Upload CSV files with multiple emails
- Advanced Visualizations - Interactive charts for predictions
- Active Learning - User feedback loop to improve accuracy
- Multi-language Support - Classify emails in different languages
- Email API Integration - Direct Gmail/Outlook integration
- Deep Learning Models - Experiment with BERT/Transformers
- Mobile App - Native iOS/Android application
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Live Demo • Report Bug • Request Feature
Made with ❤️ and ☕