8bitjawad/heimdall

📧 Heimdall - Intelligent Email Classifier

Automatically categorize your emails into 6 distinct categories with up to 98.5% test accuracy using machine learning

🔗 Try it Live! | 📊 View Project | 📝 Documentation


🎯 Problem Statement

Users receive hundreds of emails every day: promotions, social media notifications, forum posts, verification codes, and general updates.

The Challenge:

  • ⏰ Manually sorting through emails is time-consuming
  • 🚨 Misclassified emails (especially spam or verification codes) can have serious consequences
  • 📬 Important emails get lost in the clutter

Our Solution: An automated email classification system powered by machine learning.


✨ Features

  • 🤖 Three ML Models - Compare predictions from Naive Bayes, Logistic Regression, and Random Forest
  • 🎯 High Accuracy - Achieves 97.9-98.5% accuracy on the test set
  • 📊 Confidence Scoring - See how confident each model is about its prediction
  • 🚀 Real-time Classification - Instant results via an interactive web interface
  • 🧹 Automatic Text Preprocessing - Handles cleaning, tokenization, and feature extraction
  • 🌐 Cloud Deployed - Access anywhere via Streamlit Cloud

📂 Email Categories

The system classifies emails into 6 distinct categories:

| Category | Description | Example |
|---|---|---|
| 🎁 Promotions | Marketing emails, sales, offers | "50% OFF - Limited Time Sale!" |
| 🚫 Spam | Unwanted/suspicious emails | "You've won $1,000,000!" |
| 📱 Social Media | Notifications from social platforms | "John liked your post" |
| 💬 Forum | Discussion boards, community updates | "New reply to your thread" |
| 🔐 Verification Code | OTP, 2FA codes, account verification | "Your code is 482917" |
| 📰 Updates | General newsletters, product updates | "Weekly digest from Medium" |

🛠️ Solution Architecture

Input Email Text
       ↓
Text Preprocessing Pipeline
  • Lowercasing
  • Punctuation Removal
  • Stopwords Removal
  • Lemmatization
       ↓
TF-IDF Vectorization
       ↓
Three ML Models (Parallel Prediction)
  • Naive Bayes
  • Logistic Regression
  • Random Forest
       ↓
Category + Confidence Score
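
The flow above can be sketched as one small orchestration function. This is a minimal sketch, not the repository's code: `classify_with_all` and its parameters are hypothetical names, each model is assumed to follow the scikit-learn classifier interface (`classes_` and `predict_proba`), and the confidence score is assumed to be the highest class probability.

```python
from typing import Any, Callable, Dict, List, Tuple


def classify_with_all(
    text: str,
    preprocess: Callable[[str], str],
    vectorize: Callable[[List[str]], Any],
    models: Dict[str, Any],
) -> Dict[str, Tuple[str, float]]:
    """Run one email through every model; return (category, confidence) per model.

    Hypothetical helper. `vectorize` takes a batch of cleaned strings, so a
    fitted vectorizer's `transform` method could be passed in directly.
    """
    features = vectorize([preprocess(text)])  # batch containing one sample
    results = {}
    for name, model in models.items():
        # predict_proba returns one row of class probabilities per sample.
        probabilities = model.predict_proba(features)[0]
        best = max(range(len(probabilities)), key=lambda i: probabilities[i])
        # Assumption: confidence is the probability of the winning class.
        results[name] = (model.classes_[best], float(probabilities[best]))
    return results
```

Passing the models as a dictionary keeps the three classifiers interchangeable, so adding or removing one does not change the calling code.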

📊 Model Performance

Evaluated on 2,696 test emails across 6 categories:

| Model | Accuracy | Precision | Recall | F1-Score | Key Strength |
|---|---|---|---|---|---|
| Logistic Regression ⭐ | 98.48% | 98.49% | 98.48% | 98.48% | Best overall performer |
| Random Forest | 98.18% | 98.20% | 98.18% | 98.18% | Robust predictions |
| Naive Bayes | 97.92% | 97.94% | 97.92% | 97.92% | Fastest inference |
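
The table above does not say how the multi-class precision, recall, and F1 figures were averaged. The identical accuracy and recall columns are consistent with support-weighted averaging, because weighted recall always equals accuracy. The sketch below assumes that averaging scheme; `weighted_metrics` is a hypothetical helper written for illustration, not code from the training notebook.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)    # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total  # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Each class contributes `(n / total) * (correct / n)` to recall, which sums to `correct / total`, the definition of accuracy.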

🎯 Category-wise Performance

Per-category precision ranges from 96% to 100% across all three models:

  • ✅ Verification Code: 99-100% (most reliable)
  • ✅ Promotions: 98-100% (highly accurate)
  • ✅ Spam: 98-99% (excellent detection)
  • ✅ Social Media: 98-99% (consistent)
  • ✅ Forum: 96-98% (strong performance)
  • ⚠️ Updates: 96-97% (slightly challenging due to varied content)

💻 Tech Stack

  • Machine Learning: scikit-learn (Naive Bayes, Logistic Regression, Random Forest)
  • NLP Processing: NLTK (stopwords, lemmatization)
  • Feature Engineering: TF-IDF Vectorization
  • Model Persistence: Joblib
  • Web Framework: Streamlit
  • Data Handling: Pandas, NumPy
  • Deployment: Streamlit Cloud

🚀 Quick Start

Prerequisites

Python 3.8+
pip or conda package manager

Installation

1️⃣ Clone the repository

git clone https://github.com/yourusername/email-classifier.git
cd email-classifier

2️⃣ Install dependencies

pip install -r requirements.txt

3️⃣ Download NLTK data (first time only)

python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"

4️⃣ Run the app

streamlit run app.py

5️⃣ Open your browser and navigate to http://localhost:8501


🎮 How to Use

Web Interface

  1. Enter Email Text - Paste any email content into the text area
  2. Click Classify - Models automatically preprocess and predict
  3. View Results - See predictions from all three models with confidence scores

Example Usage

Input:

Your verification code is 482917. 
Please use this code within 10 minutes.

Output:

=== Model Predictions ===

Naive Bayes:
  Category: verify_code
  Confidence: 99.2% ✅

Logistic Regression:
  Category: verify_code
  Confidence: 99.0% ✅

Random Forest:
  Category: verify_code
  Confidence: 86.0% ✅

๐Ÿ“ Project Structure

email-classifier/
โ”œโ”€โ”€ app.py                      # Streamlit web application
โ”œโ”€โ”€ utils.py                    # Preprocessing & prediction functions
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ”œโ”€โ”€ models/
โ”‚   โ”œโ”€โ”€ naive_bayes_model.pkl   # Trained Naive Bayes model
โ”‚   โ”œโ”€โ”€ logistic_model.pkl      # Trained Logistic Regression
โ”‚   โ”œโ”€โ”€ random_forest_model.pkl # Trained Random Forest
โ”‚   โ””โ”€โ”€ tfidf_vectorizer.pkl    # Fitted TF-IDF vectorizer
โ”œโ”€โ”€ notebooks/
โ”‚   โ””โ”€โ”€ training.ipynb          # Model training & evaluation
โ””โ”€โ”€ README.md                   # Project documentation

🔬 Methodology

1. Data Preprocessing Pipeline

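import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'stopwords' and 'wordnet' data (see Quick Start, step 3).
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text_fn(text):
    """Lowercase, strip punctuation, drop stopwords, and lemmatize."""
    text = text.lower()  # lowercase
    text = ''.join([c for c in text if c not in string.punctuation])  # remove punctuation
    words = text.split()
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]  # remove stopwords & lemmatize
    return ' '.join(words)

# df is assumed to be a pandas DataFrame holding the raw email body in a 'text' column.
df['clean_text'] = df['text'].apply(clean_text_fn)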

2. Feature Extraction

  • TF-IDF Vectorization converts preprocessed text into numerical features
  • Weights each word by how often it appears in an email, scaled by how rare it is across the entire corpus
  • Down-weights words that occur in most emails, so category-specific terms carry more signal
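
To make the weighting concrete, here is a dependency-free sketch of TF-IDF. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the per-document L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. Whether the project keeps those defaults is an assumption, whitespace tokenization is a simplification, and `tfidf` is a hypothetical function name.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """TF-IDF with smoothed IDF and L2 normalization per document."""
    tokenized = [doc.split() for doc in docs]  # simplification: split on whitespace
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        # Raw term count times IDF, then scale the vector to unit length.
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors
```

For the two cleaned emails `"verification code"` and `"sale code"`, the shared word `code` receives a lower weight than `verification` or `sale`, because it appears in both documents.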

3. Model Selection Rationale

Why Logistic Regression?

  • ⚡ Best accuracy (98.48%) among all models
  • 🚀 Fast inference - suitable for real-time applications
  • 📊 Balanced precision-recall across all categories
  • 💾 Low memory footprint - ideal for deployment

📈 Future Enhancements

  • 📦 Batch Processing - Upload CSV files with multiple emails
  • 📊 Advanced Visualizations - Interactive charts for predictions
  • 🔄 Active Learning - User feedback loop to improve accuracy
  • 🌍 Multi-language Support - Classify emails in different languages
  • 📧 Email API Integration - Direct Gmail/Outlook integration
  • 🧠 Deep Learning Models - Experiment with BERT/Transformers
  • 📱 Mobile App - Native iOS/Android application

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

โญ Star this repo if you found it helpful!

Live Demo โ€ข Report Bug โ€ข Request Feature

Made with โค๏ธ and โ˜•
