# Email Domain Classifier

A Python library for classifying emails by domain using dual-method validation, with optional LLM-based semantic classification. Designed to process large datasets efficiently via streaming.

## Screenshots

### Dataset Analysis

Analyze your dataset before classification to understand label distribution, body lengths, and data quality.

### Classification in Progress

Real-time progress tracking with a configuration display.

### Classification Results

A domain distribution table showing the classification breakdown.

### Processing Summary

Detailed results with validation stats, performance metrics, and output files.

## Quick Start

```bash
# Install
git clone git@github.com:montimage/email-domain-classifier.git && cd email-domain-classifier
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Analyze dataset
email-cli info sample_emails.csv

# Classify emails
email-cli sample_emails.csv -o output/

# Classify with LLM (requires additional setup - see LLM Classification section)
email-cli sample_emails.csv -o output/ --use-llm
```

## Supported Domains

| Domain | Description |
|---|---|
| Finance | Banking, payments, financial services |
| Technology | Software, hardware, IT services |
| Retail | E-commerce, shopping, consumer goods |
| Logistics | Shipping, supply chain, transportation |
| Healthcare | Medical services, health insurance |
| Government | Public sector, regulatory agencies |
| HR | Human resources, recruitment |
| Telecommunications | Phone, internet, communication services |
| Social Media | Social platforms, networking services |
| Education | Schools, universities, learning platforms |

## Module Structure

| Module | Description |
|---|---|
| `email_classifier/analyzer.py` | Dataset analysis (`info` command) |
| `email_classifier/classifier.py` | Core classification logic |
| `email_classifier/cli.py` | Command-line interface |
| `email_classifier/domains.py` | Domain definitions and profiles |
| `email_classifier/llm/` | LLM-based classification (optional) |
| `email_classifier/processor.py` | CSV streaming processor |
| `email_classifier/reporter.py` | Report generation |
| `email_classifier/ui.py` | Terminal UI components |
| `email_classifier/validator.py` | Email validation |
| `tests/` | Test suite |
| `docs/` | Documentation |
| `raw-data/` | Sample input datasets (Git LFS) |
| `classified-data/` | Example classification outputs (Git LFS) |
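The streaming design means emails are processed one row at a time rather than loading the whole CSV into memory. A minimal sketch of that pattern (this is an illustration, not the actual `processor.py` code; the column names are assumptions):

```python
import csv
from typing import Iterator

def stream_emails(path: str) -> Iterator[dict]:
    """Yield one email row at a time, so even multi-gigabyte CSVs
    are never held in memory all at once."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

# Demo with a tiny synthetic file:
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("subject,body\nInvoice due,Please pay...\n")
rows = list(stream_emails(f.name))
os.unlink(f.name)
print(rows[0]["subject"])  # Invoice due
```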

## Example Data

The repository includes sample datasets for testing and reference:

**Input (`raw-data/`):**

- `CEAS_08.csv` - CEAS 2008 email dataset (~39K emails, 68MB)
- `sample_emails.csv` - Small sample for quick testing (100 emails)

**Output (`classified-data/ceas_08/`):**

- `email_[domain].csv` - Emails classified by domain (finance, technology, retail, etc.)
- `email_unsure.csv` - Emails that couldn't be confidently classified
- `invalid_emails.csv` - Emails that failed validation
- `skipped_emails.csv` - Emails filtered by body length
- `classification_report.json` - Detailed statistics
- `classification_report.txt` - Human-readable summary

```bash
# Try with sample data
email-cli info raw-data/sample_emails.csv
email-cli raw-data/sample_emails.csv -o output/

# Or use the full CEAS dataset
email-cli raw-data/CEAS_08.csv -o classified-data/my_output/
```

Note: Large CSV files are stored with Git LFS. Run `git lfs pull` after cloning to download them.
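The JSON report can be inspected programmatically. A minimal sketch of a per-domain breakdown (the key names `total_processed` and `domains` are illustrative assumptions, not the library's documented schema):

```python
import json
from pathlib import Path

def domain_shares(path: str) -> dict:
    """Return {domain: percentage} from a classification report.

    NOTE: the keys "total_processed" and "domains" are assumed for
    illustration; check classification_report.json for the real schema.
    """
    report = json.loads(Path(path).read_text())
    total = report.get("total_processed", 0)
    return {
        domain: round(100 * count / total, 1) if total else 0.0
        for domain, count in report.get("domains", {}).items()
    }

# Demo with a synthetic report written to a temp file:
import os
import tempfile

demo = {"total_processed": 200, "domains": {"finance": 120, "technology": 80}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
shares = domain_shares(f.name)
os.unlink(f.name)
print(shares)  # {'finance': 60.0, 'technology': 40.0}
```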

## LLM Classification (Optional)

The classifier supports an optional LLM-based Method 3 that uses semantic analysis for improved classification accuracy. This method complements the existing keyword taxonomy and structural template methods.

### Installation

Install with LLM support for your preferred provider:

```bash
# For Ollama (local, free, no API key required)
pip install -e ".[ollama]"

# For Google Gemini
pip install -e ".[google]"

# For Mistral AI
pip install -e ".[mistral]"

# For Groq (fast inference)
pip install -e ".[groq]"

# For OpenRouter (access to multiple models)
pip install -e ".[openrouter]"

# Install all providers
pip install -e ".[all-llm]"
```

### Configuration

1. Copy the example configuration:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` with your settings:

   ```bash
   # For Ollama (local)
   LLM_PROVIDER=ollama
   LLM_MODEL=llama3.2

   # For Google Gemini
   LLM_PROVIDER=google
   GOOGLE_API_KEY=your-api-key

   # For other providers, set the appropriate API key
   ```

### Usage

```bash
# Enable LLM classification (hybrid mode - default)
email-cli sample_emails.csv -o output/ --use-llm

# Force LLM for every email (original three-method behavior)
email-cli sample_emails.csv -o output/ --use-llm --force-llm
```

### Hybrid Workflow (Default with `--use-llm`)

When you enable LLM classification with `--use-llm`, the classifier uses a hybrid workflow by default:

1. Run both classic classifiers on each email:
   - Method 1: Keyword Taxonomy Matching
   - Method 2: Structural Template Matching
2. Check for agreement:
   - If both classifiers agree on the domain → accept that result and skip the LLM (saves API costs/time)
   - If they disagree → invoke the LLM to determine the final classification

This hybrid approach significantly reduces LLM API calls while maintaining classification accuracy. In typical datasets, 60-80% of emails can be classified without LLM involvement.
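The decision logic above can be sketched as follows (the function and parameter names here are illustrative, not the library's actual API):

```python
from typing import Callable

def hybrid_classify(
    email: str,
    keyword_classify: Callable[[str], str],
    structural_classify: Callable[[str], str],
    llm_classify: Callable[[str], str],
) -> str:
    """Hybrid workflow sketch: run both classic classifiers first, and
    only fall back to the slower, metered LLM when they disagree."""
    kw = keyword_classify(email)
    st = structural_classify(email)
    if kw == st:                    # agreement -> accept, skip the LLM
        return kw
    return llm_classify(email)      # disagreement -> LLM breaks the tie

# Toy stand-ins for the three methods:
result = hybrid_classify(
    "Your invoice is attached",
    keyword_classify=lambda e: "finance",
    structural_classify=lambda e: "finance",
    llm_classify=lambda e: "technology",  # never called on agreement
)
print(result)  # finance
```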

### Workflow Modes

| Command | Mode | Description |
|---|---|---|
| `email-cli data.csv -o out/` | Dual-method | Keyword + Structural only (no LLM) |
| `email-cli data.csv -o out/ --use-llm` | Hybrid | LLM only when classifiers disagree |
| `email-cli data.csv -o out/ --use-llm --force-llm` | Force LLM | LLM for every email (three-method) |

### Hybrid Workflow Output

When using hybrid mode, additional output is generated:

- `hybrid_workflow.jsonl` - Structured JSON Lines log of every workflow step:

  ```json
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "keyword_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "structural_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "agreement_check", "path": "classic_only", "result": "finance"}
  ```

- Hybrid statistics in reports:

  ```
  HYBRID WORKFLOW STATISTICS
  ────────────────────────────────────────
  Total Processed (Hybrid): 1,000
  Classic Agreement Count:  750
  Agreement Rate:           75.0%
  LLM Calls Made:           250
  LLM Savings:              75.0%
  LLM Avg Response Time:    1,234ms
  ```
    
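The step log lends itself to post-processing, for example to recompute the agreement rate. A minimal sketch (it assumes only the line format shown above; the `"path": "llm"` value for disagreements is an assumption, as the log excerpt only shows `"classic_only"`):

```python
import json

def agreement_rate(jsonl_lines) -> float:
    """Fraction of agreement_check steps that took the classic_only
    path, i.e. where the LLM was skipped."""
    checks = agreed = 0
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("step") == "agreement_check":
            checks += 1
            if event.get("path") == "classic_only":
                agreed += 1
    return agreed / checks if checks else 0.0

log = [
    '{"email_idx": 0, "step": "agreement_check", "path": "classic_only", "result": "finance"}',
    '{"email_idx": 1, "step": "agreement_check", "path": "llm", "result": "retail"}',
]
print(f"{agreement_rate(log):.0%}")  # 50%
```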

### Real-time Status Bar

During hybrid processing, a real-time status bar shows the current step:

```
⠋ Classifying with Keyword Taxonomy...
⠋ Classifying with Structural Template...
⠋ Classifiers agree - accepting 'finance'
⠋ Classifiers disagree - invoking LLM...
⠋ LLM responded in 1234ms - classified as 'technology'
```

### Supported Providers

| Provider | Package | Default Model | API Key Required |
|---|---|---|---|
| Ollama | `langchain-ollama` | llama3.2 | No (local) |
| Google | `langchain-google-genai` | gemini-2.0-flash | Yes |
| Mistral | `langchain-mistralai` | mistral-large-latest | Yes |
| Groq | `langchain-groq` | llama-3.3-70b-versatile | Yes |
| OpenRouter | `langchain-openai` | (specify) | Yes |
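A provider dispatch along these lines could sit behind the `LLM_PROVIDER` setting. This is a hedged sketch, not the library's actual code: it pairs each provider with the standard LangChain chat class from the table above, with imports deferred so only the installed extra is needed (the OpenRouter `base_url` follows the common ChatOpenAI-compatible pattern):

```python
def make_llm(provider: str, model: str):
    """Instantiate a LangChain chat model for the configured provider.

    Illustrative factory, not email_classifier's real API. Imports are
    deferred so uninstalled providers don't break the others.
    """
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model=model)
    if provider == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model)
    if provider == "mistral":
        from langchain_mistralai import ChatMistralAI
        return ChatMistralAI(model=model)
    if provider == "groq":
        from langchain_groq import ChatGroq
        return ChatGroq(model=model)
    if provider == "openrouter":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model, base_url="https://openrouter.ai/api/v1")
    raise ValueError(f"unknown LLM provider: {provider}")
```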

### Method Weights

When LLM is enabled, the classification weights are:

- Method 1 (Keyword Taxonomy): 35%
- Method 2 (Structural Template): 25%
- Method 3 (LLM): 40%

Weights can be customized in the `.env` file:

```bash
LLM_WEIGHT=0.40
KEYWORD_WEIGHT=0.35
STRUCTURAL_WEIGHT=0.25
```
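One way the three per-method votes might be combined under these weights is a weighted vote (a sketch under that assumption; the library's actual scoring code may differ):

```python
from collections import defaultdict

WEIGHTS = {"keyword": 0.35, "structural": 0.25, "llm": 0.40}

def weighted_vote(votes: dict) -> str:
    """Combine per-method domain predictions into a final label.

    `votes` maps method name -> predicted domain; each method adds its
    configured weight to the score of the domain it chose.
    """
    scores = defaultdict(float)
    for method, domain in votes.items():
        scores[domain] += WEIGHTS[method]
    return max(scores, key=scores.get)

# Keyword + structural outvote the LLM (0.35 + 0.25 = 0.60 > 0.40):
final = weighted_vote({"keyword": "finance", "structural": "finance", "llm": "retail"})
print(final)  # finance
```

With these defaults, the LLM alone (0.40) cannot override the two classic methods when they agree (0.60), but it decides any split between them.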

### Migration Note

Breaking change for `--use-llm` users: Previously, `--use-llm` invoked the LLM for every email. Now it uses hybrid mode by default (LLM only on classifier disagreement). To restore the previous behavior, add `--force-llm`:

```bash
# Old behavior (LLM for every email)
email-cli data.csv -o output/ --use-llm --force-llm

# New default (hybrid mode)
email-cli data.csv -o output/ --use-llm
```

## Documentation

Full documentation is available in the `docs/` directory.

## License

Apache License 2.0 - see the LICENSE file for details.

## Contact

Built by Montimage Security Research
