A Python library for classifying emails by domain using dual-method validation, with optional LLM-based semantic classification. Designed to process large datasets efficiently via streaming.
- Analyze your dataset before classification to understand label distribution, body lengths, and data quality
- Real-time progress tracking with configuration display
- Domain distribution table showing the classification breakdown
- Detailed results with validation stats, performance metrics, and output files
```
# Install
git clone git@github.com:montimage/email-domain-classifier.git && cd email-domain-classifier
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Analyze dataset
email-cli info sample_emails.csv

# Classify emails
email-cli sample_emails.csv -o output/

# Classify with LLM (requires additional setup - see LLM Classification section)
email-cli sample_emails.csv -o output/ --use-llm
```

| Domain | Description |
|---|---|
| Finance | Banking, payments, financial services |
| Technology | Software, hardware, IT services |
| Retail | E-commerce, shopping, consumer goods |
| Logistics | Shipping, supply chain, transportation |
| Healthcare | Medical services, health insurance |
| Government | Public sector, regulatory agencies |
| HR | Human resources, recruitment |
| Telecommunications | Phone, internet, communication services |
| Social Media | Social platforms, networking services |
| Education | Schools, universities, learning platforms |
| Module | Description |
|---|---|
| `email_classifier/analyzer.py` | Dataset analysis (`info` command) |
| `email_classifier/classifier.py` | Core classification logic |
| `email_classifier/cli.py` | Command-line interface |
| `email_classifier/domains.py` | Domain definitions and profiles |
| `email_classifier/llm/` | LLM-based classification (optional) |
| `email_classifier/processor.py` | CSV streaming processor |
| `email_classifier/reporter.py` | Report generation |
| `email_classifier/ui.py` | Terminal UI components |
| `email_classifier/validator.py` | Email validation |
| `tests/` | Test suite |
| `docs/` | Documentation |
| `raw-data/` | Sample input datasets (Git LFS) |
| `classified-data/` | Example classification outputs (Git LFS) |
The repository includes sample datasets for testing and reference:
Input (raw-data/):
- `CEAS_08.csv` - CEAS 2008 email dataset (~39K emails, 68MB)
- `sample_emails.csv` - Small sample for quick testing (100 emails)
Output (classified-data/ceas_08/):
- `email_[domain].csv` - Emails classified by domain (finance, technology, retail, etc.)
- `email_unsure.csv` - Emails that couldn't be confidently classified
- `invalid_emails.csv` - Emails that failed validation
- `skipped_emails.csv` - Emails filtered by body length
- `classification_report.json` - Detailed statistics
- `classification_report.txt` - Human-readable summary
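After a run, you can tally the per-domain output files programmatically. The sketch below assumes only the `email_*.csv` naming shown above; `summarize_outputs` is an illustrative helper, not part of the library's API:

```python
import csv
from pathlib import Path

def summarize_outputs(output_dir: str) -> dict[str, int]:
    """Count classified emails in each email_*.csv (rows minus header)."""
    counts = {}
    for path in sorted(Path(output_dir).glob("email_*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            # Subtract 1 for the header row; never go below 0 for empty files.
            n = sum(1 for _ in csv.reader(f)) - 1
            counts[path.stem.removeprefix("email_")] = max(n, 0)
    return counts
```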
```
# Try with sample data
email-cli info raw-data/sample_emails.csv
email-cli raw-data/sample_emails.csv -o output/

# Or use the full CEAS dataset
email-cli raw-data/CEAS_08.csv -o classified-data/my_output/
```

Note: Large CSV files are stored with Git LFS. Run `git lfs pull` after cloning to download them.
The classifier supports an optional LLM-based Method 3 that uses semantic analysis for improved classification accuracy. This method complements the existing keyword taxonomy and structural template methods.
Install with LLM support for your preferred provider:
```
# For Ollama (local, free, no API key required)
pip install -e ".[ollama]"

# For Google Gemini
pip install -e ".[google]"

# For Mistral AI
pip install -e ".[mistral]"

# For Groq (fast inference)
pip install -e ".[groq]"

# For OpenRouter (access to multiple models)
pip install -e ".[openrouter]"

# Install all providers
pip install -e ".[all-llm]"
```

1. Copy the example configuration:

   ```
   cp .env.example .env
   ```

2. Edit `.env` with your settings:

   ```
   # For Ollama (local)
   LLM_PROVIDER=ollama
   LLM_MODEL=llama3.2

   # For Google Gemini
   LLM_PROVIDER=google
   GOOGLE_API_KEY=your-api-key

   # For other providers, set the appropriate API key
   ```
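The `.env` format above is plain `KEY=VALUE` lines with `#` comments. If you want to see how such a file resolves into settings (the project itself may use a library such as python-dotenv for this), a minimal stdlib sketch of the same parsing rules, with an illustrative `load_env` helper, looks like:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#'
    comments are skipped. Existing env vars are not overwritten."""
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```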
```
# Enable LLM classification (hybrid mode - default)
email-cli sample_emails.csv -o output/ --use-llm

# Force LLM for every email (original three-method behavior)
email-cli sample_emails.csv -o output/ --use-llm --force-llm
```

When you enable LLM classification with `--use-llm`, the classifier uses a hybrid workflow by default:
1. Run both classic classifiers on each email:
   - Method 1: Keyword Taxonomy Matching
   - Method 2: Structural Template Matching
2. Check for agreement:
   - If both classifiers agree on the domain → accept that result and skip the LLM (saves API costs/time)
   - If they disagree → invoke the LLM to determine the final classification
This hybrid approach significantly reduces LLM API calls while maintaining classification accuracy. In typical datasets, 60-80% of emails can be classified without LLM involvement.
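The agreement check above can be sketched as a small dispatch function. This is an illustrative outline of the decision logic, not the library's implementation; `classify_hybrid` and its parameters are hypothetical names:

```python
from typing import Callable

def classify_hybrid(
    email: str,
    keyword_classify: Callable[[str], str],
    structural_classify: Callable[[str], str],
    llm_classify: Callable[[str], str],
    force_llm: bool = False,
) -> tuple[str, str]:
    """Return (domain, path): 'classic_only' when both classic
    methods agree and the LLM is skipped, 'llm' otherwise."""
    kw = keyword_classify(email)
    st = structural_classify(email)
    if kw == st and not force_llm:
        return kw, "classic_only"  # agreement: no LLM call needed
    return llm_classify(email), "llm"
```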
| Command | Mode | Description |
|---|---|---|
| `email-cli data.csv -o out/` | Dual-method | Keyword + Structural only (no LLM) |
| `email-cli data.csv -o out/ --use-llm` | Hybrid | LLM only when classifiers disagree |
| `email-cli data.csv -o out/ --use-llm --force-llm` | Force LLM | LLM for every email (three-method) |
When using hybrid mode, additional output is generated:
- `hybrid_workflow.jsonl` - Structured JSON Lines log of every workflow step:

  ```
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "keyword_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "structural_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "agreement_check", "path": "classic_only", "result": "finance"}
  ```

- Hybrid statistics in reports:

  ```
  HYBRID WORKFLOW STATISTICS
  ────────────────────────────────────────
  Total Processed (Hybrid):  1,000
  Classic Agreement Count:     750
  Agreement Rate:            75.0%
  LLM Calls Made:              250
  LLM Savings:               75.0%
  LLM Avg Response Time:   1,234ms
  ```
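Because the JSONL log is structured, you can recompute the agreement rate yourself. The sketch below assumes only the `step` and `path` fields shown in the log sample; `agreement_stats` is an illustrative helper, not part of the library:

```python
import json

def agreement_stats(jsonl_path: str) -> dict[str, float]:
    """Derive agreement rate from a hybrid_workflow.jsonl log by
    counting 'agreement_check' entries whose path is 'classic_only'."""
    total = classic = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            entry = json.loads(line)
            if entry.get("step") == "agreement_check":
                total += 1
                if entry.get("path") == "classic_only":
                    classic += 1
    rate = classic / total if total else 0.0
    # LLM savings equals the agreement rate: agreed emails skip the LLM.
    return {"total": total, "agreement_rate": rate, "llm_savings": rate}
```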
During hybrid processing, a real-time status bar shows the current step:
```
⠋ Classifying with Keyword Taxonomy...
⠋ Classifying with Structural Template...
⠋ Classifiers agree - accepting 'finance'
⠋ Classifiers disagree - invoking LLM...
⠋ LLM responded in 1234ms - classified as 'technology'
```
| Provider | Package | Default Model | API Key Required |
|---|---|---|---|
| Ollama | `langchain-ollama` | `llama3.2` | No (local) |
| Google Gemini | `langchain-google-genai` | `gemini-2.0-flash` | Yes |
| Mistral | `langchain-mistralai` | `mistral-large-latest` | Yes |
| Groq | `langchain-groq` | `llama-3.3-70b-versatile` | Yes |
| OpenRouter | `langchain-openai` | (specify) | Yes |
When LLM is enabled, the classification weights are:
- Method 1 (Keyword Taxonomy): 35%
- Method 2 (Structural Template): 25%
- Method 3 (LLM): 40%
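To see how these weights combine, here is a weighted-vote sketch using the default values above. The combination rule (summing weights per predicted domain and taking the maximum) is illustrative, not necessarily the library's exact algorithm, and `combine_votes` is a hypothetical name:

```python
# Default weights from the list above.
WEIGHTS = {"keyword": 0.35, "structural": 0.25, "llm": 0.40}

def combine_votes(votes: dict[str, str]) -> str:
    """votes maps method name -> predicted domain; returns the
    domain with the highest total weight."""
    scores: dict[str, float] = {}
    for method, domain in votes.items():
        scores[domain] = scores.get(domain, 0.0) + WEIGHTS[method]
    return max(scores, key=scores.get)
```

For example, if the keyword method says `finance` (0.35) but the structural and LLM methods both say `retail` (0.25 + 0.40 = 0.65), `retail` wins.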
Weights can be customized in the `.env` file:

```
LLM_WEIGHT=0.40
KEYWORD_WEIGHT=0.35
STRUCTURAL_WEIGHT=0.25
```

Breaking change for `--use-llm` users: Previously, `--use-llm` invoked the LLM for every email. Now it uses hybrid mode by default (LLM only on classifier disagreement). To restore the previous behavior, add `--force-llm`:
```
# Old behavior (LLM for every email)
email-cli data.csv -o output/ --use-llm --force-llm

# New default (hybrid mode)
email-cli data.csv -o output/ --use-llm
```

Full documentation is available in `docs/`:
- Installation Guide
- User Guide
- API Reference
- Architecture
- Development Playbook
- Deployment Playbook
- Troubleshooting
Apache License 2.0 - see LICENSE file for details.
Built by Montimage Security Research
- GitHub: montimage/email-domain-classifier
- Issues: Issue Tracker
- Email: developer@montimage.com



