A Python library for classifying emails by domain using dual-method validation, with optional LLM-based semantic classification. Designed to process large datasets efficiently via streaming.
- Analyze your dataset before classification to understand label distribution, body lengths, and data quality
- Real-time progress tracking with configuration display
- Domain distribution table showing the classification breakdown
- Detailed results with validation stats, performance metrics, and output files
```
# Install
git clone git@github.com:montimage/email-domain-classifier.git && cd email-domain-classifier
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Analyze dataset
email-cli info sample_emails.csv

# Classify emails
email-cli sample_emails.csv -o output/

# Classify with LLM (requires additional setup - see LLM Classification section)
email-cli sample_emails.csv -o output/ --use-llm
```

| Domain | Description |
|---|---|
| Finance | Banking, payments, financial services |
| Technology | Software, hardware, IT services |
| Retail | E-commerce, shopping, consumer goods |
| Logistics | Shipping, supply chain, transportation |
| Healthcare | Medical services, health insurance |
| Government | Public sector, regulatory agencies |
| HR | Human resources, recruitment |
| Telecommunications | Phone, internet, communication services |
| Social Media | Social platforms, networking services |
| Education | Schools, universities, learning platforms |
| Module | Description |
|---|---|
| `email_classifier/analyzer.py` | Dataset analysis (`info` command) |
| `email_classifier/classifier.py` | Core classification logic |
| `email_classifier/cli.py` | Command-line interface |
| `email_classifier/domains.py` | Domain definitions and profiles |
| `email_classifier/llm/` | LLM-based classification (optional) |
| `email_classifier/processor.py` | CSV streaming processor |
| `email_classifier/reporter.py` | Report generation |
| `email_classifier/ui.py` | Terminal UI components |
| `email_classifier/validator.py` | Email validation |
| `tests/` | Test suite |
| `docs/` | Documentation |
| `raw-data/` | Sample input datasets (Git LFS) |
| `classified-data/` | Example classification outputs (Git LFS) |
The repository includes sample datasets for testing and reference:
Input (raw-data/):
- `CEAS_08.csv` - CEAS 2008 email dataset (~39K emails, 68MB)
- `sample_emails.csv` - Small sample for quick testing (100 emails)
Output (classified-data/ceas_08/):
- `email_[domain].csv` - Emails classified by domain (finance, technology, retail, etc.)
- `email_unsure.csv` - Emails that couldn't be confidently classified
- `invalid_emails.csv` - Emails that failed validation
- `skipped_emails.csv` - Emails filtered by body length
- `classification_report.json` - Detailed statistics
- `classification_report.txt` - Human-readable summary
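After a run, you can tally the per-domain output files programmatically. The sketch below assumes only the `email_*.csv` naming shown above; `summarize_outputs` is an illustrative helper, not part of the library's API:

```python
import csv
from pathlib import Path

def summarize_outputs(output_dir: str) -> dict[str, int]:
    """Count classified emails in each email_*.csv (rows minus header)."""
    counts = {}
    for path in sorted(Path(output_dir).glob("email_*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            # Subtract 1 for the header row; never go below 0 for empty files.
            n = sum(1 for _ in csv.reader(f)) - 1
            counts[path.stem.removeprefix("email_")] = max(n, 0)
    return counts
```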
```
# Try with sample data
email-cli info raw-data/sample_emails.csv
email-cli raw-data/sample_emails.csv -o output/

# Or use the full CEAS dataset
email-cli raw-data/CEAS_08.csv -o classified-data/my_output/
```

Note: Large CSV files are stored with Git LFS. Run `git lfs pull` after cloning to download them.
The classifier supports an optional LLM-based Method 3 that uses semantic analysis for improved classification accuracy. This method complements the existing keyword taxonomy and structural template methods.
Install with LLM support for your preferred provider:
```
# For Ollama (local, free, no API key required)
pip install -e ".[ollama]"

# For Google Gemini
pip install -e ".[google]"

# For Mistral AI
pip install -e ".[mistral]"

# For Groq (fast inference)
pip install -e ".[groq]"

# For OpenRouter (access to multiple models)
pip install -e ".[openrouter]"

# Install all providers
pip install -e ".[all-llm]"
```

1. Copy the example configuration:

   ```
   cp .env.example .env
   ```

2. Edit `.env` with your settings:

   ```
   # For Ollama (local)
   LLM_PROVIDER=ollama
   LLM_MODEL=llama3.2

   # For Google Gemini
   LLM_PROVIDER=google
   GOOGLE_API_KEY=your-api-key

   # For other providers, set the appropriate API key
   ```
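The `.env` format above is plain `KEY=VALUE` lines with `#` comments. If you want to see how such a file resolves into settings (the project itself may use a library such as python-dotenv for this), a minimal stdlib sketch of the same parsing rules, with an illustrative `load_env` helper, looks like:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#'
    comments are skipped. Existing env vars are not overwritten."""
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```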
```
# Enable LLM classification (hybrid mode - default)
email-cli sample_emails.csv -o output/ --use-llm

# Force LLM for every email (original three-method behavior)
email-cli sample_emails.csv -o output/ --use-llm --force-llm
```

When you enable LLM classification with `--use-llm`, the classifier uses a hybrid workflow by default:
1. Run both classic classifiers on each email:
   - Method 1: Keyword Taxonomy Matching
   - Method 2: Structural Template Matching
2. Check for agreement:
   - If both classifiers agree on the domain → accept that result and skip the LLM (saves API costs/time)
   - If they disagree → invoke the LLM to determine the final classification
This hybrid approach significantly reduces LLM API calls while maintaining classification accuracy. In typical datasets, 60-80% of emails can be classified without LLM involvement.
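The agreement check above can be sketched as a small dispatch function. This is an illustrative outline of the decision logic, not the library's implementation; `classify_hybrid` and its parameters are hypothetical names:

```python
from typing import Callable

def classify_hybrid(
    email: str,
    keyword_classify: Callable[[str], str],
    structural_classify: Callable[[str], str],
    llm_classify: Callable[[str], str],
    force_llm: bool = False,
) -> tuple[str, str]:
    """Return (domain, path): 'classic_only' when both classic
    methods agree and the LLM is skipped, 'llm' otherwise."""
    kw = keyword_classify(email)
    st = structural_classify(email)
    if kw == st and not force_llm:
        return kw, "classic_only"  # agreement: no LLM call needed
    return llm_classify(email), "llm"
```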
| Command | Mode | Description |
|---|---|---|
| `email-cli data.csv -o out/` | Dual-method | Keyword + Structural only (no LLM) |
| `email-cli data.csv -o out/ --use-llm` | Hybrid | LLM only when classifiers disagree |
| `email-cli data.csv -o out/ --use-llm --force-llm` | Force LLM | LLM for every email (three-method) |
When using hybrid mode, additional output is generated:
- `hybrid_workflow.jsonl` - Structured JSON Lines log of every workflow step:

  ```
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "keyword_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "structural_classify", "result": "finance"}
  {"timestamp": "2024-01-15T10:30:00", "email_idx": 0, "step": "agreement_check", "path": "classic_only", "result": "finance"}
  ```

- Hybrid statistics in reports:

  ```
  HYBRID WORKFLOW STATISTICS
  ────────────────────────────────────────
  Total Processed (Hybrid):  1,000
  Classic Agreement Count:     750
  Agreement Rate:            75.0%
  LLM Calls Made:              250
  LLM Savings:               75.0%
  LLM Avg Response Time:   1,234ms
  ```
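Because the JSONL log is structured, you can recompute the agreement rate yourself. The sketch below assumes only the `step` and `path` fields shown in the log sample; `agreement_stats` is an illustrative helper, not part of the library:

```python
import json

def agreement_stats(jsonl_path: str) -> dict[str, float]:
    """Derive agreement rate from a hybrid_workflow.jsonl log by
    counting 'agreement_check' entries whose path is 'classic_only'."""
    total = classic = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            entry = json.loads(line)
            if entry.get("step") == "agreement_check":
                total += 1
                if entry.get("path") == "classic_only":
                    classic += 1
    rate = classic / total if total else 0.0
    # LLM savings equals the agreement rate: agreed emails skip the LLM.
    return {"total": total, "agreement_rate": rate, "llm_savings": rate}
```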
During hybrid processing, a real-time status bar shows the current step:
```
⠋ Classifying with Keyword Taxonomy...
⠋ Classifying with Structural Template...
⠋ Classifiers agree - accepting 'finance'
⠋ Classifiers disagree - invoking LLM...
⠋ LLM responded in 1234ms - classified as 'technology'
```
| Provider | Package | Default Model | API Key Required |
|---|---|---|---|
| Ollama | `langchain-ollama` | `llama3.2` | No (local) |
| Google Gemini | `langchain-google-genai` | `gemini-2.0-flash` | Yes |
| Mistral | `langchain-mistralai` | `mistral-large-latest` | Yes |
| Groq | `langchain-groq` | `llama-3.3-70b-versatile` | Yes |
| OpenRouter | `langchain-openai` | (specify) | Yes |
When LLM is enabled, the classification weights are:
- Method 1 (Keyword Taxonomy): 35%
- Method 2 (Structural Template): 25%
- Method 3 (LLM): 40%
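To see how these weights combine, here is a weighted-vote sketch using the default values above. The combination rule (summing weights per predicted domain and taking the maximum) is illustrative, not necessarily the library's exact algorithm, and `combine_votes` is a hypothetical name:

```python
# Default weights from the list above.
WEIGHTS = {"keyword": 0.35, "structural": 0.25, "llm": 0.40}

def combine_votes(votes: dict[str, str]) -> str:
    """votes maps method name -> predicted domain; returns the
    domain with the highest total weight."""
    scores: dict[str, float] = {}
    for method, domain in votes.items():
        scores[domain] = scores.get(domain, 0.0) + WEIGHTS[method]
    return max(scores, key=scores.get)
```

For example, if the keyword method says `finance` (0.35) but the structural and LLM methods both say `retail` (0.25 + 0.40 = 0.65), `retail` wins.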
Weights can be customized in the `.env` file:

```
LLM_WEIGHT=0.40
KEYWORD_WEIGHT=0.35
STRUCTURAL_WEIGHT=0.25
```

Breaking change for `--use-llm` users: Previously, `--use-llm` invoked the LLM for every email. Now it uses hybrid mode by default (LLM only on classifier disagreement). To restore the previous behavior, add `--force-llm`:
```
# Old behavior (LLM for every email)
email-cli data.csv -o output/ --use-llm --force-llm

# New default (hybrid mode)
email-cli data.csv -o output/ --use-llm
```

Full documentation is available in `docs/`:
- Installation Guide
- User Guide
- API Reference
- Architecture
- Development Playbook
- Deployment Playbook
- Troubleshooting
Apache License 2.0 - see LICENSE file for details.
Built by Montimage Security Research
- GitHub: montimage/email-domain-classifier
- Issues: Issue Tracker
- Email: developer@montimage.com



