📧 Gmail Invoice & Receipt Extractor v2.0

Automatically extract invoices and receipts from Gmail using keyword-based analysis of email subjects and bodies

v2.0 Major Release: Classifies emails by analyzing subject and body text (not attachment content) using EXACT keyword matching, then saves all attachments when a match is found. Focused exclusively on invoices and receipts. Auto-detects the All Mail folder (works with any Gmail language) and matches five keywords: invoice, receipt, fatura, factura, recibo.

✨ Features

v2.0 New Capabilities

📧 Email-Based Classification - Analyzes email subject + body (not attachment content)
🎯 EXACT Match Only - Requires exact keywords (no approximations or fuzzy matching)
🌍 Auto-Detect All Mail - Automatically finds All Mail folder in any Gmail language (EN/PT/ES/DE/FR/IT)
⚡ 3x Faster Processing - No PDF text extraction, no NLP, no pattern matching
📎 Batch Attachment Saving - Saves ALL attachments when email matches
🔑 5 Exact Keywords - invoice, receipt, fatura, factura, recibo (case-insensitive, whole word)
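One way language-independent All Mail detection can work is to look for the IMAP special-use flag \All instead of the localized folder name. The sketch below is an illustration under that assumption, not the project's actual code; the helper name find_all_mail and the sample folder names are hypothetical.

```python
import re

# Matches one IMAP LIST response line: (flags) "delimiter" "mailbox name"
LIST_LINE = re.compile(rb'\((?P<flags>[^)]*)\)\s+"(?P<delim>[^"]*)"\s+"?(?P<name>.*?)"?$')


def find_all_mail(list_lines):
    """Return the mailbox carrying the \\All flag, or None if no line has it."""
    for raw in list_lines:
        match = LIST_LINE.match(raw.strip())
        if not match:
            continue
        flags = [flag.lower() for flag in match.group("flags").split()]
        if rb"\all" in flags:
            # Non-ASCII names arrive in IMAP modified UTF-7; this sketch does not decode them.
            return match.group("name").decode("utf-8", errors="replace")
    return None


# Illustrative lines in the shape an IMAP LIST command returns.
sample = [
    rb'(\HasNoChildren) "/" "INBOX"',
    rb'(\HasChildren \Noselect) "/" "[Gmail]"',
    rb'(\All \HasNoChildren) "/" "[Gmail]/Todo o correio"',
]
print(find_all_mail(sample))  # -> [Gmail]/Todo o correio
```

In a live session the lines would come from the second element of the tuple returned by imaplib's list() call. Matching on the flag rather than the name is what lets the same check work for any interface language.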

Core Features

🔄 Smart Deduplication - SHA256 hashing prevents duplicate file saves
📁 Flexible Organization - Multiple output structures (by type, by date, flat)
🔌 Resume Capability - Continue from where you left off with checkpoint system
🌍 Multi-Language - Supports Portuguese, English, Spanish
⚡ Rate Limit Protection - Adaptive delays prevent Gmail API throttling
🐳 Docker Ready - Complete Docker and Docker Compose support
🔒 Secure - App passwords only, credentials never stored in code
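The SHA256 deduplication listed above can be pictured roughly as follows. This is a minimal sketch, assuming the hashes of saved files are kept in an in-memory set; the function name save_if_new is hypothetical and the project may persist hashes differently (for example in metadata.json).

```python
import hashlib
from pathlib import Path


def save_if_new(payload: bytes, target: Path, seen_hashes: set) -> bool:
    """Write payload to target unless identical bytes were already saved."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen_hashes:
        return False  # same content seen before, skip the write
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    seen_hashes.add(digest)
    return True
```

Because the hash covers file content rather than the filename, the same PDF forwarded twice under different names is stored only once.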

🚀 Quick Start

Option 1: Docker (Recommended)

# Clone repository
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Create .env file
cp .env.example .env
# Edit .env with your Gmail credentials

# Run with Docker Compose
docker-compose run --rm gmail-scraper --interactive

See DOCKER.md for complete Docker documentation.

Option 2: Local Installation

# Clone repository
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Automated setup (Unix/Linux/macOS)
./setup.sh

# Or Windows PowerShell
.\setup.ps1

First Run

# Interactive mode
python main.py --interactive

Example:

Gmail email: your-email@gmail.com
Gmail App Password: xxxx-xxxx-xxxx-xxxx
Start date: 2024-01-01
End date: 2024-12-31
Folder: INBOX

📖 Documentation

Installation Guide - Detailed setup instructions
Docker Guide - Docker & Docker Compose setup
Quick Start - Get running in 5 minutes
Resume Functionality - Continue interrupted runs
Testing Guide - Running tests
Contributing - How to contribute

🎯 Usage Examples

# Interactive mode
python main.py --interactive

# Resume from checkpoint
python main.py --resume

# Specific date range
python main.py --start-date 2024-01-01 --end-date 2024-12-31

# Extract only invoices
python main.py --document-types invoices

# Docker interactive
docker-compose run --rm gmail-scraper --interactive

⚙️ Configuration

Environment Variables (.env)

GMAIL_EMAIL=your-email@gmail.com
GMAIL_APP_PASSWORD=your-app-password

Classification Rules (config/rules.yaml) - EXACT MATCH ONLY

invoices:
  display_name: "Invoices"
  keywords:
    - invoice    # EXACT word match (case-insensitive)
    - fatura     # Portuguese
    - factura    # Portuguese/Spanish alternative
  patterns: []   # Disabled - exact match only
  entities: []   # Disabled - exact match only

receipts:
  display_name: "Receipts"
  keywords:
    - receipt    # EXACT word match (case-insensitive)
    - recibo     # Portuguese
  patterns: []   # Disabled - exact match only
  entities: []   # Disabled - exact match only

Examples:

✅ "Invoice #123" → matches (contains exact word "invoice")
✅ "FATURA mensal" → matches (contains exact word "fatura")
❌ "invoicing system" → NO match ("invoicing" ≠ "invoice")
❌ "bill payment" → NO match ("bill" not in keyword list)
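The matching behaviour in these examples can be reproduced with word-boundary regular expressions. The sketch below is one possible implementation, not necessarily the one in src/; the RULES dictionary simply mirrors the keyword lists from config/rules.yaml, and the classify function is a hypothetical name.

```python
import re

# Mirrors the keyword lists from config/rules.yaml shown above.
RULES = {
    "invoices": ["invoice", "fatura", "factura"],
    "receipts": ["receipt", "recibo"],
}

# One compiled pattern per document type; \b enforces whole-word matches.
PATTERNS = {
    doc_type: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for doc_type, words in RULES.items()
}


def classify(text):
    """Return the first document type whose keyword appears as a whole word, else None."""
    for doc_type, pattern in PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None


for sample in ["Invoice #123", "FATURA mensal", "invoicing system", "bill payment"]:
    print(f"{sample!r} -> {classify(sample)}")
```

With this approach the first two samples classify as invoices and the last two return None. Note that a strict word boundary also rejects plurals such as "invoices", so any variant you need has to be listed as its own keyword.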

🔐 Gmail Setup

Enable IMAP in Gmail Settings
Enable 2-Factor Authentication
Generate App Password at Google Account Security
Add the generated App Password to your .env file as GMAIL_APP_PASSWORD
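To check the setup independently of the tool, a short script along these lines can confirm that the credentials work. It is a sketch that assumes GMAIL_EMAIL and GMAIL_APP_PASSWORD are already exported into the environment (Docker Compose or a dotenv loader would normally do that); the function names are hypothetical.

```python
import imaplib
import os


def load_credentials(env=os.environ):
    """Read the two variables from the environment; fail early if either is missing."""
    email_addr = env.get("GMAIL_EMAIL", "").strip()
    app_password = env.get("GMAIL_APP_PASSWORD", "").strip()
    if not email_addr or not app_password:
        raise RuntimeError("Set GMAIL_EMAIL and GMAIL_APP_PASSWORD first")
    return email_addr, app_password


def connect(email_addr, app_password):
    """Open an authenticated IMAP session; needs network access and IMAP enabled."""
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    conn.login(email_addr, app_password)  # raises imaplib.IMAP4.error on bad credentials
    return conn
```

Calling connect(*load_credentials()) and getting a connection back means IMAP access and the App Password are both configured; an imaplib.IMAP4.error at login usually points at one of the steps above.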

📊 Output Structure

output/
├── invoices/
│   ├── 2024-01/
│   │   ├── invoice_001.pdf
│   │   ├── invoice_002.pdf
│   │   └── scan_003.jpg      # All attachments from invoice emails
│   └── 2024-02/
│       └── invoice_004.pdf
├── receipts/
│   ├── 2024-01/
│   │   ├── receipt_001.pdf
│   │   └── receipt_002.png
│   └── 2024-02/
│       └── receipt_003.pdf
└── metadata.json              # All file records with classification info

reports/
├── report_20240101_120000.json
├── .checkpoint.json           # Resume progress tracking
└── .last_run.json             # Last run configuration
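The .checkpoint.json file is what makes resuming possible. A minimal sketch of reading and writing such a file is shown below; the atomic-rename approach and the state fields (for example a last processed message UID) are assumptions for illustration, not a description of the project's actual checkpoint format.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or an empty dict when starting fresh."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

Writing to a temporary file and renaming it means an interrupted run leaves either the previous checkpoint or the new one on disk, never a truncated mix of both.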

🆕 What's New in v2.0

Major Architectural Change

v2.0 introduces a fundamental shift in how documents are classified:

v1.0 Approach (Deprecated)

For each email with attachments:
  → Extract text from each attachment (PDF, DOCX)
  → Classify based on attachment content
  → Save attachment if classified

Problems with v1.0:

Slow (PDF text extraction for every file)
Failed on image attachments/scans
Missed documents with vague content but clear subjects

v2.0 Approach (Current)

For each email:
  → Classify based on email SUBJECT + BODY
  → If match found AND has attachments:
    → Save ALL attachments to classified folder
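The v2.0 flow above can be sketched with Python's standard email package. This is an illustration under stated assumptions rather than the project's code: the function name attachments_if_match is hypothetical, and the sketch checks all five keywords at once instead of sorting matches into invoices and receipts.

```python
import re
from email import message_from_bytes, policy

KEYWORDS = re.compile(r"\b(?:invoice|receipt|fatura|factura|recibo)\b", re.IGNORECASE)


def attachments_if_match(raw_message: bytes):
    """Return [(filename, bytes)] for every attachment when subject or body matches."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    subject = msg["subject"] or ""
    if not KEYWORDS.search(f"{subject}\n{body}"):
        return []  # no keyword in subject or body: nothing is saved
    return [
        (part.get_filename() or "unnamed", part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
```

The attachment bytes are never inspected, which is why images and scans are handled the same way as PDFs.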

Benefits of v2.0:

⚡ 3x faster - No PDF text extraction
🎯 More accurate - Email subjects usually clearly identify document type
📎 Handles all file types - Works with images, scans, any attachment
📧 Subject priority - Email subjects like "Invoice #2024-001" weighted 3x

Migration from v1.0

If you're upgrading from v1.0:

Rules file still applies - config/rules.yaml is now matched against email subject and body; the default v2.0 rules use keywords only, with patterns and entities left empty
Test first - Run with --dry-run to validate
Adjust threshold - Consider lowering confidence_threshold from 0.7 to 0.5
Clear output - Delete old output/ directory before first v2.0 run

⚠️ Limitations & Known Issues

Current Classification Limitations

This project uses email-based EXACT keyword matching with these limitations:

Strict Matching: Only finds emails with EXACT keywords (invoice, receipt, fatura, factura, recibo)
Language Support: Default keywords cover English, Portuguese, and Spanish only
Synonyms: Will NOT match synonyms like "bill", "payment slip", "nota fiscal"
Must be Exact: "invoicing" will NOT match "invoice" (whole word required)
Custom Keywords: Add to rules.yaml if you need additional exact keywords

🚀 Need Better Classification?

Enhanced LLM-Powered Solution Available!

I offer a premium add-on with advanced capabilities:

✅ LLM-Based Classification

95%+ accuracy using GPT-4/Claude
Understands context, not just patterns
Handles complex and multi-page documents
Works with scanned/OCR documents

✅ Advanced Document Analysis

Extract structured metadata (dates, amounts, parties, line items)
Multi-language support (30+ languages)
Custom document types without manual configuration
Confidence scoring with explanations

✅ CSV/Excel Export

Export extracted metadata to CSV/Excel format
Customizable fields and column mapping
Batch export capabilities
Direct integration with accounting software (QuickBooks, Xero, SAP)

✅ Production Support

Priority email support with SLA
Custom integrations and API endpoints
Training for your specific document types
Dedicated support channel

Interested? Contact me for pricing, demo, and trial access:

📧 Email: joao.fernandes@docdigitizer.com

Subject: "Gmail Scraper - LLM Add-on Interest"

Include in your email:

Current document volume (emails/month)
Document types you need to process
Required languages
Integration needs (if any)

🧪 Testing

# Run all tests
pytest tests/ -v

# Quick test (10 emails)
python test_quick.py

# Installation verification
python test_installation.py

🛠️ Troubleshooting

"Authentication failed"

Use App Password (not regular password)
Verify 2FA is enabled
Check IMAP is enabled

"Too many consecutive fetch failures"

Gmail rate limiting detected
Wait 15-30 minutes
Run python main.py --resume
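For context, throttling errors like this are usually softened by retrying with growing pauses. The sketch below shows plain exponential backoff as one way an adaptive delay can work; it is not taken from the project, and the function name and default values are hypothetical.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:  # narrow this to the IMAP or socket errors you expect
            if attempt == max_attempts:
                raise  # give up; the caller can checkpoint and exit
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Once the attempts are exhausted the error propagates, which corresponds to the point where waiting and resuming from the checkpoint is the better option.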

"spaCy model not found"

python -m spacy download pt_core_news_lg

See INSTALLATION.md for detailed troubleshooting.

🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

📝 License

MIT License - see LICENSE file.

🙏 Acknowledgments

spaCy - NLP library
pdfplumber - PDF extraction
Rich - Terminal UI
Click - CLI framework

📞 Support

Community Support (Free)

Issues: GitHub Issues
Discussions: GitHub Discussions

Commercial Support & Add-ons

📧 Email: joao.fernandes@docdigitizer.com

Services:

LLM-powered classification add-on (95%+ accuracy)
Advanced metadata extraction and CSV export
Custom integrations and API development
Training and consultation
Production deployment support

Made with ❤️ for document automation

⭐ Star this repository if you find it useful!

💬 Questions? Contact: joao.fernandes@docdigitizer.com
