Automatically extract and organize invoices from your Gmail inbox using AI. Combines spaCy NLP with regex patterns for accurate invoice detection. SHA256 deduplication, checkpoint/resume, multi-language, Docker support. Open source Python tool for accounting workflow automation.

DocDigitizer/gmail-doc-scrapper

📧 Gmail Invoice & Receipt Extractor v2.0

Automatically extract invoices and receipts from Gmail using AI-powered email analysis

Python 3.9+ | Version 2.0 | License: MIT | Docker | Code style: black

v2.0 Major Release: Classifies emails by analyzing subject and body text (not attachment content) using EXACT keyword matching, then saves all attachments when a match is found. Focused exclusively on invoices and receipts, with five keywords: invoice, receipt, fatura, factura, recibo. The All Mail folder is auto-detected, so it works with any Gmail interface language.

✨ Features

v2.0 New Capabilities

  • 📧 Email-Based Classification - Analyzes email subject + body (not attachment content)
  • 🎯 EXACT Match Only - Requires exact keywords (no approximations or fuzzy matching)
  • 🌍 Auto-Detect All Mail - Automatically finds All Mail folder in any Gmail language (EN/PT/ES/DE/FR/IT)
  • ⚡ 3x Faster Processing - No PDF text extraction, no NLP, no pattern matching
  • 📎 Batch Attachment Saving - Saves ALL attachments when email matches
  • 🔑 5 Exact Keywords - invoice, receipt, fatura, factura, recibo (case-insensitive, whole word)
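
The README does not show how the All Mail folder is located. One language-independent way, sketched here as an assumption rather than the project's actual code, is to look for the `\All` special-use flag (RFC 6154) in the IMAP LIST response instead of matching the localized folder name. The function name `find_all_mail_folder` is hypothetical.

```python
import re
from typing import Iterable, Optional

# A LIST response line looks like: (\HasNoChildren \All) "/" "[Gmail]/All Mail"
_LIST_LINE = re.compile(
    r'\((?P<flags>[^)]*)\)\s+"(?P<delim>[^"]*)"\s+"?(?P<name>.*?)"?$'
)


def find_all_mail_folder(list_lines: Iterable[bytes]) -> Optional[str]:
    """Return the mailbox carrying the \\All flag, or None if there is none.

    Hypothetical helper: it relies on the special-use flag rather than the
    folder name, so the Gmail interface language does not matter.
    """
    for raw in list_lines:
        line = raw.decode("ascii", errors="replace")
        match = _LIST_LINE.match(line)
        if match is None:
            continue  # skip lines in a shape this sketch does not handle
        flags = match.group("flags").split()
        if any(flag.lower() == "\\all" for flag in flags):
            return match.group("name")
    return None
```

With the standard library, the input would typically be the second element of the tuple returned by `imaplib.IMAP4_SSL.list()`. Folder names containing non-ASCII characters come back in IMAP's modified UTF-7 encoding and can be passed to `select()` unchanged.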

Core Features

  • 🔄 Smart Deduplication - SHA256 hashing prevents duplicate file saves
  • 📁 Flexible Organization - Multiple output structures (by type, by date, flat)
  • 🔌 Resume Capability - Continue from where you left off with checkpoint system
  • 🌍 Multi-Language - Supports Portuguese, English, Spanish
  • ⚡ Rate Limit Protection - Adaptive delays prevent Gmail API throttling
  • 🐳 Docker Ready - Complete Docker and Docker Compose support
  • 🔒 Secure - App passwords only, credentials never stored in code
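
As an illustration of the SHA256 deduplication listed above, here is a minimal sketch. It assumes the set of seen hashes is kept in memory; the project may persist hashes differently (for example in metadata.json), and `save_if_new` is a hypothetical name.

```python
import hashlib
from pathlib import Path
from typing import Optional, Set


def save_if_new(payload: bytes, target: Path, seen_hashes: Set[str]) -> Optional[Path]:
    """Write payload to target unless identical content was already saved.

    Returns the written path, or None when the content is a duplicate.
    """
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen_hashes:
        return None  # same bytes already stored, possibly under another name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    seen_hashes.add(digest)
    return target
```

Hashing the content rather than comparing filenames means the same PDF forwarded in several emails is stored once, even if each copy arrives with a different name.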

🚀 Quick Start

Option 1: Docker (Recommended)

# Clone repository
git clone https://github.com/DocDigitizer/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Create .env file
cp .env.example .env
# Edit .env with your Gmail credentials

# Run with Docker Compose
docker-compose run --rm gmail-scraper --interactive

See DOCKER.md for complete Docker documentation.

Option 2: Local Installation

# Clone repository
git clone https://github.com/DocDigitizer/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Automated setup (Unix/Linux/macOS)
./setup.sh

# Or Windows PowerShell
.\setup.ps1

First Run

# Interactive mode
python main.py --interactive

Example:

Gmail email: your-email@gmail.com
Gmail App Password: xxxx-xxxx-xxxx-xxxx
Start date: 2024-01-01
End date: 2024-12-31
Folder: INBOX

📖 Documentation

🎯 Usage Examples

# Interactive mode
python main.py --interactive

# Resume from checkpoint
python main.py --resume

# Specific date range
python main.py --start-date 2024-01-01 --end-date 2024-12-31

# Extract only invoices
python main.py --document-types invoices

# Docker interactive
docker-compose run --rm gmail-scraper --interactive
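
The `--resume` flag above depends on a checkpoint file. Its real format is not documented here, so the following is only a sketch that assumes the checkpoint stores the IDs of already-processed messages under a `processed_ids` key; both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Set


def load_checkpoint(path: Path) -> Set[str]:
    """Return the message IDs recorded so far (empty set on a first run)."""
    if not path.exists():
        return set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return set(data.get("processed_ids", []))


def save_checkpoint(path: Path, processed_ids: Set[str]) -> None:
    """Persist processed IDs, writing to a temp file first and then renaming."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(
        json.dumps({"processed_ids": sorted(processed_ids)}), encoding="utf-8"
    )
    tmp.replace(path)  # rename over the old file so a crash cannot truncate it
```

A resumed run would load the set, skip any message whose ID is already in it, and save again after each batch.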

⚙️ Configuration

Environment Variables (.env)

GMAIL_EMAIL=your-email@gmail.com
GMAIL_APP_PASSWORD=your-app-password

Classification Rules (config/rules.yaml) - EXACT MATCH ONLY

invoices:
  display_name: "Invoices"
  keywords:
    - invoice    # EXACT word match (case-insensitive)
    - fatura     # Portuguese
    - factura    # Portuguese/Spanish alternative
  patterns: []   # Disabled - exact match only
  entities: []   # Disabled - exact match only

receipts:
  display_name: "Receipts"
  keywords:
    - receipt    # EXACT word match (case-insensitive)
    - recibo     # Portuguese
  patterns: []   # Disabled - exact match only
  entities: []   # Disabled - exact match only

Examples:

  • ✅ "Invoice #123" → matches (contains exact word "invoice")
  • ✅ "FATURA mensal" → matches (contains exact word "fatura")
  • ❌ "invoicing system" → NO match ("invoicing" ≠ "invoice")
  • ❌ "bill payment" → NO match ("bill" not in keyword list)
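
The matching behavior in these examples can be reproduced with a word-boundary regular expression. This is a sketch, not the project's implementation: the `RULES` dictionary mirrors the keywords from rules.yaml by hand, and `classify` is a hypothetical function name.

```python
import re
from typing import Dict, List, Optional

# Hand-copied from the rules.yaml keywords shown above (assumption: the real
# tool loads them from the YAML file instead).
RULES: Dict[str, List[str]] = {
    "invoices": ["invoice", "fatura", "factura"],
    "receipts": ["receipt", "recibo"],
}


def classify(subject: str, body: str, rules: Dict[str, List[str]] = RULES) -> Optional[str]:
    """Return the first document type whose keyword appears as a whole word."""
    text = f"{subject}\n{body}"
    for doc_type, keywords in rules.items():
        # \b on both sides enforces whole-word matching; IGNORECASE handles
        # "Invoice", "FATURA", and so on.
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return doc_type
    return None
```

Under this rule the plural "invoices" does not match either, because the trailing "s" removes the word boundary after "invoice". When an email contains keywords for both types, this sketch returns whichever type is listed first, which is an arbitrary choice.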

πŸ” Gmail Setup

  1. Enable IMAP in Gmail Settings
  2. Enable 2-Factor Authentication
  3. Generate App Password at Google Account Security
  4. Add to .env file

📊 Output Structure

output/
├── invoices/
│   ├── 2024-01/
│   │   ├── invoice_001.pdf
│   │   ├── invoice_002.pdf
│   │   └── scan_003.jpg      # All attachments from invoice emails
│   └── 2024-02/
│       └── invoice_004.pdf
├── receipts/
│   ├── 2024-01/
│   │   ├── receipt_001.pdf
│   │   └── receipt_002.png
│   └── 2024-02/
│       └── receipt_003.pdf
└── metadata.json              # All file records with classification info

reports/
├── report_20240101_120000.json
├── .checkpoint.json           # Resume progress tracking
└── .last_run.json             # Last run configuration
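
The tree above corresponds to a type-then-month layout. A sketch of how such a path could be built follows. The layout names `by_type`, `by_date`, and `flat` are assumed labels for the three structures mentioned under Core Features, and `output_path` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path


def output_path(root: Path, doc_type: str, received: date, filename: str,
                layout: str = "by_type") -> Path:
    """Build the destination path for one attachment.

    Layout names are assumptions made for this sketch, not documented options.
    """
    month = received.strftime("%Y-%m")
    safe_name = Path(filename).name  # drop directory parts from untrusted names
    if layout == "by_type":
        return root / doc_type / month / safe_name
    if layout == "by_date":
        return root / month / safe_name
    if layout == "flat":
        return root / safe_name
    raise ValueError(f"unknown layout: {layout!r}")
```

Taking only the final path component of the attachment name keeps a crafted filename such as `../../x.pdf` from escaping the output directory.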

🆕 What's New in v2.0

Major Architectural Change

v2.0 introduces a fundamental shift in how documents are classified:

v1.0 Approach (Deprecated)

For each email with attachments:
  → Extract text from each attachment (PDF, DOCX)
  → Classify based on attachment content
  → Save attachment if classified

Problems with v1.0:

  • Slow (PDF text extraction for every file)
  • Failed on image attachments/scans
  • Missed documents with vague content but clear subjects

v2.0 Approach (Current)

For each email:
  → Classify based on email SUBJECT + BODY
  → If match found AND has attachments:
    → Save ALL attachments to classified folder
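
The v2.0 flow above can be sketched with Python's standard `email` package. This is an illustrative outline under stated assumptions, not the project's code: `classify_and_collect` and the `KEYWORDS` table are hypothetical, and the raw message bytes are assumed to come from an IMAP fetch.

```python
import re
from email import message_from_bytes, policy
from typing import List, Optional, Tuple

# Whole-word patterns built from the five documented keywords.
KEYWORDS = {
    "invoices": r"\b(?:invoice|fatura|factura)\b",
    "receipts": r"\b(?:receipt|recibo)\b",
}


def classify_and_collect(raw: bytes) -> Tuple[Optional[str], List[Tuple[str, bytes]]]:
    """Classify on subject + body only, then gather every attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    text = f"{msg.get('Subject', '')}\n{body}"
    doc_type = next(
        (name for name, pat in KEYWORDS.items() if re.search(pat, text, re.IGNORECASE)),
        None,
    )
    if doc_type is None:
        return None, []  # no keyword: attachments are never inspected
    attachments = [
        (part.get_filename() or "unnamed", part.get_payload(decode=True) or b"")
        for part in msg.iter_attachments()
    ]
    return doc_type, attachments
```

Attachment bytes are only decoded after the email has matched, which is where the time saved over per-file text extraction would come from. A caller would skip the email when the returned attachment list is empty.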

Benefits of v2.0:

  • ⚡ 3x faster - No PDF text extraction
  • 🎯 More accurate - Email subjects usually clearly identify document type
  • 📎 Handles all file types - Works with images, scans, any attachment
  • 📧 Subject priority - Email subjects like "Invoice #2024-001" weighted 3x

Migration from v1.0

If you're upgrading from v1.0:

  1. Rules still work - Patterns like "Invoice #\d+" match email subjects perfectly
  2. Test first - Run with --dry-run to validate
  3. Adjust threshold - Consider lowering confidence_threshold from 0.7 to 0.5
  4. Clear output - Delete old output/ directory before first v2.0 run

⚠️ Limitations & Known Issues

Current Classification Limitations

This project uses email-based EXACT keyword matching with these limitations:

  • Strict Matching: Only finds emails with EXACT keywords (invoice, receipt, fatura, factura, recibo)
  • Language Support: English, Portuguese, and Spanish keywords only (hardcoded defaults)
  • Synonyms: Will NOT match synonyms like "bill", "payment slip", "nota fiscal"
  • Must be Exact: "invoicing" will NOT match "invoice" (whole word required)
  • Custom Keywords: Add to rules.yaml if you need additional exact keywords

🚀 Need Better Classification?

Enhanced LLM-Powered Solution Available!

I offer a premium add-on with advanced capabilities:

✅ LLM-Based Classification

  • 95%+ accuracy using GPT-4/Claude
  • Understands context, not just patterns
  • Handles complex and multi-page documents
  • Works with scanned/OCR documents

✅ Advanced Document Analysis

  • Extract structured metadata (dates, amounts, parties, line items)
  • Multi-language support (30+ languages)
  • Custom document types without manual configuration
  • Confidence scoring with explanations

✅ CSV/Excel Export

  • Export extracted metadata to CSV/Excel format
  • Customizable fields and column mapping
  • Batch export capabilities
  • Direct integration with accounting software (QuickBooks, Xero, SAP)

✅ Production Support

  • Priority email support with SLA
  • Custom integrations and API endpoints
  • Training for your specific document types
  • Dedicated support channel

Interested? Contact me for pricing, demo, and trial access:

📧 Email: joao.fernandes@docdigitizer.com

Subject: "Gmail Scraper - LLM Add-on Interest"

Include in your email:

  • Current document volume (emails/month)
  • Document types you need to process
  • Required languages
  • Integration needs (if any)

🧪 Testing

# Run all tests
pytest tests/ -v

# Quick test (10 emails)
python test_quick.py

# Installation verification
python test_installation.py

🛠️ Troubleshooting

"Authentication failed"

  • Use App Password (not regular password)
  • Verify 2FA is enabled
  • Check IMAP is enabled

"Too many consecutive fetch failures"

  • Gmail rate limiting detected
  • Wait 15-30 minutes
  • Run python main.py --resume
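
The tool's own adaptive delay logic is not shown in this README. As a general illustration of how fetch failures caused by rate limiting can be absorbed before they become "too many consecutive failures", here is a sketch of retry with exponential backoff. Everything in it (`fetch_with_backoff`, its parameters, the default delays) is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(fetch: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), doubling the wait after each failure up to max_delay."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:  # a real client would catch IMAP/network errors only
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the behavior can be tested without real waiting.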

"spaCy model not found"

python -m spacy download pt_core_news_lg

See INSTALLATION.md for detailed troubleshooting.

🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

πŸ“ License

MIT License - see LICENSE file.

πŸ™ Acknowledgments

📞 Support

Community Support (Free)

Commercial Support & Add-ons

📧 Email: joao.fernandes@docdigitizer.com

Services:

  • LLM-powered classification add-on (95%+ accuracy)
  • Advanced metadata extraction and CSV export
  • Custom integrations and API development
  • Training and consultation
  • Production deployment support

Made with ❤️ for document automation

⭐ Star this repository if you find it useful!

💬 Questions? Contact: joao.fernandes@docdigitizer.com
