Automatically extract invoices and receipts from Gmail using AI-powered email analysis
v2.0 Major Release: Classifies emails by analyzing subject and body text (not attachment content) using EXACT keyword matching, then saves all attachments when a match is found. Focused exclusively on invoices and receipts, it auto-detects the All Mail folder (works with any Gmail language) and searches for the keywords invoice, receipt, fatura, factura, and recibo.
- Email-Based Classification - Analyzes email subject + body (not attachment content)
- EXACT Match Only - Requires exact keywords (no approximations or fuzzy matching)
- Auto-Detect All Mail - Automatically finds All Mail folder in any Gmail language (EN/PT/ES/DE/FR/IT)
- 3x Faster Processing - No PDF text extraction, no NLP, no pattern matching
- Batch Attachment Saving - Saves ALL attachments when email matches
- 5 Exact Keywords - invoice, receipt, fatura, factura, recibo (case-insensitive, whole word)
- Smart Deduplication - SHA256 hashing prevents duplicate file saves
- Flexible Organization - Multiple output structures (by type, by date, flat)
- Resume Capability - Continue from where you left off with checkpoint system
- Multi-Language - Supports Portuguese, English, Spanish
- Rate Limit Protection - Adaptive delays prevent Gmail API throttling
- Docker Ready - Complete Docker and Docker Compose support
- Secure - App passwords only, credentials never stored in code
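To illustrate how All Mail auto-detection can be language-independent, here is a minimal sketch. It assumes the folder is identified by the `\All` special-use flag that Gmail reports in its IMAP LIST response, rather than by its localized name. The function name `find_all_mail_folder` and the sample lines are hypothetical; the project's actual detection logic may differ.

```python
import re

# Matches one IMAP LIST response line: (flags) "delimiter" "mailbox name"
LIST_LINE = re.compile(
    r'\((?P<flags>[^)]*)\)\s+"(?P<delim>[^"]*)"\s+"?(?P<name>[^"]*)"?'
)


def find_all_mail_folder(list_lines):
    """Return the mailbox flagged \\All, or None if no line carries that flag."""
    for raw in list_lines:
        line = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        match = LIST_LINE.match(line)
        if match and "\\all" in match.group("flags").lower().split():
            return match.group("name")
    return None


# Sample lines shaped like the data returned by imaplib's IMAP4_SSL.list().
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
    b'(\\All \\HasNoChildren) "/" "[Gmail]/Todos os e-mails"',
]
print(find_all_mail_folder(sample))
```

Because the check looks at the flag and not the folder name, the same code would find "[Gmail]/All Mail" on an English account.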
# Clone repository
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
# Create .env file
cp .env.example .env
# Edit .env with your Gmail credentials
# Run with Docker Compose
docker-compose run --rm gmail-scraper --interactive
See DOCKER.md for complete Docker documentation.
# Clone repository
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
# Automated setup (Unix/Linux/macOS)
./setup.sh
# Or Windows PowerShell
.\setup.ps1
# Interactive mode
python main.py --interactive
Example:
Gmail email: your-email@gmail.com
Gmail App Password: xxxx-xxxx-xxxx-xxxx
Start date: 2024-01-01
End date: 2024-12-31
Folder: INBOX
- Installation Guide - Detailed setup instructions
- Docker Guide - Docker & Docker Compose setup
- Quick Start - Get running in 5 minutes
- Resume Functionality - Continue interrupted runs
- Testing Guide - Running tests
- Contributing - How to contribute
# Interactive mode
python main.py --interactive
# Resume from checkpoint
python main.py --resume
# Specific date range
python main.py --start-date 2024-01-01 --end-date 2024-12-31
# Extract only invoices
python main.py --document-types invoices
# Docker interactive
docker-compose run --rm gmail-scraper --interactive

Environment variables (.env):
GMAIL_EMAIL=your-email@gmail.com
GMAIL_APP_PASSWORD=your-app-password

Classification rules (rules.yaml):
invoices:
  display_name: "Invoices"
  keywords:
    - invoice  # EXACT word match (case-insensitive)
    - fatura   # Portuguese
    - factura  # Portuguese/Spanish alternative
  patterns: []  # Disabled - exact match only
  entities: []  # Disabled - exact match only
receipts:
  display_name: "Receipts"
  keywords:
    - receipt  # EXACT word match (case-insensitive)
    - recibo   # Portuguese
  patterns: []  # Disabled - exact match only
  entities: []  # Disabled - exact match only

Examples:
- "Invoice #123" → matches (contains exact word "invoice")
- "FATURA mensal" → matches (contains exact word "fatura")
- "invoicing system" → NO match ("invoicing" ≠ "invoice")
- "bill payment" → NO match ("bill" not in keyword list)
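The matching behaviour in these examples can be reproduced with a whole-word, case-insensitive regular expression. This is a minimal sketch, not the project's actual code: the `KEYWORDS` mapping and the `classify` function are hypothetical names, and only the keyword list itself comes from the rules above.

```python
import re

# Keywords taken from the rules above; the dictionary shape is an assumption.
KEYWORDS = {
    "invoices": ["invoice", "fatura", "factura"],
    "receipts": ["receipt", "recibo"],
}

# One compiled pattern per document type: \b enforces whole-word matches.
PATTERNS = {
    doc_type: re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    for doc_type, words in KEYWORDS.items()
}


def classify(text):
    """Return the first document type whose keyword appears as a whole word."""
    for doc_type, pattern in PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None


for subject in ["Invoice #123", "FATURA mensal", "invoicing system", "bill payment"]:
    print(subject, "->", classify(subject))
```

Note that in this sketch the plural "invoices" would not match either, because the word boundary after "invoice" is missing.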
- Enable IMAP in Gmail Settings
- Enable 2-Factor Authentication
- Generate App Password at Google Account Security
- Add to .env file
output/
├── invoices/
│   ├── 2024-01/
│   │   ├── invoice_001.pdf
│   │   ├── invoice_002.pdf
│   │   └── scan_003.jpg          # All attachments from invoice emails
│   └── 2024-02/
│       └── invoice_004.pdf
├── receipts/
│   ├── 2024-01/
│   │   ├── receipt_001.pdf
│   │   └── receipt_002.png
│   └── 2024-02/
│       └── receipt_003.pdf
└── metadata.json                 # All file records with classification info
reports/
├── report_20240101_120000.json
├── .checkpoint.json              # Resume progress tracking
└── .last_run.json                # Last run configuration
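As a rough illustration of how SHA256 deduplication and the type/date layout above could fit together, here is a minimal sketch. The function `save_attachment` and its parameters are hypothetical; the real project may track hashes differently (for example in metadata.json).

```python
import hashlib
import tempfile
from pathlib import Path


def save_attachment(data, filename, doc_type, year_month, seen_hashes, root="output"):
    """Write data to root/doc_type/year_month/filename unless its hash was seen."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return None  # identical content was already saved, so skip it
    target_dir = Path(root) / doc_type / year_month
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(data)
    seen_hashes.add(digest)
    return target


# Demo in a temporary directory: the second call carries the same bytes.
with tempfile.TemporaryDirectory() as tmp:
    seen = set()
    first = save_attachment(b"%PDF-demo", "invoice_001.pdf", "invoices", "2024-01", seen, root=tmp)
    second = save_attachment(b"%PDF-demo", "copy.pdf", "invoices", "2024-01", seen, root=tmp)
    print(first.name, second)
```

Hashing the content rather than the filename means a file forwarded twice under different names is stored once. The sketch does not handle two different files sharing one filename.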
v2.0 introduces a fundamental shift in how documents are classified:
v1.0 (attachment-based):
For each email with attachments:
  → Extract text from each attachment (PDF, DOCX)
  → Classify based on attachment content
  → Save attachment if classified
Problems with v1.0:
- Slow (PDF text extraction for every file)
- Failed on image attachments/scans
- Missed documents with vague content but clear subjects
v2.0 (email-based):
For each email:
  → Classify based on email SUBJECT + BODY
  → If match found AND has attachments:
    → Save ALL attachments to classified folder
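The email-based flow can be sketched with Python's standard `email` package. This is an illustrative example under assumptions, not the project's implementation: `attachments_to_save` is a hypothetical helper, and the message is built in memory instead of being fetched over IMAP.

```python
import re
from email.message import EmailMessage

KEYWORD = re.compile(r"\b(?:invoice|receipt|fatura|factura|recibo)\b", re.IGNORECASE)


def attachments_to_save(msg):
    """Return (filename, bytes) for every attachment if subject or body matches."""
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    text = f"{msg.get('Subject', '')}\n{body}"
    if not KEYWORD.search(text):
        return []  # no keyword in subject or body: nothing is saved
    return [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]


# Build a small message: keyword in the subject, one image attachment.
msg = EmailMessage()
msg["Subject"] = "Invoice #2024-001"
msg.set_content("Please find the scan attached.")
msg.add_attachment(b"\xff\xd8fake-jpeg", maintype="image", subtype="jpeg", filename="scan_003.jpg")

saved = attachments_to_save(msg)
print([name for name, _ in saved])
```

The attachment content is never inspected, which is why images and scans are handled the same way as PDFs. For messages fetched as raw bytes, `email.message_from_bytes(raw, policy=email.policy.default)` yields an object with the same methods.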
Benefits of v2.0:
- 3x faster - No PDF text extraction
- More accurate - Email subjects usually clearly identify document type
- Handles all file types - Works with images, scans, any attachment
- Subject priority - Email subjects like "Invoice #2024-001" weighted 3x
If you're upgrading from v1.0:
- Rules still work - Patterns like "Invoice #\d+" match email subjects perfectly
- Test first - Run with --dry-run to validate
- Adjust threshold - Consider lowering confidence_threshold from 0.7 to 0.5
- Clear output - Delete old output/ directory before first v2.0 run
This project uses email-based EXACT keyword matching with these limitations:
- Strict Matching: Only finds emails with EXACT keywords (invoice, receipt, fatura, factura, recibo)
- Language Support: English and Portuguese only (keywords hardcoded)
- Synonyms: Will NOT match synonyms like "bill", "payment slip", "nota fiscal"
- Must be Exact: "invoicing" will NOT match "invoice" (whole word required)
- Custom Keywords: Add to rules.yaml if you need additional exact keywords
Enhanced LLM-Powered Solution Available!
I offer a premium add-on with advanced capabilities:
LLM-Based Classification
- 95%+ accuracy using GPT-4/Claude
- Understands context, not just patterns
- Handles complex and multi-page documents
- Works with scanned/OCR documents
Advanced Document Analysis
- Extract structured metadata (dates, amounts, parties, line items)
- Multi-language support (30+ languages)
- Custom document types without manual configuration
- Confidence scoring with explanations
CSV/Excel Export
- Export extracted metadata to CSV/Excel format
- Customizable fields and column mapping
- Batch export capabilities
- Direct integration with accounting software (QuickBooks, Xero, SAP)
Production Support
- Priority email support with SLA
- Custom integrations and API endpoints
- Training for your specific document types
- Dedicated support channel
Interested? Contact me for pricing, demo, and trial access:
Email: joao.fernandes@docdigitizer.com
Subject: "Gmail Scraper - LLM Add-on Interest"
Include in your email:
- Current document volume (emails/month)
- Document types you need to process
- Required languages
- Integration needs (if any)
# Run all tests
pytest tests/ -v
# Quick test (10 emails)
python test_quick.py
# Installation verification
python test_installation.py
"Authentication failed"
- Use App Password (not regular password)
- Verify 2FA is enabled
- Check IMAP is enabled
"Too many consecutive fetch failures"
- Gmail rate limiting detected
- Wait 15-30 minutes
- Run python main.py --resume
"spaCy model not found"
python -m spacy download pt_core_news_lg
See INSTALLATION.md for detailed troubleshooting.
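For context on the rate-limiting errors above, adaptive delays are commonly implemented as a pause that grows after each failure and shrinks after each success. The sketch below shows that general idea only; the class `AdaptiveDelay`, its parameters, and its default values are hypothetical and are not taken from this project's code.

```python
class AdaptiveDelay:
    """Grow the pause after failures and shrink it again after successes."""

    def __init__(self, base=0.5, factor=2.0, max_delay=60.0):
        self.base = base            # shortest pause, in seconds
        self.factor = factor        # multiplier applied on each failure
        self.max_delay = max_delay  # upper bound for the pause
        self.current = base

    def on_failure(self):
        """Back off: multiply the pause, capped at max_delay."""
        self.current = min(self.current * self.factor, self.max_delay)
        return self.current

    def on_success(self):
        """Recover: divide the pause, never going below base."""
        self.current = max(self.current / self.factor, self.base)
        return self.current


delay = AdaptiveDelay()
print([delay.on_failure() for _ in range(3)])
print(delay.on_success())
```

A fetch loop would call `time.sleep(delay.current)` between requests and report each outcome through `on_failure()` or `on_success()`.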
Contributions are welcome! See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE file.
- spaCy - NLP library
- pdfplumber - PDF extraction
- Rich - Terminal UI
- Click - CLI framework
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Email: joao.fernandes@docdigitizer.com
Services:
- LLM-powered classification add-on (95%+ accuracy)
- Advanced metadata extraction and CSV export
- Custom integrations and API development
- Training and consultation
- Production deployment support
Made with ❤️ for document automation
Star this repository if you find it useful!
Questions? Contact: joao.fernandes@docdigitizer.com