Convert PDF bills to CSV format using Google's Gemini 2.5 Flash API for intelligent extraction and categorization of financial transactions.
- Extracts expense detail tables from multi-page PDF bills
- Outputs structured CSV with Date, Description, Payee, Amount, and Category columns
- Automatic payee/merchant extraction from transaction descriptions
- Multi-language support: Preserves original language in payee names (Chinese, Japanese, etc.)
- Smart date extraction: Prioritizes transaction date over posting date
- Hierarchical categorization support (e.g., "Food & Dining > Restaurants")
- Customizable categories via external markdown file
- Secure API key management via macOS Keychain or environment variables
- Validates and normalizes data (dates, amounts, descriptions, payees, categories)
- Isolates invalid rows in separate error file
- Optional metadata generation
- Debug mode for troubleshooting API responses
- Automatic retry logic for API resilience
git clone https://github.com/pyang2045/bill2csv.git
cd bill2csv
pip install -e .- Using macOS Keychain (Recommended):
security add-generic-password -a "bill2csv" -s "gemini-api" -w "YOUR_API_KEY" -U- Using Environment Variable:
export GEMINI_API_KEY="your_api_key"bill2csv invoice.pdf# Specify output directory
bill2csv invoice.pdf --outdir ./output
# Generate metadata file
bill2csv invoice.pdf --meta
# Use specific Keychain credentials
bill2csv invoice.pdf --keychain-service gemini-api --keychain-account bill2csv
# Strict mode (fail on any invalid rows)
bill2csv invoice.pdf --strict
# Quiet mode (suppress logs)
bill2csv invoice.pdf --quiet
# Use custom categories file
bill2csv invoice.pdf --categories ~/my_categories.md
# Debug mode (saves raw API response)
bill2csv invoice.pdf --debug- Date: DD-MM-YYYY format (e.g., 13-06-2018)
- Uses transaction date when available (when purchase occurred)
- Falls back to posting date if transaction date not found
- Description: Cleaned transaction description with symbols replaced by spaces
- Example:
WALMART#1234→WALMART 1234 - Example:
7-ELEVEN_STORE→7 ELEVEN STORE - Quoted if contains commas
- Example:
- Payee: Extracted merchant/vendor name from description
- Preserves original language/script from description
- Example:
WALMART#1234*STORE→Walmart - Example:
星巴克咖啡#12345→星巴克咖啡(Chinese) - Example:
セブンイレブン#1234→セブンイレブン(Japanese) - Example:
AMZ*MKTP US*2Y4T85TN2→Amazon Marketplace - Example:
STARBUCKS STORE #12345→Starbucks
- Amount: Decimal with sign (negative for charges, positive for credits)
- Category: Hierarchical categories with > separator
- Main level:
Food & Dining,Transportation,Shopping - Sub-categories:
Food & Dining > Restaurants,Transportation > Gas & Fuel
- Main level:
<filename>.csv- Main output with valid rows<filename>.errors.csv- Invalid rows (if any)<filename>.meta.json- Metadata (if --meta flag used)
- Python 3.9+
- macOS (for Keychain support, optional)
- Gemini API key
- Dependencies:
google-genai>=0.1.0,pypdf>=3.17.0
The tool uses hierarchical categories defined in expense_categories.md. The file is searched in this order:
- Custom file specified with
--categoriesflag - Current working directory:
./expense_categories.md - User config directory:
~/.bill2csv/expense_categories.md - Falls back to built-in defaults if not found
Categories are defined in markdown with indentation indicating hierarchy:
- Food & Dining
- Restaurants
- Groceries
- Coffee Shops
- Transportation
- Gas & Fuel
- Public Transit
- Rideshare & TaxiIn CSV output, sub-categories appear as: Food & Dining > Restaurants
- Model: Gemini 2.5 Flash (gemini-2.5-flash)
- Temperature: 0.1 for consistent results
- Max Tokens: 32768 (supports large bills)
- Retry Logic: Automatic retry with exponential backoff for API errors
- Design Document - Technical architecture and implementation details
- Specification - Original product requirements and specifications
- Categories Configuration - Default expense category hierarchy
MIT