A Python application that converts PDF invoices to Excel format, with support for both text-based and scanned PDFs. The application includes OCR capabilities for processing scanned documents and features a modern GUI interface with real-time processing logs.
- Converts PDF invoices to Excel format
- Supports both text-based and scanned PDFs using OCR
- Smart data extraction with OCR error correction:
  - Handles common OCR mistakes in numbers and text
  - Corrects product codes automatically
  - Normalizes unit measurements (oz, lb, pc)
- Extracts key information including:
  - Purchase and received quantities
  - Product codes (CAS/PK/BAG)
  - Brand information
  - Product descriptions
  - Cost per packet
  - Total cost
  - Unit cost calculations
- Modern GUI interface with:
  - Real-time processing logs
  - File selection dialogs
  - Progress tracking
  - Error handling
- Automatic Excel formatting with currency formatting
- Modular architecture for maintainability and extensibility
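As an illustration of the OCR error correction and unit normalization listed above, here is a minimal sketch. The lookalike map, the alias table, and the function names `fix_numeric_token` and `normalize_unit` are assumptions made for this example; the actual rules in the text processing module may differ.

```python
import re

# Assumed lookalike map: letters that OCR commonly substitutes for digits.
_DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Assumed unit aliases (hypothetical; the project's own table may differ).
_UNIT_ALIASES = {
    "oz": "oz", "0z": "oz", "ounce": "oz", "ounces": "oz",
    "lb": "lb", "lbs": "lb", "1b": "lb", "pound": "lb", "pounds": "lb",
    "pc": "pc", "pcs": "pc", "piece": "pc", "pieces": "pc",
}


def fix_numeric_token(token: str) -> str:
    """Replace digit lookalikes in a token that is expected to be a number."""
    candidate = token.translate(_DIGIT_LOOKALIKES)
    # Accept the correction only if the result really is a number;
    # otherwise return the token untouched (e.g. a product code like BAG).
    return candidate if re.fullmatch(r"\d+(\.\d+)?", candidate) else token


def normalize_unit(raw: str) -> str:
    """Map a raw unit string to oz, lb, or pc; return it unchanged if unknown."""
    key = raw.strip().lower().rstrip(".")
    return _UNIT_ALIASES.get(key, raw)
```

For example, `fix_numeric_token("1O.5O")` yields `"10.50"`, while `fix_numeric_token("BAG")` is left alone because the substituted form is not numeric.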
The codebase is organized into modules, each with specific responsibilities:
Invoice_to_Excel/
├── main.py # Entry point for the application
├── src/ # Source code directory
│ ├── converter.py # Main converter logic that ties modules together
│ ├── pdf_extraction/ # PDF text extraction module
│ │ └── extractor.py # Functions for extracting text from PDFs
│ ├── text_processing/ # Text processing module
│ │ └── processor.py # Functions for cleaning and parsing invoice text
│ ├── excel_output/ # Excel export module
│ │ └── export.py # Functions for formatting and exporting to Excel
│ └── gui/ # GUI module
│ └── app.py # User interface implementation
├── requirements.txt # Project dependencies
└── README.md # This file
- Windows Operating System
- Tesseract OCR installed (minimum version 5.0.0)
  - Install to the default location: `C:\Program Files\Tesseract-OCR`
  - Add Tesseract to your system PATH
  - Verify installation by running `tesseract --version` in command prompt
- Python 3.9 or newer (the pandas and Pillow versions listed below do not support older releases)
- Required Python packages (install using `pip install -r requirements.txt`):
  - pandas (>=2.2.3): Data manipulation and Excel export
  - pdfplumber (>=0.11.6): PDF text extraction
  - pytesseract (>=0.3.13): OCR processing
  - Pillow (>=11.2.1): Image processing
  - openpyxl (>=3.1.5): Excel file creation
  - python-dateutil (>=2.9.0): Date handling
  - pyinstaller (>=6.13.0): For creating executable
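The Tesseract checks described above can also be done from Python. The following is a rough sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of this project.

```python
import os
import shutil
from typing import Optional

# Default Windows install location named in the prerequisites.
DEFAULT_TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def find_tesseract(default_dir: str = DEFAULT_TESSERACT_DIR) -> Optional[str]:
    """Return the path to the tesseract executable, or None if it is not found."""
    # 1. Prefer whatever is on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # 2. Fall back to the default install directory.
    candidate = os.path.join(default_dir, "tesseract.exe")
    return candidate if os.path.isfile(candidate) else None
```

If the executable is found only in the default directory and not on PATH, pytesseract can be pointed at it by assigning the returned path to `pytesseract.pytesseract.tesseract_cmd`.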
- Download `Invoice_to_Excel.exe` from the `dist` folder
- Install Tesseract OCR:
  - Download from Tesseract GitHub Releases
  - Run installer and choose default location
  - Add to system PATH during installation
- Double-click the executable to run
- Clone this repository
- Install Python 3.x
- Install Tesseract OCR as described above
- Install required packages:
pip install -r requirements.txt
- Run the application:
python main.py
- Launch the application
- Click "Browse..." to select your PDF invoice
- Choose where to save the Excel output file
- Click "Process PDF" to start the conversion
- Monitor progress in the log window
- The Excel file is created with formatted data
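One common way to feed a log window in real time is to have the processing code log to a thread-safe queue that the GUI polls from its own timer. The sketch below shows that pattern with the standard library only. It is an assumption about how such a log view could work, not a description of this project's GUI code, and the names `make_queue_logger` and `drain` are hypothetical.

```python
import logging
import logging.handlers
import queue


def make_queue_logger(name: str = "invoice_converter"):
    """Return a logger whose records are pushed onto a thread-safe queue."""
    log_queue = queue.Queue()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger, log_queue


def drain(log_queue) -> list:
    """Collect pending messages without blocking; a GUI would call this on a timer."""
    lines = []
    while True:
        try:
            record = log_queue.get_nowait()
        except queue.Empty:
            return lines
        lines.append(record.getMessage())
```

The worker thread calls `logger.info(...)` as it processes pages, and the GUI periodically appends whatever `drain(log_queue)` returns to the log widget, so the interface never blocks on the conversion.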
The generated Excel file will contain the following columns:
- Purchased: Quantity purchased
- Received: Quantity received
- Code1: Primary product code (CAS/PK/BAG)
- Code2: Secondary product code
- Brand: Product brand
- Description: Product type/category
- Product: Full product description
- CostPerPacket: Cost per packet (currency formatted)
- TotalCost: Total cost (currency formatted)
- BarInParanthesis: Units per packet
- UnitCost: Cost per unit (currency formatted)
- Tentative: Calculated tentative price (currency formatted)
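The relationship between these columns is not spelled out here. A plausible reading, shown below purely as an assumption, is that UnitCost is CostPerPacket divided by the units per packet (BarInParanthesis). The function name `unit_cost` is hypothetical, and the Tentative price formula is left out because it is not documented.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def unit_cost(cost_per_packet: str, units_per_packet: int) -> Optional[Decimal]:
    """Assumed formula: UnitCost = CostPerPacket / BarInParanthesis, rounded to cents."""
    if units_per_packet <= 0:
        return None  # cannot derive a per-unit price without a unit count
    cost = Decimal(cost_per_packet)
    return (cost / units_per_packet).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Using `Decimal` rather than `float` avoids binary rounding surprises in currency values; for instance, a packet costing 24.00 with 12 units gives a unit cost of 2.00.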
- PDF Extraction Module (`src/pdf_extraction/extractor.py`):
  - Extracts text from PDF files
  - Uses pdfplumber for text-based PDFs
  - Uses pytesseract OCR for scanned PDFs
- Text Processing Module (`src/text_processing/processor.py`):
  - Cleans and parses OCR text
  - Handles detection of invoice items, quantities, codes, prices
  - Contains OCR error correction logic
- Excel Output Module (`src/excel_output/export.py`):
  - Formats and exports data to Excel
  - Applies proper column formatting and width adjustments
- GUI Module (`src/gui/app.py`):
  - Implements the user interface
  - Provides file selection, processing status, and log display
- Converter Module (`src/converter.py`):
  - Ties all components together
  - Orchestrates the conversion process
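To make the split between text-based and scanned PDFs concrete, here is a minimal sketch of a "text layer first, OCR as fallback" extractor. The function names, the `min_chars` threshold, and the 300 DPI rendering resolution are assumptions for illustration; the project's `extractor.py` may decide differently.

```python
from typing import Callable


def extract_with_pdfplumber(pdf_path: str) -> str:
    """Read the embedded text layer; suitable for text-based PDFs."""
    import pdfplumber  # imported lazily so the dispatcher below has no hard dependency

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or "" for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_ocr(pdf_path: str, resolution: int = 300) -> str:
    """Render each page to an image and run Tesseract on it."""
    import pdfplumber
    import pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(
            pytesseract.image_to_string(page.to_image(resolution=resolution).original)
            for page in pdf.pages
        )


def extract_text(
    pdf_path: str,
    text_extractor: Callable[[str], str] = extract_with_pdfplumber,
    ocr_extractor: Callable[[str], str] = extract_with_ocr,
    min_chars: int = 20,  # assumed threshold for "this PDF has a usable text layer"
) -> str:
    """Try the text layer first; fall back to OCR if almost nothing comes back."""
    text = text_extractor(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr_extractor(pdf_path)
```

Passing the two extractors as parameters keeps the fallback decision testable without a real PDF or a Tesseract installation.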
To add new features:
- Identify which module should contain the functionality
- Implement your feature in the appropriate module
- Update the converter module if needed to integrate your changes
- Test thoroughly
To create the executable yourself:
pip install pyinstaller
pyinstaller --onefile --windowed --icon=NONE --name="Invoice_to_Excel" main.py
The executable will be created in the `dist` directory.
- Tesseract Error:
  - Verify Tesseract is installed in `C:\Program Files\Tesseract-OCR`
  - Check if Tesseract is in system PATH
  - Run `tesseract --version` to verify installation
- PDF Not Reading:
  - Ensure the PDF is not password protected
  - Check if the PDF is readable (try opening in a PDF viewer)
  - For scanned PDFs, ensure good image quality
- Excel File Issues:
  - Make sure the output Excel file is not already open in another program
  - Verify you have write permissions in the output directory
  - Ensure enough disk space is available
- OCR Quality Issues:
  - Ensure PDF scan quality is good
  - Check if the PDF is properly oriented
  - Verify that the Tesseract installation is complete, including the required language packs
For issues and feature requests, please create an issue in the repository.
This project is licensed under the MIT License - see the LICENSE file for details.