PDF Table Extraction with OCR

A tool for extracting tables from PDF documents using image-based techniques and OCR.

Features

Loads PDF files and converts them to images
Detects tables in PDF pages using computer vision
Crops tables from pages with configurable padding
Identifies rows and columns using line detection
Extracts text using PaddleOCR with multi-language support
Exports table data to CSV files
Merges tables with identical structures into consolidated CSVs
Tracks progress with visual feedback
Logs operations comprehensively
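
The row and column identification step listed above can be pictured with a projection profile: a ruling line is a pixel row (or column) that is almost entirely ink. The sketch below only illustrates that idea and is not taken from this tool, whose line detection may work differently. It runs on a plain 0/1 nested list instead of a real image, and `find_line_positions` and `min_fill` are hypothetical names.

```python
def find_line_positions(binary, axis, min_fill=0.8):
    """Return centre indices of ruling lines in a 0/1 image.

    binary:   list of equal-length rows, 1 = ink, 0 = background.
    axis:     "row" for horizontal lines, "col" for vertical lines.
    min_fill: fraction of pixels that must be ink for a row or column
              to count as part of a line (hypothetical threshold).
    """
    if axis == "row":
        profiles = binary
    else:
        profiles = list(zip(*binary))  # transpose so columns become rows

    hits = [i for i, line in enumerate(profiles)
            if sum(line) / len(line) >= min_fill]

    # Merge runs of adjacent indices, since a thick line spans several pixels.
    positions, run = [], []
    for i in hits:
        if run and i != run[-1] + 1:
            positions.append(sum(run) // len(run))
            run = []
        run.append(i)
    if run:
        positions.append(sum(run) // len(run))
    return positions


# A 5x7 toy "table": outer border plus one inner horizontal and vertical line.
image = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
row_lines = find_line_positions(image, "row")
col_lines = find_line_positions(image, "col")
```

Each pair of neighbouring positions in `row_lines` and `col_lines` bounds one cell, which is the region that would then be passed to OCR.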

Key Features

Intelligent CSV Merging

Tables are grouped by column structure
A separate merged file is created for each table format
Duplicate header rows are removed during merging
Original table order is preserved within each structure group
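
As a rough illustration of the merging rules in this README (grouping by structure, reusing the first header, keeping order), here is a minimal sketch. It assumes "structure" means column count and that the first row of each table is its header; `merge_by_structure` is a hypothetical helper, not the project's own function.

```python
def merge_by_structure(tables):
    """Group tables by column count and merge each group in order.

    tables: list of tables in extraction order; each table is a list of
            rows, and its first row is treated as the header.
    Returns a dict mapping column count -> merged rows.
    """
    groups = {}
    for table in tables:
        if not table:
            continue  # nothing to merge
        header, body = table[0], table[1:]
        merged = groups.setdefault(len(header), [])
        if not merged:
            merged.append(header)   # first table supplies the reference header
            merged.extend(body)
        elif header == merged[0]:
            merged.extend(body)     # matching header: skip the repeated row
        else:
            merged.extend(table)    # different header: keep the table as-is
    return groups


tables = [
    [["Item", "Qty"], ["Apple", "3"]],
    [["Item", "Qty"], ["Pear", "5"]],
    [["Name", "Age", "City"], ["Ana", "30", "Lima"]],
]
merged = merge_by_structure(tables)
```

Because Python dictionaries and lists keep insertion order, tables stay in their original order within each group without any extra sorting.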

Logging

Detailed operations log in table_extractor.log
Includes both application and OCR engine logs
The log file is overwritten on each run
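
One common way to get a single log file that is replaced on every run is to open it in write mode on the root logger. The sketch below uses only the standard `logging` module. `configure_logging` is a hypothetical name, and how the OCR engine's own logger is attached in this project is not shown here; only loggers that propagate to the root logger end up in the file.

```python
import logging


def configure_logging(log_path="table_extractor.log"):
    """Route root-logger records to one file, truncating it on each call."""
    logging.basicConfig(
        filename=log_path,
        filemode="w",   # "w" truncates the file, so each run starts clean
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,     # replace handlers left over from an earlier call
    )
```

After calling `configure_logging()`, any module can log with `logging.getLogger(__name__).info(...)` and the record is written to the same file, tagged with the logger name.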

Progress Tracking

Real-time progress bars for:
- PDF loading
- Table detection
- OCR processing
- File saving
Clean completion message with timestamp

Output Notes

Tables with different column counts are saved in separate merged files
The first table's header is used as the reference for merging
Subsequent tables with matching headers are merged without their header row
Tables with unique headers are preserved as-is
Empty rows/columns are automatically trimmed from CSV outputs
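
The trimming mentioned in the last note could look roughly like this. The sketch assumes "trimmed" means dropping every row and column whose cells are all blank, which is one reading of the note. `trim_empty` is a hypothetical helper.

```python
def trim_empty(rows):
    """Drop rows and columns whose cells are all blank."""
    def blank(cell):
        return cell is None or str(cell).strip() == ""

    # Remove fully blank rows first.
    rows = [row for row in rows if not all(blank(c) for c in row)]
    if not rows:
        return []

    # Pad ragged rows so every column index exists in every row.
    width = max(len(row) for row in rows)
    padded = [list(row) + [""] * (width - len(row)) for row in rows]

    # Keep only columns that contain at least one non-blank cell.
    keep = [j for j in range(width)
            if not all(blank(row[j]) for row in padded)]
    return [[row[j] for j in keep] for row in padded]


table = [
    ["Item", "", "Qty"],
    ["", "", ""],
    ["Apple", " ", "3"],
]
trimmed = trim_empty(table)
```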

Installation

Clone the repository
Install the required dependencies:

pip install -r requirements.txt

Or, using Astral uv:

uv sync

Note: install Astral uv first if it is not already installed.

Note for Windows users

For Windows, you might need to install Poppler separately:

Download the Poppler binary from: https://github.com/oschwartz10612/poppler-windows/releases/
Extract it to a folder (e.g., C:\poppler)
Add the bin directory to your PATH: C:\poppler\bin
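
To confirm the PATH change took effect, a quick check like the following can help. It looks for `pdftoppm`, one of the command-line tools that ships with Poppler. The sketch assumes the tool locates Poppler through PATH, as the steps above imply; `find_poppler` is a hypothetical helper.

```python
import shutil


def find_poppler(binary="pdftoppm"):
    """Return the full path of a Poppler executable on PATH, or None."""
    return shutil.which(binary)


path = find_poppler()
if path is None:
    print("Poppler not found on PATH; add its bin directory and reopen the terminal.")
else:
    print(f"Poppler found: {path}")
```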

Usage

Run the table extraction tool with:

python main.py path_to_your_pdf.pdf

Optional arguments

--output_dir, -o: Output directory (default: 'data/output')
--dpi: DPI for PDF to image conversion (default: 300)
--denoise/--no-denoise: Toggle image denoising (default: True)
--morph-close/--no-morph-close: Toggle morphological closing (default: True)
--crop-padding: Padding around cropped tables in pixels (default: 1)
--lang: OCR language code (default: 'en')
--use-gpu: Use GPU acceleration for OCR (requires compatible hardware)

Example:

python main.py financial_report.pdf --output_dir results --dpi 400 --lang en --use-gpu
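
For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an approximation for illustration only: the project's real parser may be built differently (for example with another CLI library), and `build_parser` is a hypothetical name. `argparse.BooleanOptionalAction` needs Python 3.9 or later.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented above."""
    parser = argparse.ArgumentParser(
        description="Extract tables from a PDF with OCR."
    )
    parser.add_argument("pdf_path", help="Path to the input PDF")
    parser.add_argument("--output_dir", "-o", default="data/output")
    parser.add_argument("--dpi", type=int, default=300)
    # BooleanOptionalAction generates the matching --no-... flag.
    parser.add_argument("--denoise", action=argparse.BooleanOptionalAction,
                        default=True)
    parser.add_argument("--morph-close", action=argparse.BooleanOptionalAction,
                        default=True)
    parser.add_argument("--crop-padding", type=int, default=1)
    parser.add_argument("--lang", default="en")
    parser.add_argument("--use-gpu", action="store_true")
    return parser


# Same arguments as the example command above.
args = build_parser().parse_args(
    ["financial_report.pdf", "--output_dir", "results",
     "--dpi", "400", "--lang", "en", "--use-gpu"]
)
```

Options that are not passed keep their defaults, so in this example denoising and morphological closing stay enabled and the crop padding stays at 1 pixel.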

Output Structure

The tool creates the following directory structure:

output
├── merged_table.csv      # All tables merged into a single CSV
├── cropped/              # Cropped table images
│   ├── page_0_table_0.png
│   ├── page_0_table_1.png
│   └── ...
├── csv/                  # Individual table CSV files
│   ├── page_0_table_0.csv
│   ├── page_0_table_1.csv
│   └── ...
├── detected/             # Pages with table detection boxes
│   ├── page_0.png
│   ├── page_1.png
│   └── ...
├── original/             # Original page images
│   ├── page_0.png
│   ├── page_1.png
│   └── ...
└── structure/            # Tables with grid structure
    ├── page_0_table_0.png
    ├── page_0_table_1.png
    └── ...

Each entry holds the output of one stage of the processing pipeline:

original/: Original PDF pages as images
detected/: Pages with green boxes showing detected table regions
cropped/: Individual cropped tables from each page
structure/: Tables with detected grid structure (red lines)
csv/: CSV files containing the extracted text from each table
merged_table.csv: All tables combined into a single CSV file, ordered by page and table number
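
When inspecting results for one table, it helps to compute where each artifact lives. The sketch below follows the naming pattern shown in the tree above (`page_<n>_table_<m>`); the helper names `make_output_dirs` and `table_paths` are hypothetical and not part of the tool.

```python
from pathlib import Path

# Subdirectories from the output tree above.
SUBDIRS = ("original", "detected", "cropped", "structure", "csv")


def make_output_dirs(output_dir):
    """Create the output subdirectories if they do not exist yet."""
    root = Path(output_dir)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def table_paths(output_dir, page, table):
    """Return the per-table artifact paths for one page/table index pair."""
    root = Path(output_dir)
    stem = f"page_{page}_table_{table}"
    return {
        "cropped": root / "cropped" / f"{stem}.png",
        "structure": root / "structure" / f"{stem}.png",
        "csv": root / "csv" / f"{stem}.csv",
    }
```

For example, `table_paths("results", 0, 1)["csv"]` points at `results/csv/page_0_table_1.csv`, which can be opened next to the matching image in `structure/` to compare the OCR text against the detected grid.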

Acknowledgement

PaddleOCR for optical character recognition.
