phenschke/table-ocr

Table OCR

Digitize table scans using the Gemini API.

Quick Start

Prerequisites

  1. Install uv (fast Python package manager):

    # Linux/macOS
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Windows PowerShell
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
    
    # Or with pip
    pip install uv
  2. Get a Gemini API Key: https://aistudio.google.com/app/api-keys

Setup & Run (from the project root folder)

# 1. Create virtual environment
uv venv

# 2. Activate it
source .venv/bin/activate            # Linux/macOS
.venv\Scripts\activate               # Windows

# 3. Install dependencies
uv pip install -r requirements.txt

# 4. Set API key & start UI
export GEMINI_API_KEY='your-key'     # Linux/macOS (or set GEMINI_API_KEY=... on Windows)
cd ui && streamlit run app.py

Using the UI

Once the app is running at http://localhost:8501:

  1. Create a Prompt - Instructions and guidance for the LLM
  2. Create a Schema - Define the output columns
  3. Create a Project - Combine prompt + schema
  4. Upload PDFs - Add your documents to the project. All files in a project use the same prompt and schema.
  5. Process - Extract data from the tables. Press the "View" button next to a file to inspect the data extracted from it.

Programmatic Usage

To call the library directly from your own code instead of using the UI:

from table_ocr import ocr_pdf, create_batch_ocr_job
from google import genai

# Define your schema
schema = genai.types.Schema(
    type=genai.types.Type.OBJECT,
    properties={
        "table": genai.types.Schema(
            type=genai.types.Type.ARRAY,
            items=genai.types.Schema(
                type=genai.types.Type.OBJECT,
                properties={
                    "name": genai.types.Schema(type=genai.types.Type.STRING),
                    "date": genai.types.Schema(type=genai.types.Type.STRING),
                }
            )
        )
    }
)

# Direct processing (fast, full cost)
results = ocr_pdf(
    pdf_path="document.pdf",
    prompt_template="Extract the table data",
    response_schema=schema
)

# Batch processing (50% discount, ~24h processing time)
job_name = create_batch_ocr_job(
    pdf_path="document.pdf",
    prompt="Extract the table data",
    response_schema=schema
)
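The return type of `ocr_pdf` is not documented here. Assuming the result is available as a dict (or JSON string) shaped like the schema above, with a `table` key holding a list of row objects, a minimal sketch for writing it to CSV might look like this. The helper `rows_to_csv` and the sample data are hypothetical and not part of `table_ocr`.

```python
import csv
import io
import json


def rows_to_csv(result, columns):
    """Serialize schema-shaped OCR output to CSV text.

    `result` is assumed to be a dict or JSON string of the form
    {"table": [{"name": ..., "date": ...}, ...]}. This shape is an
    assumption, not the documented return type of ocr_pdf.
    """
    if isinstance(result, str):
        result = json.loads(result)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in result.get("table", []):
        # Cells the model left out become empty strings instead of raising.
        writer.writerow({col: row.get(col, "") for col in columns})
    return buffer.getvalue()


# Hypothetical sample shaped like the schema above.
sample = {
    "table": [
        {"name": "Ada Lovelace", "date": "1815-12-10"},
        {"name": "Alan Turing"},
    ]
}
csv_text = rows_to_csv(sample, ["name", "date"])
print(csv_text)
```

Passing the column list explicitly keeps the CSV column order stable even when individual rows omit a field.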

Notes

  • The default model is Gemini-2.5-Flash-Lite. You can change the model in config.py. Gemini-2.5-Flash likely gives better results at roughly 5x the cost.
  • Extraction can go wrong when a scanned image shows remnants of the previous or next page along its left or right edge. Possible fixes: adjust the prompt, change IMAGE_PROCESSING_CONFIG in config.py to crop the sides automatically, or crop the scans manually.
  • The UI stores its data in the ocr_data/ directory at the repository root (created automatically).
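For the manual cropping route, the sketch below computes a crop box that trims a fraction of the page width from each side. The function name `side_crop_box` and the fraction values are illustrative assumptions; they are not taken from `IMAGE_PROCESSING_CONFIG`, whose keys are not shown here.

```python
def side_crop_box(width, height, left_frac=0.05, right_frac=0.05):
    """Return a (left, upper, right, lower) box that trims the page sides.

    The default fractions are illustrative, not values from config.py.
    """
    if not (0 <= left_frac < 1 and 0 <= right_frac < 1):
        raise ValueError("fractions must be in [0, 1)")
    if left_frac + right_frac >= 1:
        raise ValueError("crop would remove the whole page")
    left = int(round(width * left_frac))
    right = width - int(round(width * right_frac))
    # Full height is kept; only the left and right margins are trimmed.
    return (left, 0, right, height)


box = side_crop_box(2000, 3000, left_frac=0.04, right_frac=0.06)
print(box)
```

The tuple follows the (left, upper, right, lower) convention, so with Pillow it can be passed to `Image.crop(box)` on a page image.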

Troubleshooting

"streamlit: command not found"

Make sure you've activated your virtual environment:

source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

"ModuleNotFoundError: No module named 'google'"

Install dependencies:

uv pip install -r requirements.txt

"GEMINI_API_KEY not set"

Set your API key:

export GEMINI_API_KEY='your-key'    # Linux/macOS
set GEMINI_API_KEY=your-key         # Windows (cmd)
$env:GEMINI_API_KEY = 'your-key'    # Windows (PowerShell)
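To confirm that the variable is visible to Python in the shell you start the UI from, a quick check along these lines can help. The helper `gemini_key_status` is hypothetical and not part of the project; it reports only the key length, never the key itself.

```python
import os


def gemini_key_status(env=None):
    """Describe whether GEMINI_API_KEY is visible, without printing the key."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        return "GEMINI_API_KEY not set"
    return f"GEMINI_API_KEY is set ({len(key)} characters)"


print(gemini_key_status())
```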

Future Improvements

  • Choose which results file is used for each source file in the final export.
  • Majority voting across repeated extractions of the same file. This could fix many OCR errors.
  • Set processing config via UI
  • Allow changing prompt in a project
  • Enable non-tabular structured data extraction!
  • Make OCR model interchangeable (other API providers/LiteLLM, or local models such as Marker)
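The majority voting idea is not implemented yet. As a rough sketch of one possible approach, the code below merges several extractions of the same table cell by cell and keeps the most frequent value. The function `majority_vote` and the sample data are hypothetical, and rows are aligned by position, which assumes every run returns the same rows in the same order.

```python
from collections import Counter


def majority_vote(runs):
    """Merge several extractions of the same table, cell by cell.

    `runs` is a list of tables; each table is a list of row dicts.
    Rows are matched by index, so extra rows in longer runs are dropped.
    """
    if not runs:
        return []
    n_rows = min(len(table) for table in runs)
    merged = []
    for i in range(n_rows):
        # Collect column names in first-seen order across all runs.
        columns = []
        for table in runs:
            for col in table[i]:
                if col not in columns:
                    columns.append(col)
        row = {}
        for col in columns:
            votes = Counter(table[i].get(col) for table in runs)
            # On a tie, most_common keeps the value seen first (earliest run).
            row[col] = votes.most_common(1)[0][0]
        merged.append(row)
    return merged


# Three hypothetical runs, each with a different single-character OCR slip.
runs = [
    [{"name": "Ada Lovelace", "date": "1815-12-10"}],
    [{"name": "Ada Lovelace", "date": "1815-12-1O"}],
    [{"name": "Ada Love1ace", "date": "1815-12-10"}],
]
merged = majority_vote(runs)
print(merged)
```

Voting per cell rather than per row lets a correct value survive even when every run contains an error somewhere in that row.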
