Digitize table scans using the Gemini API.
- Install uv (fast Python package manager):
# Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows PowerShell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
- Get a Gemini API Key: https://aistudio.google.com/app/api-keys
- If you stay within the free-tier rate limits, API usage is free.
- To go above these limits, you need to set up billing in Google Cloud (~$300 in free credits after initial setup).
# 1. Create virtual environment
uv venv
# 2. Activate it
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# 3. Install dependencies
uv pip install -r requirements.txt
# 4. Set API key & start UI
export GEMINI_API_KEY='your-key' # Linux/macOS (or set GEMINI_API_KEY=... on Windows)
cd ui && streamlit run app.py
Once running at http://localhost:8501:
- Create a Prompt - Instructions and guidance for the LLM
- Create a Schema - Define the output columns
- Create a Project - Combine prompt + schema
- Upload PDFs - Add your documents to the project. All files in a project use the same prompt and schema.
- Process - Extract data from the tables. Press the "View" button of a file to inspect the data extracted from it.
To use the functionality directly in your code instead of the UI:
from table_ocr import ocr_pdf, create_batch_ocr_job
from google import genai

# Define your schema
schema = genai.types.Schema(
    type=genai.types.Type.OBJECT,
    properties={
        "table": genai.types.Schema(
            type=genai.types.Type.ARRAY,
            items=genai.types.Schema(
                type=genai.types.Type.OBJECT,
                properties={
                    "name": genai.types.Schema(type=genai.types.Type.STRING),
                    "date": genai.types.Schema(type=genai.types.Type.STRING),
                }
            )
        )
    }
)

# Direct processing (fast, full cost)
results = ocr_pdf(
    pdf_path="document.pdf",
    prompt_template="Extract the table data",
    response_schema=schema
)

# Batch processing (50% discount, ~24h processing time)
job_name = create_batch_ocr_job(
    pdf_path="document.pdf",
    prompt="Extract the table data",
    response_schema=schema
)
- The default model is Gemini-2.5-Flash-Lite. You can change the model in config.py. Gemini-2.5-Flash likely delivers better results at ~5x the cost.
- Problems can arise when remnants of the previous/next page are visible on the left/right edge of scanned images. You can try to solve this via prompting, by changing IMAGE_PROCESSING_CONFIG in config.py to crop the sides automatically, or by cropping manually.
- The UI stores data in the ocr_data/ directory at the repository root (created automatically).
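The side-cropping idea from the notes above can be sketched in a few lines. This is only an illustration of the arithmetic, not the project's implementation: the function name `side_crop_box` and its `left_frac`/`right_frac` parameters are hypothetical and do not reflect the actual keys of IMAGE_PROCESSING_CONFIG.

```python
def side_crop_box(width: int, height: int,
                  left_frac: float = 0.0,
                  right_frac: float = 0.0) -> tuple:
    """Return a (left, upper, right, lower) box that trims the side edges.

    Hypothetical helper: left_frac / right_frac are the fractions of the
    page width to cut from the left and right edge, respectively.
    """
    if not (0.0 <= left_frac < 1.0 and 0.0 <= right_frac < 1.0):
        raise ValueError("fractions must be in [0, 1)")
    left = round(width * left_frac)
    right = width - round(width * right_frac)
    if left >= right:
        raise ValueError("crop fractions remove the whole image")
    # Full height is kept; only the left and right margins are trimmed.
    return (left, 0, right, height)


# Example: trim 5% on the left and 3% on the right of a 2000x3000 px scan.
box = side_crop_box(2000, 3000, left_frac=0.05, right_frac=0.03)
print(box)  # (100, 0, 1940, 3000)
```

The resulting tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be applied to a page image before sending it to the model.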
Make sure you've activated your virtual environment:
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
Install dependencies:
uv pip install -r requirements.txt
Set your API key:
export GEMINI_API_KEY='your-key' # Linux/macOS
set GEMINI_API_KEY=your-key # Windows
- Choose which results file is active for each file for the final export.
- Majority voting functionality! This can fix most OCR issues.
- Set processing config via UI
- Allow changing prompt in a project
- Enable non-tabular structured data extraction!
- Make OCR model interchangeable (other API providers/LiteLLM, or local models such as Marker)
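As a rough idea of what the planned majority voting could look like, here is a minimal sketch. It is an assumption about one possible approach, not the project's design: the function `majority_vote` is hypothetical, and it assumes each run returns the same rows in the same order, as lists of row dictionaries. The sample values are made up.

```python
from collections import Counter


def majority_vote(runs):
    """Merge several extractions of the same table, cell by cell.

    `runs` is a list of tables; each table is a list of row dicts.
    Assumes all runs contain the same number of rows in the same order.
    """
    if not runs:
        return []
    n_rows = len(runs[0])
    if any(len(run) != n_rows for run in runs):
        raise ValueError("runs disagree on row count; align rows before voting")

    merged = []
    for rows in zip(*runs):  # the i-th row of every run
        # Union of column names, keeping first-seen order.
        columns = list(dict.fromkeys(key for row in rows for key in row))
        merged_row = {}
        for col in columns:
            votes = Counter(row.get(col) for row in rows)
            # Most frequent value wins; ties go to the value seen first.
            merged_row[col] = votes.most_common(1)[0][0]
        merged.append(merged_row)
    return merged


# Made-up example: three runs over the same one-row table.
runs = [
    [{"name": "Anna Meier", "date": "1901-03-12"}],
    [{"name": "Anna Meier", "date": "1901-08-12"}],
    [{"name": "Anna Meler", "date": "1901-03-12"}],
]
print(majority_vote(runs))
```

Each run here contains one misread cell at most, and the vote recovers the value that two of the three runs agree on. Runs that split or merge rows would need an alignment step first, which this sketch does not attempt.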