Turn any PDF or image into clean text — or a typed Pydantic model — in three lines.
Decoupled, LLM-agnostic document OCR + structured extraction. No web server, no vendor lock-in.
Try it in 30 seconds — no Python script needed:
pip install 'ocrcontext[paddle,cli]'
ocrcontext extract invoice.pdf
ocrcontext extract receipt.jpg --output json

Or use the Python API:
from ocrcontext import Analyzer
result = Analyzer().analyze("invoice.pdf")
print(result.text)

ocrcontext is the extraction core of a production document-analysis platform, lifted out of its FastAPI/Next.js stack into a pure, pip-installable library. It handles OCR engine routing, fidelity-first LLM cleanup, and schema-based structured extraction, then gets out of your way.
Structured invoice extraction from an image:
Digital PDF text extraction:
- Demo
- Install
- CLI
- Quick start (Python API)
- LangChain integration
- Built-in schemas
- How it routes a document
- Refinement modes
- Configuration
- Development
- License
Engines are opt-in so your base install stays small:
| Command | What you get |
|---|---|
| `pip install ocrcontext` | Digital PDFs only (PyMuPDF text layer; no OCR, no GPU, no API key) |
| `pip install 'ocrcontext[paddle]'` | + printed images & scanned PDFs (PaddleOCR, CPU/GPU) |
| `pip install 'ocrcontext[vision]'` | + handwriting (Google Cloud Vision) |
| `pip install 'ocrcontext[cli]'` | + terminal CLI (`ocrcontext extract`) |
| `pip install 'ocrcontext[all]'` | everything above |
Add an LLM provider for refinement and structured extraction:
pip install langchain-openai # or langchain-anthropic, langchain-ollama, ...

Images and scanned PDFs require `[paddle]`. With a bare `pip install ocrcontext`, passing an image file raises an `EngineError` with a clear install hint.
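To check up front which optional engines are present, rather than waiting for an `EngineError`, something like the following could work. It is a minimal sketch: the mapping from extras to import names (`paddleocr`, `google.cloud.vision`) is an assumption based on the upstream packages, not something taken from ocrcontext itself, and `available_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from ocrcontext extras to the modules they pull in.
# These import names are guesses based on the upstream packages.
EXTRA_MODULES = {
    "paddle": "paddleocr",
    "vision": "google.cloud.vision",
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "google") raises instead of returning None.
        return False


def available_extras() -> dict[str, bool]:
    """Report which optional engines appear to be installed."""
    return {extra: is_importable(module) for extra, module in EXTRA_MODULES.items()}


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"[{extra}] {'installed' if present else 'missing'}")
```

Running it in a fresh environment gives a quick picture of which `pip install 'ocrcontext[...]'` line is still needed.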
- Enable the Cloud Vision API in Google Cloud Console
- Create a service account key (JSON) under IAM & Admin → Service Accounts → Keys
- Export the path:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json" # Linux/macOS
$env:GOOGLE_APPLICATION_CREDENTIALS = "C:\path\to\key.json" # PowerShell

Install the [cli] extra to use ocrcontext straight from the terminal, with no Python script needed.
pip install 'ocrcontext[paddle,cli]'

Extract plain text:
ocrcontext extract invoice.pdf
ocrcontext extract scan.png --output json

Extract structured data with a built-in schema:
ocrcontext extract invoice.pdf --schema invoice
ocrcontext extract receipt.jpg --schema receipt
ocrcontext extract contract.pdf --schema contract
ocrcontext extract passport.jpg --schema idcard
ocrcontext extract lab_report.pdf --schema medical

Choose your LLM provider:
ocrcontext extract invoice.pdf --schema invoice \
    --provider openai --model gpt-4o-mini
ocrcontext extract invoice.pdf --schema invoice \
    --provider anthropic --model claude-haiku-4-5-20251001
ocrcontext extract invoice.pdf --schema invoice \
    --provider ollama --model llama3.1

All options:
ocrcontext extract FILE [OPTIONS]

  --schema   -s   invoice | receipt | contract | idcard | medical
  --lang     -l   Language code (default: en)
  --handwriting   Force handwriting engine
  --refine        auto (default) | yes | no
  --output   -o   text (default) | json
  --provider -p   openai | anthropic | ollama | google
  --model    -m   Model name (default: gpt-4o-mini)
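For batch jobs it can be handy to assemble these commands from a script. The sketch below only builds argument lists from the flags listed above; `build_extract_command` and `batch_commands` are hypothetical helpers, not part of ocrcontext, and nothing is executed until you pass a command to `subprocess.run`.

```python
import shlex
from pathlib import Path


def build_extract_command(path, schema=None, output="text", provider=None, model=None):
    """Assemble the argv list for one `ocrcontext extract` call."""
    cmd = ["ocrcontext", "extract", str(path), "--output", output]
    if schema:
        cmd += ["--schema", schema]
    if provider:
        cmd += ["--provider", provider]
    if model:
        cmd += ["--model", model]
    return cmd


def batch_commands(folder, suffixes=(".pdf", ".png", ".jpg"), **options):
    """Yield one command per matching file in `folder`, in sorted order."""
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in suffixes:
            yield build_extract_command(path, **options)


if __name__ == "__main__":
    for cmd in batch_commands(".", schema="invoice", output="json"):
        # Printed rather than run; use subprocess.run(cmd, check=True)
        # once the [cli] extra is installed.
        print(shlex.join(cmd))
```

Building a list instead of a shell string avoids quoting problems with file names that contain spaces.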
from ocrcontext import Analyzer
result = Analyzer().analyze("document.pdf")
print(result.text) # extracted text
print(result.pages) # page count
print(result.text_source) # "pdf_text_layer"

pip install 'ocrcontext[paddle]'

from ocrcontext import Analyzer
result = Analyzer().analyze("scan.png")
print(result.text, result.confidence)

If you have a CUDA-capable GPU, swap the CPU PaddlePaddle build for the GPU one and pass use_gpu=True:
pip install 'ocrcontext[paddle]'
pip install paddlepaddle-gpu # replaces the CPU build; pick the wheel that matches your CUDA version

from ocrcontext import Analyzer
analyzer = Analyzer(use_gpu=True)
result = analyzer.analyze("scan.png")
print(result.text, result.confidence)

PaddleOCR is typically 5–10× faster on GPU for large documents or batch workloads. CPU (`use_gpu=False`, the default) works out of the box with no extra steps.
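If the same script has to run on machines with and without a GPU build, the flag can be derived at runtime. This is a sketch, assuming PaddlePaddle's `paddle.device.is_compiled_with_cuda()` is a good enough signal; it reports whether the installed wheel was built with CUDA, not whether a device is actually present. `paddle_has_cuda` is a hypothetical helper name.

```python
def paddle_has_cuda() -> bool:
    """Return True only if PaddlePaddle imports cleanly and was built with CUDA."""
    try:
        import paddle
    except (ImportError, OSError):
        # Not installed, or installed but missing native libraries.
        return False
    return bool(paddle.device.is_compiled_with_cuda())


if __name__ == "__main__":
    # The result can be passed straight through, e.g. Analyzer(use_gpu=paddle_has_cuda()).
    print("GPU build available:", paddle_has_cuda())
```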
Refinement fixes character-level OCR errors without paraphrasing, translating, or inventing. Emails, URLs, and IBANs are masked before the model sees them and restored verbatim after. Output that drifts too far from the source is rejected in favour of the raw OCR text.
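The mask, refine, restore, and reject steps described above can be illustrated with a few lines of standard-library Python. This is a sketch of the idea, not the library's implementation: the regular expressions are simplified, the `[[LITn]]` placeholder format is invented for the example, and the drift check uses `difflib.SequenceMatcher` with an arbitrary 0.8 threshold as a stand-in for whatever metric ocrcontext applies.

```python
import re
from difflib import SequenceMatcher

# Simplified patterns for illustration only.
LITERAL_PATTERNS = [
    re.compile(r"https?://\S+"),                    # URLs
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),    # emails
    re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),  # unspaced IBANs
]


def mask_literals(text):
    """Swap each literal for a numbered placeholder; return the text and the originals."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"[[LIT{len(saved) - 1}]]"

    for pattern in LITERAL_PATTERNS:
        text = pattern.sub(stash, text)
    return text, saved


def restore_literals(text, saved):
    """Put the original literals back; unknown placeholders are left untouched."""
    def put_back(match):
        index = int(match.group(1))
        return saved[index] if index < len(saved) else match.group(0)

    return re.sub(r"\[\[LIT(\d+)\]\]", put_back, text)


def guarded_refine(raw_text, refine, min_similarity=0.8):
    """Run `refine` on masked text; fall back to raw_text if the result drifts too far."""
    masked, saved = mask_literals(raw_text)
    candidate = restore_literals(refine(masked), saved)
    similarity = SequenceMatcher(None, raw_text, candidate).ratio()
    return candidate if similarity >= min_similarity else raw_text
```

Here `refine` is any callable that takes and returns a string, so a model call can be swapped for a plain function when testing.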
pip install 'ocrcontext[paddle]' langchain-openai
export OPENAI_API_KEY="sk-..."

from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini"), lang="en")
result = analyzer.analyze("scan.jpg")
print(result.text) # refined
print(result.raw_text) # original OCR output
print(result.refined) # True

Hand the analyzer a Pydantic schema and get a populated instance back.
from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
from ocrcontext.schemas import Invoice
analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))
invoice = analyzer.extract("invoice.pdf", schema=Invoice)
print(invoice.supplier_name, invoice.total_amount, invoice.currency)
for item in invoice.line_items:
    print(item.description, item.quantity, item.unit_price)

Define your own schema; the field descriptions serve as the prompt:
from pydantic import BaseModel, Field
class ShippingLabel(BaseModel):
    sender: str | None = Field(None, description="Sender full name and address")
    recipient: str | None = Field(None, description="Recipient full name and address")
    tracking_number: str | None = Field(None, description="Carrier tracking number")

label = analyzer.extract("label.jpg", schema=ShippingLabel)

from langchain_ollama import ChatOllama
from ocrcontext import Analyzer
analyzer = Analyzer(llm=ChatOllama(model="llama3.1"))
result = analyzer.analyze("scan.png")
print(result.text)

OCRContextLoader is a drop-in LangChain BaseLoader. It slots into any LangChain pipeline (RAG, document Q&A, agents) without glue code.
from ocrcontext.loaders import OCRContextLoader
# Digital PDF — no LLM needed
loader = OCRContextLoader("contract.pdf")
docs = loader.load() # -> [Document(page_content="...", metadata={...})]
print(docs[0].metadata)
# {
# "source": "contract.pdf",
# "text_source": "pdf_text_layer",
# "pages": 4,
# "confidence": 0.99,
# "refined": False,
# }
# Scanned PDF or image with LLM refinement
from langchain_openai import ChatOpenAI
loader = OCRContextLoader(
    "scan.pdf",
    llm=ChatOpenAI(model="gpt-4o-mini"),
    lang="en",
    refine=True,
)
docs = loader.load()
print(docs[0].page_content) # LLM-refined OCR text
print(docs[0].metadata["raw_text"]) # original OCR output before refinement

In a RAG pipeline:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from ocrcontext.loaders import OCRContextLoader
# 1. OCR the document
docs = OCRContextLoader("annual_report.pdf").load()
# 2. Chunk, embed, store
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
vectorstore = InMemoryVectorStore.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
# 3. QA chain
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
chain.invoke("What is the total revenue for Q3?")

Five ready-to-use Pydantic schemas with system prompts, importable from ocrcontext.schemas.
Pass them directly to analyzer.extract() or the CLI --schema flag.
from ocrcontext.schemas import Invoice
invoice = analyzer.extract("invoice.pdf", schema=Invoice)
# invoice.supplier_name, .invoice_number, .invoice_date, .total_amount,
# .currency, .tax_id, .tax_rate, .line_items (list[LineItem])

from ocrcontext.schemas import Receipt
receipt = analyzer.extract("receipt.jpg", schema=Receipt)
# receipt.store_name, .date, .time, .total_amount, .tax_amount,
# .subtotal, .payment_method, .currency, .items (list[ReceiptItem])

from ocrcontext.schemas import Contract
contract = analyzer.extract("agreement.pdf", schema=Contract)
# contract.title, .effective_date, .expiration_date, .contract_value,
# .currency, .governing_law, .key_obligations,
# .parties (list[ContractParty] with .name, .role)

IdCard supports the document types national_id, passport, driver_license, and residence_permit.
from ocrcontext.schemas import IdCard
card = analyzer.extract("passport.jpg", schema=IdCard)
# card.document_type, .full_name, .date_of_birth, .gender,
# .nationality, .document_number, .issue_date, .expiry_date,
# .issuing_authority, .address

from ocrcontext.schemas import MedicalReport
report = analyzer.extract("lab_report.pdf", schema=MedicalReport)
# report.patient_name, .patient_dob, .report_date, .doctor_name,
# .institution, .diagnosis, .icd_codes (list[str]),
# .medications (list[Medication]), .notes

                 ┌────────────────────────┐
                 │        Analyzer        │
Document ──────> │ route by document type │
                 └────────────────────────┘
                             │
                             ▼
        ┌─────────────────────────────────────────┐
        │           Processing pipeline           │
        ├─────────────────────────────────────────┤
        │ 1. Digital PDF                          │
        │    → PyMuPDF text layer                 │
        │      (LLM refine auto-skipped)          │
        │                                         │
        │ 2. Image / scanned PDF                  │
        │    → PaddleOCR                          │
        │      preprocess → coverage-first        │
        │      → line-band fallback               │
        │                                         │
        │ 3. Handwriting (explicit or auto)       │
        │    → Google Cloud Vision                │
        │    → PaddleOCR if Vision empty          │
        │                                         │
        │ 4. LLM refine (optional)                │
        │    fidelity-first · literal-safe        │
        │                                         │
        │ 5. extract(schema) (optional)           │
        │    → typed Pydantic model               │
        └─────────────────────────────────────────┘
Multi-page documents are joined with --- Page N --- separators.
Handwriting step 3 is explicit-only by default; set auto_handwriting_fallback=True to enable automatic retry.
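Because pages are joined with `--- Page N ---` separators, the combined text can be split back into pages when needed. The sketch below assumes each separator sits on its own line in exactly that form; whether the first page carries a separator is not stated here, so both layouts are handled. `split_pages` is a hypothetical helper, not part of ocrcontext.

```python
import re

# Assumes each separator is on its own line and looks like "--- Page 3 ---".
PAGE_SEPARATOR = re.compile(r"^--- Page (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(text):
    """Map page number -> page text for output joined with '--- Page N ---' lines."""
    matches = list(PAGE_SEPARATOR.finditer(text))
    if not matches:
        # Single-page output has no separators at all.
        return {1: text.strip()}

    pages = {}
    # Text before the first separator belongs to the page preceding it.
    preamble = text[: matches[0].start()].strip()
    if preamble:
        pages[int(matches[0].group(1)) - 1] = preamble

    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        pages[int(current.group(1))] = text[current.end():end].strip()
    return pages
```

A call such as `split_pages(result.text)` then gives per-page strings that can be chunked or indexed separately.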
| Mode | When it's used |
|---|---|
| `conservative` | Scanned images: minimal char-level correction only |
| `layout` | Digital PDFs: reconstruct clean structure |
| `handwriting_layout` | Handwritten notes / lists / diagrams |
| `handwriting_prose` | Handwritten poems / paragraphs / letters |
Modes are auto-selected based on the document type and text content. The handwriting mode choice is driven by whether the text looks like a DIKW/pyramid diagram. All prompts are ported verbatim from the production pipeline.
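Read as a decision rule, the table above might look roughly like the sketch below. It is only an illustration: `Mode` and `pick_mode` are stand-ins rather than the library's `RefinementMode` logic, and the diagram detection is reduced to a boolean the caller supplies, since the actual heuristic is not shown here.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for the four refinement modes listed in the table."""
    CONSERVATIVE = "conservative"
    LAYOUT = "layout"
    HANDWRITING_LAYOUT = "handwriting_layout"
    HANDWRITING_PROSE = "handwriting_prose"


def pick_mode(text_source: str, handwriting: bool = False, looks_like_diagram: bool = False) -> Mode:
    """Choose a mode from the document type, following the table row by row."""
    if handwriting:
        # Notes, lists, and diagrams keep their layout; everything else is treated as prose.
        return Mode.HANDWRITING_LAYOUT if looks_like_diagram else Mode.HANDWRITING_PROSE
    if text_source == "pdf_text_layer":
        return Mode.LAYOUT
    # Anything that went through OCR gets the minimal, character-level mode.
    return Mode.CONSERVATIVE
```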
Override manually:
from ocrcontext import Analyzer, RefinementMode
result = analyzer.analyze("scan.png", mode=RefinementMode.CONSERVATIVE)

from ocrcontext import Analyzer, AnalyzerConfig
cfg = AnalyzerConfig(
    lang="tr", # default document language
    prefer_pdf_text_layer=True, # skip OCR when a text layer exists
    auto_handwriting_fallback=False, # keep PaddleOCR as sole engine (default); set True to enable Vision fallback
    refine_by_default=True, # auto-refine whenever an LLM is configured
)
analyzer = Analyzer(llm=..., config=cfg, use_gpu=False) # set use_gpu=True for CUDA-capable devices

git clone https://github.com/BahadirKarsli/OCRContext
cd OCRContext
pip install -e '.[dev]'
pytest # runs without GPU or network — engines and LLM are faked
ruff check .

See examples/ for runnable smoke tests (image OCR, structured extraction, PDF routing).
MIT © Bahadır Karslı