A companion service for Papra that adds LLM-powered OCR to your document library.
Polls Papra for PDF documents, converts pages to images, sends them to any OpenAI-compatible vision API, and writes structured markdown back to Papra.
- Polling-based -- discovers and processes new PDFs automatically
- Any OpenAI-compatible API -- OpenRouter, Ollama, OpenAI, etc.
- Markdown output -- preserves headings, tables, lists, and document structure
- Persistent state -- tracks processed documents so work is never repeated
- First-run control -- choose whether to OCR all existing documents or only new uploads
After OCR, the first page text is sent back to the LLM to automatically extract:
- Document name -- generates a descriptive name from the document content (e.g. "scan_001.pdf" becomes "Acme Corp — Invoice #2026-01-035.pdf"). The format is configurable via
OCR_NAME_FORMAT - Document date -- sets the document date in Papra based on dates found in the text
- Tags -- matches document content against your existing Papra tags and applies any that fit
Each extraction type can be individually enabled or disabled via environment variables. All three are enabled by default.
A lightweight HTTP server runs alongside the poller, allowing you to trigger reprocessing of specific documents:
curl -X POST http://localhost:7777/reprocess/doc_abc123This removes the document from the processed list so it gets picked up on the next poll cycle. Useful for re-extracting metadata after updating your tags or fixing a bad OCR result.
Add to your Papra docker-compose.yml:
papra-ai:
image: ghcr.io/koljam/papra-ai:latest
environment:
PAPRA_API_URL: http://papra:1221
PAPRA_API_KEY: ppapi_your_key_here
PAPRA_ORG_ID: org_your_org_id
OCR_API_URL: https://openrouter.ai/api/v1
OCR_API_KEY: sk-or-v1-your_key_here
OCR_MODEL: mistralai/mistral-large-2512
ports:
- 7777:7777 # reprocess API
volumes:
- papra-ai-data:/app/data # persist processed state across restartsImportant: Set
DOCUMENTS_CONTENT_EXTRACTION_ENABLED=falsein your Papra configuration to disable the built-in Tesseract OCR. Running both Tesseract and papra-ai simultaneously will cause race conditions.
First run with many existing documents? Set
OCR_PROCESS_EXISTING=falseto skip them and only process documents uploaded after the service starts. You can always re-OCR later by deleting thedata/processed.jsonfile.
| Variable | Required | Default | Description |
|---|---|---|---|
PAPRA_API_URL |
yes | -- | Papra instance URL (e.g. http://papra:1221) |
PAPRA_API_KEY |
yes | -- | Papra API key with document read/write access |
PAPRA_ORG_ID |
yes | -- | Papra organization ID |
OCR_API_URL |
yes | -- | OpenAI-compatible vision API base URL |
OCR_API_KEY |
yes | -- | API key for the vision model provider |
OCR_MODEL |
no | mistralai/mistral-large-2512 |
Vision model identifier |
OCR_CONCURRENCY |
no | 3 |
Number of PDF pages processed in parallel per document |
OCR_IMAGE_SCALE |
no | 2 |
Scale factor for PDF page rendering |
OCR_POLL_INTERVAL_SECONDS |
no | 60 |
Seconds between polling for new documents |
OCR_PROCESS_EXISTING |
no | true |
Set to false to skip existing documents on first run |
OCR_EXTRACT_NAME |
no | true |
Extract document name from first page and rename |
OCR_NAME_FORMAT |
no | {sender} — {type} {detail} |
Name format template with placeholders |
OCR_EXTRACT_DATE |
no | true |
Extract document date from first page |
OCR_EXTRACT_TAGS |
no | true |
Match and apply existing Papra tags to documents |
API_PORT |
no | 7777 |
Port for the reprocess API server |
DATA_DIR |
no | ./data |
Directory for persisting processed document state |
LOG_LEVEL |
no | info |
Log level (debug, info, warn, error) |
To reprocess a single document:
curl -X POST http://localhost:7777/reprocess/DOCUMENT_IDTo re-process everything from scratch, stop the service, delete data/processed.json, and restart with OCR_PROCESS_EXISTING=true (the default).
pnpm install
pnpm dev # starts with hot reload (.env loaded automatically)
pnpm build # builds for production
pnpm docker:build # builds Docker imageMIT