This repository implements The Document Intelligence Refinery – a production-inspired, multi-stage agentic pipeline for converting heterogeneous enterprise documents into structured, queryable, spatially-indexed knowledge.
- Triage Agent (`src/agents/triage.py`): profiles each document (origin type, layout complexity, language, domain hint, estimated extraction cost) using configurable thresholds and domain keywords from `rubric/extraction_rules.yaml`, then writes `DocumentProfile` JSON to `.refinery/profiles/`.
- Multi-Strategy Extraction (`src/agents/extractor.py`, `src/strategies/`):
  - Fast text extraction (pdfplumber) with confidence scoring.
  - Layout-aware extraction (Docling) for complex/table-heavy documents.
  - Vision-augmented extraction (VLM via HTTP API) for scanned/low-confidence pages, with budget caps from config.
  - Escalation guard and extraction ledger in `.refinery/extraction_ledger.jsonl`.
  - Cached `ExtractedDocument` JSON snapshots in `.refinery/extracted/` so downstream stages can be re-run without re-extracting.
- Semantic Chunking & PageIndex:
  - Normalized `ExtractedDocument` model and `LDU` chunks with stable `content_hash` identifiers.
  - Config-driven chunking rules (token budgets, list handling) from `rubric/extraction_rules.yaml`.
  - PageIndex tree to navigate long documents before vector search.
- Query Agent & Storage:
  - LangGraph-based agent with `pageindex_navigate`, `semantic_search`, and `structured_query` tools.
  - Vector store (Chroma) for LDU embeddings in `.refinery/chroma/`.
  - Structured `FactTable` for numeric facts in `.refinery/facts.sqlite`.
  - Every answer includes a `ProvenanceChain` (document, page, bbox, content_hash).
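To make the `FactTable`-backed `structured_query` path concrete, the sketch below stores one numeric fact in an in-memory SQLite table and queries it back with page-level provenance. The schema and column names are assumptions for illustration only, not the pipeline's actual DDL:

```python
import sqlite3

# Illustrative only: the real FactTable schema lives in the pipeline code;
# the column names below are assumptions based on the README description.
conn = sqlite3.connect(":memory:")  # the pipeline persists to .refinery/facts.sqlite
conn.execute(
    """CREATE TABLE facts (
           doc_id TEXT, page INTEGER, metric TEXT,
           value REAL, unit TEXT, content_hash TEXT)"""
)
conn.execute(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
    ("annual_report_2017", 12, "net_revenue", 4.2e6, "USD", "ab12cd"),
)

# A structured_query-style lookup: numeric facts plus the page they came from.
rows = conn.execute(
    "SELECT metric, value, unit, page FROM facts WHERE doc_id = ?",
    ("annual_report_2017",),
).fetchall()
print(rows)  # [('net_revenue', 4200000.0, 'USD', 12)]
```

Keeping numeric facts in SQLite (rather than only in the vector store) is what lets the agent answer quantitative questions with exact values instead of retrieved prose.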
- `src/models/` – Core Pydantic models:
  - `document_profile.py` – `DocumentProfile` and enums.
  - `extracted_document.py` – normalized extraction schema.
  - `ldu.py` – logical document units.
  - `page_index.py` – PageIndex tree structures.
  - `provenance.py` – provenance spans and chains.
- `src/agents/` – Pipeline agents:
  - `triage.py` – Triage Agent.
  - `extractor.py` – ExtractionRouter.
  - `chunker.py` – Semantic Chunking Engine (Phase 3).
  - `indexer.py` – PageIndex builder (Phase 3).
  - `query_agent.py` – LangGraph query interface (Phase 4).
- `src/strategies/` – Extraction strategies:
  - `base.py` – shared strategy interface.
  - `fast_text.py` – pdfplumber-based extractor with confidence scoring.
  - `layout_docling.py` – Docling-based layout-aware extractor.
  - `vision_vlm.py` – VLM-based extractor with budget guard.
- `rubric/extraction_rules.yaml` – externalized thresholds, domain keywords, and chunking rules.
- `.refinery/` – Runtime artifacts:
  - `profiles/` – `DocumentProfile` JSON outputs.
  - `extracted/` – cached `ExtractedDocument` JSON snapshots.
  - `pageindex/` – PageIndex JSON trees.
  - `extraction_ledger.jsonl` – extraction trace and cost estimates.
  - `chroma/` – persisted vector store for LDUs.
  - `facts.sqlite` – SQLite-backed fact table for numeric/financial facts.
- `scripts/` – Utility entrypoints:
  - `run_pipeline.py` – programmatic end-to-end pipeline runner.
  - `vector_explorer.py` – CLI explorer for the vector store (LDUs).
  - `chunk_from_extracted.py` – run chunking + ingest from cached extraction only.
  - `export_markdown.py` – export cached extraction to Markdown.
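The stable `content_hash` identifiers carried by LDUs are what make re-runs idempotent: re-chunking the same text must produce the same IDs. One plausible way such a hash could be derived (an assumption for illustration — the actual scheme lives in `src/models/ldu.py` and the chunker):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable chunk identifier: digest of whitespace-normalized text.
    The normalization step and the 16-hex-digit truncation are
    assumptions; the real ldu.py may differ."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# Identical content (modulo whitespace) maps to the same identifier,
# so downstream stages can be re-run without duplicating LDUs.
a = content_hash("Net revenue grew 12%  year over year.")
b = content_hash("Net revenue grew 12% year over year.")
assert a == b
```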
Using uv (recommended):

```bash
uv sync --all-extras
```

This creates a virtual environment, installs the project in editable mode, and includes dev dependencies (pytest, reportlab, etc.). Then run commands with:

```bash
uv run pytest tests/ -v
uv run python -m scripts.run_pipeline ...
```

Alternatively, with pip:

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -e ".[dev]"
```

For most use cases, use the interactive shell wrapper:
```bash
./run_pipeline.sh
```

You will be prompted for:

- PDF path – for example `data/Annual_Report_JUNE-2017.pdf`.
- Optional question – for full QA mode.
- Mode selection:
  1. Triage only – run just the Triage Agent and emit a `DocumentProfile`.
  2. Extraction – run the multi-strategy extractor, update the extraction ledger, and cache an `ExtractedDocument` under `.refinery/extracted/`.
  3. Full QA – end-to-end pipeline (triage → extraction → chunking → PageIndex → QA) with optional LDU preview from the vector DB.
  4. Chunk only – build LDUs and ingest into the vector store from an existing cached extraction (no re-extraction).
  5. Vector DB – explore stored chunks (LDUs) via a small CLI explorer (summary, list docs, show doc, search, raw dump).
All modes load configuration from `rubric/extraction_rules.yaml` and update artifacts in `.refinery/`.
The REST API lives in the `doc_refiner_api` folder (a sibling of `DocRefinery`). Run it from DocRefinery's environment so the pipeline and its dependencies are available. From the workspace that contains both `DocRefinery` and `doc_refiner_api`:
```bash
cd DocRefinery
uv run python ../doc_refiner_api/run_api.py
```

Then open http://localhost:8000/docs for the interactive OpenAPI (Swagger) UI. Endpoints include:

- `GET /health`, `GET /ready` – liveness and readiness.
- `GET /api/documents` – list documents with status.
- `POST /api/documents` – submit a PDF by path (body: `{"path": "data/your.pdf"}`); returns 202 with a `doc_id`.
- `GET /api/documents/{doc_id}/status` – document status (pending | processing | ready | failed).
- `GET /api/documents/{doc_id}/extraction` – cached extraction JSON.
- `GET /api/documents/{doc_id}/markdown` – human-readable Markdown.
- `POST /api/documents/{doc_id}/query` – run a question (body: `{"question": "..."}`); returns the answer and provenance.
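A minimal, stdlib-only client sketch for the submit-then-poll flow above. The `doc_id` field and the status values come from the endpoint list; the exact JSON shape of each response is an assumption, and error handling is omitted:

```python
import json
import urllib.request

API = "http://localhost:8000"

def submit_document(pdf_path: str) -> str:
    """POST /api/documents with {"path": ...}; returns the doc_id
    from the 202 response (response shape assumed)."""
    req = urllib.request.Request(
        f"{API}/api/documents",
        data=json.dumps({"path": pdf_path}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["doc_id"]

def is_terminal(status: str) -> bool:
    """Statuses after which polling GET /api/documents/{doc_id}/status can stop."""
    return status in {"ready", "failed"}

# Usage (with the API running):
#   doc_id = submit_document("data/your.pdf")
#   ...poll /status until is_terminal(...), then query /query.
```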
Set up `.env` (see below) if the pipeline uses LLM/VLM backends.

LLM and VLM backends are configured via environment variables, which can be placed in a `.env` file at the project root. The code loads `.env` automatically via python-dotenv.
Example `.env` for an OpenRouter-style backend:

```ini
LLM_BACKEND=openrouter
LLM_API_BASE=https://openrouter.ai/api/v1
LLM_API_KEY=sk-your-key
LLM_TEXT_MODEL=gpt-4o-mini
LLM_VISION_MODEL=gpt-4o-mini
```

Example `.env` for Ollama:

```ini
LLM_BACKEND=ollama
LLM_API_BASE=http://localhost:11434
LLM_TEXT_MODEL=llama3.1:8b
LLM_VISION_MODEL=llama3.2-vision
```

You can switch providers or models by editing `.env` without touching the code.
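The project itself relies on python-dotenv's `load_dotenv()`; purely to illustrate the mechanism, here is a minimal stand-in that parses `KEY=VALUE` lines into a dict (blank lines and `#` comments ignored):

```python
def parse_env(text: str) -> dict[str, str]:
    """Minimal stand-in for python-dotenv parsing: KEY=VALUE lines,
    blank lines and #-comments ignored. Illustration only — the
    project actually calls python-dotenv's load_dotenv(), which
    also pushes the values into os.environ."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

config = parse_env(
    "# local Ollama setup\n"
    "LLM_BACKEND=ollama\n"
    "LLM_API_BASE=http://localhost:11434"
)
assert config["LLM_BACKEND"] == "ollama"
```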
You can also invoke the pipeline runner directly (e.g., from integration tests or notebooks):
```bash
python scripts/run_pipeline.py data/your_doc.pdf \
    --question "What are the main findings of this report?" \
    --show-ldu-preview
```

This will:

- Print the `DocumentProfile`.
- Run the multi-strategy extractor with escalation and log to `.refinery/extraction_ledger.jsonl`.
- Build LDUs and a basic `PageIndex`, ingest chunks into the vector store, and populate the `FactTable`.
- Ask the query agent your question and output an answer plus a `ProvenanceChain` (document, page, bbox, content_hash, snippet).
- Optionally preview a few stored LDUs as they appear in the vector DB.
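The `ProvenanceChain` model itself is defined in `src/models/provenance.py` (as a Pydantic model); as a rough dataclass sketch of the fields listed above, with types that are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceSpan:
    """One link in a ProvenanceChain, mirroring the fields the pipeline
    reports: document, page, bbox, content_hash, snippet.
    Field types here are assumptions; see src/models/provenance.py."""
    document: str
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in page coordinates
    content_hash: str
    snippet: str

# A hypothetical span pointing back at the exact source region of an answer.
span = ProvenanceSpan(
    document="Annual_Report_JUNE-2017.pdf",
    page=42,
    bbox=(72.0, 144.0, 540.0, 180.0),
    content_hash="ab12cd34ef56ab78",
    snippet="Net revenue grew 12% year over year.",
)
```

Because every answer carries such spans, a reviewer can open the cited page and bounding box rather than trusting the model's paraphrase.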
- Domain onboarding notes live in `DOMAIN_NOTES.md` (Phase 0).
- This project is intentionally modular so that:
  - New document types can be onboarded via `extraction_rules.yaml`.
  - Strategies can be swapped (e.g., a different VLM provider) without touching the pipeline core.
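That swap-ability rests on the shared interface in `src/strategies/base.py`. A hedged sketch of what such an interface might look like — the method name and return type here are assumptions, not the actual signature:

```python
from abc import ABC, abstractmethod

class ExtractionStrategy(ABC):
    """Sketch of the shared strategy interface (src/strategies/base.py);
    the real method names and return types may differ."""

    @abstractmethod
    def extract(self, pdf_path: str) -> tuple[str, float]:
        """Return (extracted_text, confidence in [0, 1])."""

class FastTextStrategy(ExtractionStrategy):
    """Stand-in for fast_text.py; a real implementation would call pdfplumber."""

    def extract(self, pdf_path: str) -> tuple[str, float]:
        return (f"text of {pdf_path}", 0.9)

# Swapping in a different VLM provider means adding another subclass;
# the pipeline core only ever sees the ExtractionStrategy interface.
text, confidence = FastTextStrategy().extract("data/your.pdf")
```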