High-performance document extraction to Markdown/JSON. Extract text, tables, and images from PDFs with layout detection and structure preservation.
- Fast PDF extraction - 2x faster than alternatives using MuPDF backend
- Structure preservation - headings, lists, paragraphs detected automatically
- Layout detection - reading order, columns, semantic blocks
- Multiple backends - MuPDF (default, fast) or pdftext (permissive license)
- Office documents - DOCX, PPTX, XLSX support
- Multiple outputs - Markdown, JSON, HTML, Chunks (for RAG)
- Apple Silicon - MLX optimization for M1/M2/M3
- MCP server - Claude Desktop integration
```bash
pip install verso
```

With optional dependencies:

```bash
pip install verso[ocr]     # OCR for scanned documents
pip install verso[office]  # Office document support
pip install verso[mcp]     # MCP server for Claude Desktop
pip install verso[full]    # Everything
```

```python
from verso import extract

# extract PDF to markdown
doc = extract("document.pdf")
print(doc.to_markdown())

# extract specific pages
from verso import Config
config = Config(page_range=[0, 1, 2])
doc = extract("document.pdf", config=config)

# iterate pages (memory efficient)
from verso import extract_pages
for page in extract_pages("large.pdf"):
    print(page.to_markdown())
```

```bash
# basic extraction
verso input.pdf -o output.md

# JSON output
verso input.pdf --format json -o output.json

# analyze document structure
verso --analyze input.pdf

# batch processing
verso ./input_dir/ -o ./output_dir/
```

Benchmarks on Apple M2 (15-page academic PDF):
| Backend | Time | Per Page |
|---|---|---|
| MuPDF (default) | 417ms | 28ms |
| pdftext | 681ms | 45ms |
Structure processing adds <1ms overhead.
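The per-page column in the table is the total time divided by the 15 pages of the test PDF. A minimal sketch of that arithmetic, plus a generic timing helper, is shown here. `per_page_ms` and `time_call_ms` are hypothetical helper names and not part of verso; the commented usage at the end assumes the `extract` function shown earlier.

```python
import time
from typing import Any, Callable


def per_page_ms(total_ms: float, pages: int) -> float:
    """Average milliseconds per page for a run over `pages` pages."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_ms / pages


def time_call_ms(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> float:
    """Best-of-N wall-clock time of fn(*args), in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


# Recompute the table's per-page column from its totals (15-page PDF).
print(round(per_page_ms(417, 15)))  # MuPDF row
print(round(per_page_ms(681, 15)))  # pdftext row

# Hypothetical usage against a real file (requires verso to be installed):
#   from verso import extract
#   total_ms = time_call_ms(extract, "paper.pdf")
```

Taking the best of several runs is one common way to reduce noise from caching and background load; whether `scripts/benchmark.py` does the same is not stated here.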
```python
from verso import Config, extract

config = Config(
    # backend selection
    pdf_backend="mupdf",       # "mupdf" (fast) or "pdftext" (permissive)
    # extraction options
    extract_images=True,
    extract_tables=True,
    # structure processing
    merge_blocks=True,         # merge adjacent paragraphs
    detect_headings=True,      # font-based heading detection
    detect_lists=True,         # bullet/numbered list detection
    reading_order=True,        # column-aware reading order
    # output
    output_format="markdown",  # markdown, json, html, chunks
)
doc = extract("document.pdf", config=config)
```

```python
config = Config.fast()      # speed optimized, no images
config = Config.accurate()  # full extraction with LLM
config = Config.ocr()       # for scanned documents
```

```python
from verso import extract

doc = extract("paper.pdf")
print(doc.to_markdown())
```

Output:

```markdown
## Introduction

This paper presents a novel approach to...

- First key point
- Second key point

### Methods

We conducted experiments using...
```

```python
from verso import extract

doc = extract("paper.pdf")
data = doc.to_json()
```

```json
{
  "filepath": "paper.pdf",
  "pages": [
    {
      "page_id": 0,
      "width": 612,
      "height": 792,
      "children": [
        {"type": "heading", "level": 2, "text": "Introduction"},
        {"type": "paragraph", "text": "This paper presents..."}
      ]
    }
  ]
}
```

```python
from verso import Config, extract

config = Config(output_format="chunks")
doc = extract("paper.pdf", config=config)
chunks = doc.to_chunks()
```

```json
{
  "chunks": [
    {
      "id": "abc123",
      "type": "paragraph",
      "text": "Content...",
      "page_id": 0,
      "section_hierarchy": {"1": "Introduction", "2": "Background"}
    }
  ]
}
```

Automatically choose the best pipeline based on document analysis:

```python
from verso import extract_smart, analyze_document

# analyze first
analysis = analyze_document("document.pdf")
print(f"Type: {analysis.content_type}")
print(f"Route: {analysis.recommended_route}")
print(f"OCR needed: {analysis.ocr_ratio:.0%} of pages")

# extract with smart routing
doc = extract_smart("document.pdf")
```

Routes:
- FAST - digital PDFs, text extraction only
- STANDARD - layout detection for complex pages
- OCR - scanned documents
- HYBRID - mixed digital/scanned pages
- VLM - vision-language model for complex layouts
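
The route names above suggest a decision over a few document signals. The sketch below shows one way such a decision could look; it is not verso's actual router. The function `choose_route`, the `complex_layout` and `use_vlm` inputs, and the 0.1 and 0.9 thresholds are all assumptions made for illustration. Only `ocr_ratio` and the route names come from this README.

```python
from enum import Enum


class Route(str, Enum):
    FAST = "FAST"
    STANDARD = "STANDARD"
    OCR = "OCR"
    HYBRID = "HYBRID"
    VLM = "VLM"


def choose_route(ocr_ratio: float, complex_layout: bool = False,
                 use_vlm: bool = False) -> Route:
    """Pick a route from the share of pages needing OCR (0.0 to 1.0).

    The thresholds are illustrative assumptions, not verso's real values.
    """
    if not 0.0 <= ocr_ratio <= 1.0:
        raise ValueError("ocr_ratio must be between 0 and 1")
    if complex_layout and use_vlm:
        return Route.VLM       # complex layout and a VLM was requested
    if ocr_ratio >= 0.9:
        return Route.OCR       # almost every page is scanned
    if ocr_ratio > 0.1:
        return Route.HYBRID    # mix of digital and scanned pages
    if complex_layout:
        return Route.STANDARD  # digital, but needs layout detection
    return Route.FAST          # plain digital PDF, text extraction only


print(choose_route(0.0).value)
print(choose_route(0.5).value)
```

In practice the recommended route is reported by `analyze_document` as `analysis.recommended_route`, so a hand-written rule like this is only useful for reasoning about what each route is for.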
For Claude Desktop integration:
```bash
# start server
verso-mcp
```

Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "verso": {
      "command": "verso-mcp"
    }
  }
}
```

On M1/M2/M3 Macs, verso uses MLX for acceleration:

```bash
pip install verso[apple]
verso --show-device
# Device: mlx (Apple Neural Engine + GPU)
```

```python
extract(source, config=None) -> Document
"""Extract content from a document."""

extract_pages(source, config=None) -> Iterator[Page]
"""Extract pages one at a time (memory efficient)."""

extract_smart(source, config=None) -> Document
"""Extract with intelligent routing."""

analyze_document(source, config=None) -> DocumentAnalysis
"""Analyze document without full extraction."""
```

```python
doc.to_markdown() -> str
doc.to_json() -> dict
doc.to_html() -> str
doc.to_chunks() -> dict
doc.pages -> list[Page]
len(doc) -> int
```

```python
page.children -> list[Block]
page.width -> float
page.height -> float
page.page_id -> int
```

Block types:

- `heading` - with `level` (1-6)
- `paragraph`
- `list` - contains `list_item` children
- `table` - with `rows`, `cols`
- `figure` - with optional `caption`
- `code` - with optional `language`
- `equation` - with optional `latex`
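
Assuming the `type` field in the JSON output carries the block type names listed above, a document can be walked with plain dictionary access. The sketch below runs on a hand-written sample shaped like the JSON example earlier in this README. `iter_blocks` is a hypothetical helper, and nesting through a `children` key on blocks (as for `list` and `list_item`) is an assumption about the schema.

```python
from typing import Any, Iterator

# Sample shaped like the to_json() example; the list block is an assumed shape.
SAMPLE = {
    "filepath": "paper.pdf",
    "pages": [
        {
            "page_id": 0,
            "width": 612,
            "height": 792,
            "children": [
                {"type": "heading", "level": 2, "text": "Introduction"},
                {"type": "paragraph", "text": "This paper presents..."},
                {"type": "list", "children": [
                    {"type": "list_item", "text": "First key point"},
                    {"type": "list_item", "text": "Second key point"},
                ]},
            ],
        }
    ],
}


def iter_blocks(data: dict[str, Any]) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (page_id, block) for every block, depth-first in document order."""
    for page in data.get("pages", []):
        stack = list(reversed(page.get("children", [])))
        while stack:
            block = stack.pop()
            yield page["page_id"], block
            # Push nested children reversed so they pop in document order.
            stack.extend(reversed(block.get("children", [])))


headings = [b["text"] for _, b in iter_blocks(SAMPLE) if b["type"] == "heading"]
items = [b["text"] for _, b in iter_blocks(SAMPLE) if b["type"] == "list_item"]
print(headings)
print(items)
```

An explicit stack avoids recursion limits on deeply nested documents and keeps the traversal order easy to reason about.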
```bash
# clone
git clone https://github.com/example/verso.git
cd verso

# install with uv
uv sync --all-extras

# run tests
uv run pytest

# run tests with coverage
uv run pytest --cov

# lint and format
uv run ruff check src
uv run ruff format src

# type check
uv run mypy src/verso
```

```bash
just install  # install all dependencies
just test     # run tests
just lint     # run linter
just fmt      # format code
just check    # run all checks
just build    # build package
```

```text
Source → Provider → Pipeline → Structure → Renderer → Output
            ↓          ↓           ↓          ↓
         MuPDF      Analyzer   Headings    Markdown
         pdftext    Router     Lists       JSON
         Office     OCR        Columns     HTML
         Image      Layout     Merge       Chunks
```
See DESIGN.md for detailed architecture.
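
The stage flow in the diagram can be read as function composition. The toy sketch below mirrors some of the stage names with stand-in functions. None of these functions, nor the `Block` dataclass, are verso's real classes; the Pipeline stage (analysis and routing) is omitted, and the heading rule (short, all-caps lines) is an assumption chosen only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Block:
    type: str       # "heading" or "paragraph"
    text: str
    level: int = 0  # heading level; 0 for non-headings


def provider(source: str) -> list[str]:
    """Stand-in for a backend such as MuPDF: split raw text into lines."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def structure(lines: list[str]) -> list[Block]:
    """Stand-in structure pass: short all-caps lines become headings."""
    blocks = []
    for line in lines:
        if line.isupper() and len(line) < 40:
            blocks.append(Block("heading", line.title(), level=2))
        else:
            blocks.append(Block("paragraph", line))
    return blocks


def render_markdown(blocks: list[Block]) -> str:
    """Stand-in renderer: emit Markdown from structured blocks."""
    parts = []
    for b in blocks:
        if b.type == "heading":
            parts.append(f"{'#' * b.level} {b.text}")
        else:
            parts.append(b.text)
    return "\n\n".join(parts)


def run_pipeline(source: str) -> str:
    # Source -> Provider -> Structure -> Renderer -> Output
    return render_markdown(structure(provider(source)))


print(run_pipeline("INTRODUCTION\nThis paper presents a novel approach."))
```

Keeping each stage a plain function from one intermediate form to the next is what would let backends or renderers be swapped independently, which matches the columns under each stage in the diagram.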
Run benchmarks:

```bash
uv run python scripts/benchmark.py
uv run python scripts/benchmark.py --full  # includes marker comparison
```

MIT - see LICENSE

Contributions welcome! Please read CONTRIBUTING.md first.
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request