verso

High-performance document extraction to markdown/json. Extract text, tables, images from PDFs with layout detection and structure preservation.

Features

Fast PDF extraction - 2x faster than alternatives using MuPDF backend
Structure preservation - headings, lists, paragraphs detected automatically
Layout detection - reading order, columns, semantic blocks
Multiple backends - MuPDF (default, fast) or pdftext (permissive license)
Office documents - DOCX, PPTX, XLSX support
Multiple outputs - Markdown, JSON, HTML, Chunks (for RAG)
Apple Silicon - MLX optimization for M1/M2/M3
MCP server - Claude Desktop integration

Installation

pip install verso

With optional dependencies:

pip install verso[ocr]      # OCR for scanned documents
pip install verso[office]   # Office document support
pip install verso[mcp]      # MCP server for Claude Desktop
pip install verso[full]     # Everything

Quick Start

Python API

from verso import extract

# extract PDF to markdown
doc = extract("document.pdf")
print(doc.to_markdown())

# extract specific pages
from verso import Config
config = Config(page_range=[0, 1, 2])
doc = extract("document.pdf", config=config)

# iterate pages (memory efficient)
from verso import extract_pages
for page in extract_pages("large.pdf"):
    print(page.to_markdown())

CLI

# basic extraction
verso input.pdf -o output.md

# JSON output
verso input.pdf --format json -o output.json

# analyze document structure
verso --analyze input.pdf

# batch processing
verso ./input_dir/ -o ./output_dir/

Performance

Benchmarks on Apple M2 (15-page academic PDF):

Backend	Time	Per Page
MuPDF (default)	417ms	28ms
pdftext	681ms	45ms

Structure processing adds <1ms overhead.

Configuration

from verso import Config

config = Config(
    # backend selection
    pdf_backend="mupdf",       # "mupdf" (fast) or "pdftext" (permissive)
    
    # extraction options
    extract_images=True,
    extract_tables=True,
    
    # structure processing
    merge_blocks=True,         # merge adjacent paragraphs
    detect_headings=True,      # font-based heading detection
    detect_lists=True,         # bullet/numbered list detection
    reading_order=True,        # column-aware reading order
    
    # output
    output_format="markdown",  # markdown, json, html, chunks
)

doc = extract("document.pdf", config=config)

Presets

config = Config.fast()      # speed optimized, no images
config = Config.accurate()  # full extraction with LLM
config = Config.ocr()       # for scanned documents

Output Formats

Markdown

doc = extract("paper.pdf")
print(doc.to_markdown())

Output:

## Introduction

This paper presents a novel approach to...

- First key point
- Second key point

### Methods

We conducted experiments using...

JSON

doc = extract("paper.pdf")
data = doc.to_json()

{
  "filepath": "paper.pdf",
  "pages": [
    {
      "page_id": 0,
      "width": 612,
      "height": 792,
      "children": [
        {"type": "heading", "level": 2, "text": "Introduction"},
        {"type": "paragraph", "text": "This paper presents..."}
      ]
    }
  ]
}

Chunks (for RAG)

from verso import Config

config = Config(output_format="chunks")
doc = extract("paper.pdf", config=config)
chunks = doc.to_chunks()

{
  "chunks": [
    {
      "id": "abc123",
      "type": "paragraph",
      "text": "Content...",
      "page_id": 0,
      "section_hierarchy": {"1": "Introduction", "2": "Background"}
    }
  ]
}

Intelligent Routing

Automatically choose the best pipeline based on document analysis:

from verso import extract_smart, analyze_document

# analyze first
analysis = analyze_document("document.pdf")
print(f"Type: {analysis.content_type}")
print(f"Route: {analysis.recommended_route}")
print(f"OCR needed: {analysis.ocr_ratio:.0%} of pages")

# extract with smart routing
doc = extract_smart("document.pdf")

Routes:

FAST - digital PDFs, text extraction only
STANDARD - layout detection for complex pages
OCR - scanned documents
HYBRID - mixed digital/scanned pages
VLM - vision-language model for complex layouts

MCP Server

For Claude Desktop integration:

# start server
verso-mcp

Add to Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "verso": {
      "command": "verso-mcp"
    }
  }
}

Apple Silicon

On M1/M2/M3 Macs, verso uses MLX for acceleration:

pip install verso[apple]
verso --show-device
# Device: mlx (Apple Neural Engine + GPU)

API Reference

Functions

extract(source, config=None) -> Document
    """Extract content from a document."""

extract_pages(source, config=None) -> Iterator[Page]
    """Extract pages one at a time (memory efficient)."""

extract_smart(source, config=None) -> Document
    """Extract with intelligent routing."""

analyze_document(source, config=None) -> DocumentAnalysis
    """Analyze document without full extraction."""

Document

doc.to_markdown() -> str
doc.to_json() -> dict
doc.to_html() -> str
doc.to_chunks() -> dict
doc.pages -> list[Page]
len(doc) -> int

Page

page.children -> list[Block]
page.width -> float
page.height -> float
page.page_id -> int

Block Types

heading - with level (1-6)
paragraph
list - contains list_item children
table - with rows, cols
figure - with optional caption
code - with optional language
equation - with optional latex

Development

# clone
git clone https://github.com/example/verso.git
cd verso

# install with uv
uv sync --all-extras

# run tests
uv run pytest

# run tests with coverage
uv run pytest --cov

# lint and format
uv run ruff check src
uv run ruff format src

# type check
uv run mypy src/verso

Using just

just install    # install all dependencies
just test       # run tests
just lint       # run linter
just fmt        # format code
just check      # run all checks
just build      # build package

Architecture

Source → Provider → Pipeline → Structure → Renderer → Output
           ↓           ↓          ↓           ↓
        MuPDF       Analyzer   Headings   Markdown
        pdftext     Router     Lists      JSON
        Office      OCR        Columns    HTML
        Image       Layout     Merge      Chunks

See DESIGN.md for detailed architecture.

Benchmarks

Run benchmarks:

uv run python scripts/benchmark.py
uv run python scripts/benchmark.py --full  # includes marker comparison

License

MIT - see LICENSE

Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

Fork the repository
Create a feature branch
Make your changes
Run tests and linting
Submit a pull request

Acknowledgments

PyMuPDF - fast PDF parsing
pdftext - alternative PDF backend
surya - OCR and layout detection

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
scripts		scripts
src/verso		src/verso
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
justfile		justfile
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

verso

Features

Installation

Quick Start

Python API

CLI

Performance

Configuration

Presets

Output Formats

Markdown

JSON

Chunks (for RAG)

Intelligent Routing

MCP Server

Apple Silicon

API Reference

Functions

Document

Page

Block Types

Development

Using just

Architecture

Benchmarks

License

Contributing

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

verso

Features

Installation

Quick Start

Python API

CLI

Performance

Configuration

Presets

Output Formats

Markdown

JSON

Chunks (for RAG)

Intelligent Routing

MCP Server

Apple Silicon

API Reference

Functions

Document

Page

Block Types

Development

Using just

Architecture

Benchmarks

License

Contributing

Acknowledgments

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages