verso

Python 3.10+ | License: MIT

High-performance document extraction to markdown/json. Extract text, tables, images from PDFs with layout detection and structure preservation.

Features

  • Fast PDF extraction - 2x faster than alternatives using MuPDF backend
  • Structure preservation - headings, lists, paragraphs detected automatically
  • Layout detection - reading order, columns, semantic blocks
  • Multiple backends - MuPDF (default, fast) or pdftext (permissive license)
  • Office documents - DOCX, PPTX, XLSX support
  • Multiple outputs - Markdown, JSON, HTML, Chunks (for RAG)
  • Apple Silicon - MLX optimization for M1/M2/M3
  • MCP server - Claude Desktop integration

Installation

pip install verso

With optional dependencies:

# quote the extras so shells such as zsh do not treat [] as a glob pattern
pip install "verso[ocr]"      # OCR for scanned documents
pip install "verso[office]"   # Office document support
pip install "verso[mcp]"      # MCP server for Claude Desktop
pip install "verso[full]"     # Everything

Quick Start

Python API

from verso import extract

# extract PDF to markdown
doc = extract("document.pdf")
print(doc.to_markdown())

# extract specific pages
from verso import Config
config = Config(page_range=[0, 1, 2])
doc = extract("document.pdf", config=config)

# iterate pages (memory efficient)
from verso import extract_pages
for page in extract_pages("large.pdf"):
    print(page.to_markdown())
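
Because pages arrive one at a time, the output can also be streamed to disk instead of printed. The following is a minimal sketch that assumes only what the example above shows, namely that each page has a `to_markdown()` method. `write_pages` and `FakePage` are hypothetical names, not part of verso; `FakePage` exists so the snippet runs without verso installed.

```python
from pathlib import Path
from typing import Iterable, Protocol


class HasMarkdown(Protocol):
    """Anything with a to_markdown() method, e.g. a page yielded by extract_pages."""

    def to_markdown(self) -> str: ...


def write_pages(pages: Iterable[HasMarkdown], out_path: str) -> int:
    """Write each page's markdown to out_path as it arrives; return the page count."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for page in pages:
            # one page in memory at a time; a blank line separates pages
            fh.write(page.to_markdown().rstrip() + "\n\n")
            count += 1
    return count


class FakePage:
    """Stand-in for a verso page (hypothetical), used only for this demo."""

    def __init__(self, text: str) -> None:
        self.text = text

    def to_markdown(self) -> str:
        return self.text
```

With verso installed, the call would presumably look like `write_pages(extract_pages("large.pdf"), "large.md")`.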

CLI

# basic extraction
verso input.pdf -o output.md

# JSON output
verso input.pdf --format json -o output.json

# analyze document structure
verso --analyze input.pdf

# batch processing
verso ./input_dir/ -o ./output_dir/

Performance

Benchmarks on Apple M2 (15-page academic PDF):

Backend            Total time   Per page
MuPDF (default)    417 ms       28 ms
pdftext            681 ms       45 ms

Structure processing adds <1ms overhead.
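
The per-page figures follow from dividing the total time by the 15 pages of the test document. The short calculation below reproduces them from the totals reported above and also gives the relative speed of the two backends on this one document; the variable names are illustrative only.

```python
PAGES = 15  # page count of the benchmark PDF
total_ms = {"mupdf": 417, "pdftext": 681}  # totals from the table above

# per-page time is total time divided by page count
per_page_ms = {name: ms / PAGES for name, ms in total_ms.items()}

# how many times faster MuPDF is than pdftext on this document
speedup = total_ms["pdftext"] / total_ms["mupdf"]

for name, ms in per_page_ms.items():
    print(f"{name}: {ms:.1f} ms/page")
print(f"mupdf vs pdftext: {speedup:.2f}x")
# prints:
# mupdf: 27.8 ms/page
# pdftext: 45.4 ms/page
# mupdf vs pdftext: 1.63x
```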

Configuration

from verso import Config, extract

config = Config(
    # backend selection
    pdf_backend="mupdf",       # "mupdf" (fast) or "pdftext" (permissive)
    
    # extraction options
    extract_images=True,
    extract_tables=True,
    
    # structure processing
    merge_blocks=True,         # merge adjacent paragraphs
    detect_headings=True,      # font-based heading detection
    detect_lists=True,         # bullet/numbered list detection
    reading_order=True,        # column-aware reading order
    
    # output
    output_format="markdown",  # markdown, json, html, chunks
)

doc = extract("document.pdf", config=config)

Presets

config = Config.fast()      # speed optimized, no images
config = Config.accurate()  # full extraction with LLM
config = Config.ocr()       # for scanned documents

Output Formats

Markdown

doc = extract("paper.pdf")
print(doc.to_markdown())

Output:

## Introduction

This paper presents a novel approach to...

- First key point
- Second key point

### Methods

We conducted experiments using...

JSON

doc = extract("paper.pdf")
data = doc.to_json()

Output:

{
  "filepath": "paper.pdf",
  "pages": [
    {
      "page_id": 0,
      "width": 612,
      "height": 792,
      "children": [
        {"type": "heading", "level": 2, "text": "Introduction"},
        {"type": "paragraph", "text": "This paper presents..."}
      ]
    }
  ]
}
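
Since the JSON output is a plain dict of pages and blocks, it can be traversed with ordinary Python. As a sketch, the snippet below collects every heading into a simple outline. It assumes the structure shown above (`pages`, `children`, `type`, `level`, `text`); `iter_headings` is a hypothetical helper, and `sample` is a hand-written copy of the example output rather than real extraction results.

```python
from typing import Iterator

# hand-written copy of the example output above
sample = {
    "filepath": "paper.pdf",
    "pages": [
        {
            "page_id": 0,
            "width": 612,
            "height": 792,
            "children": [
                {"type": "heading", "level": 2, "text": "Introduction"},
                {"type": "paragraph", "text": "This paper presents..."},
            ],
        }
    ],
}


def iter_headings(data: dict) -> Iterator[tuple[int, int, str]]:
    """Yield (page_id, level, text) for every heading block, in document order."""
    for page in data.get("pages", []):
        for block in page.get("children", []):
            if block.get("type") == "heading":
                yield page["page_id"], block["level"], block["text"]


outline = list(iter_headings(sample))
print(outline)  # [(0, 2, 'Introduction')]
```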

Chunks (for RAG)

from verso import Config, extract

config = Config(output_format="chunks")
doc = extract("paper.pdf", config=config)
chunks = doc.to_chunks()

Output:

{
  "chunks": [
    {
      "id": "abc123",
      "type": "paragraph",
      "text": "Content...",
      "page_id": 0,
      "section_hierarchy": {"1": "Introduction", "2": "Background"}
    }
  ]
}
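
In a RAG pipeline it is often useful to prepend the section path to each chunk before embedding it, so the chunk carries its context. The sketch below does that for the chunk structure shown above. It assumes the keys of `section_hierarchy` are heading levels stored as strings, which is an interpretation of the example and not confirmed behavior; `breadcrumb` and `embedding_text` are hypothetical helpers.

```python
def breadcrumb(chunk: dict, sep: str = " > ") -> str:
    """Join section titles from the outermost to the innermost level."""
    hierarchy = chunk.get("section_hierarchy", {})
    # keys are assumed to be numeric strings ("1", "2", ...); sort them as integers
    levels = sorted(hierarchy, key=int)
    return sep.join(hierarchy[level] for level in levels)


def embedding_text(chunk: dict) -> str:
    """Prefix the chunk text with its section path, if it has one."""
    trail = breadcrumb(chunk)
    return f"{trail}\n\n{chunk['text']}" if trail else chunk["text"]


# hand-written copy of the example chunk above
chunk = {
    "id": "abc123",
    "type": "paragraph",
    "text": "Content...",
    "page_id": 0,
    "section_hierarchy": {"1": "Introduction", "2": "Background"},
}

print(breadcrumb(chunk))  # Introduction > Background
```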

Intelligent Routing

Automatically choose the best pipeline based on document analysis:

from verso import extract_smart, analyze_document

# analyze first
analysis = analyze_document("document.pdf")
print(f"Type: {analysis.content_type}")
print(f"Route: {analysis.recommended_route}")
print(f"OCR needed: {analysis.ocr_ratio:.0%} of pages")

# extract with smart routing
doc = extract_smart("document.pdf")

Routes:

  • FAST - digital PDFs, text extraction only
  • STANDARD - layout detection for complex pages
  • OCR - scanned documents
  • HYBRID - mixed digital/scanned pages
  • VLM - vision-language model for complex layouts
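
To make the route list concrete, here is one way such a decision could be expressed. This is an illustrative sketch, not verso's actual router: the thresholds, the `complex_layout` and `needs_vlm` flags, and the `choose_route` function are all assumptions invented for this example. Only the route names and the idea of an OCR ratio come from the text above.

```python
from enum import Enum


class Route(Enum):
    FAST = "fast"
    STANDARD = "standard"
    OCR = "ocr"
    HYBRID = "hybrid"
    VLM = "vlm"


def choose_route(
    ocr_ratio: float,
    complex_layout: bool = False,  # hypothetical flag
    needs_vlm: bool = False,       # hypothetical flag
) -> Route:
    """Pick a route from the fraction of pages that need OCR (0.0 to 1.0)."""
    if needs_vlm:
        return Route.VLM
    if ocr_ratio >= 0.9:   # assumed threshold: nearly every page is scanned
        return Route.OCR
    if ocr_ratio > 0.0:    # some pages are scanned, some are digital
        return Route.HYBRID
    # fully digital: run layout detection only when the pages call for it
    return Route.STANDARD if complex_layout else Route.FAST


print(choose_route(0.0))   # Route.FAST
print(choose_route(0.3))   # Route.HYBRID
```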

MCP Server

For Claude Desktop integration:

# start server
verso-mcp

Add to Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "verso": {
      "command": "verso-mcp"
    }
  }
}
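
If the config file already lists other MCP servers, the `verso` entry should be merged in rather than pasted over them. A small sketch of that merge using only the standard library follows; `add_mcp_server` is a hypothetical helper, not something verso ships.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str) -> dict:
    """Add or update one entry under "mcpServers", keeping all other settings."""
    config: dict = {}
    if config_path.exists():
        raw = config_path.read_text(encoding="utf-8")
        if raw.strip():  # tolerate an empty file
            config = json.loads(raw)
    # create the "mcpServers" section if missing, then set only our entry
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the macOS path given above, the call would be `add_mcp_server(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json", "verso", "verso-mcp")`.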

Apple Silicon

On M1/M2/M3 Macs, verso uses MLX for acceleration:

pip install "verso[apple]"
verso --show-device
# Device: mlx (Apple Neural Engine + GPU)

API Reference

Functions

extract(source, config=None) -> Document
    """Extract content from a document."""

extract_pages(source, config=None) -> Iterator[Page]
    """Extract pages one at a time (memory efficient)."""

extract_smart(source, config=None) -> Document
    """Extract with intelligent routing."""

analyze_document(source, config=None) -> DocumentAnalysis
    """Analyze document without full extraction."""

Document

doc.to_markdown() -> str
doc.to_json() -> dict
doc.to_html() -> str
doc.to_chunks() -> dict
doc.pages -> list[Page]
len(doc) -> int

Page

page.children -> list[Block]
page.width -> float
page.height -> float
page.page_id -> int

Block Types

  • heading - with level (1-6)
  • paragraph
  • list - contains list_item children
  • table - with rows, cols
  • figure - with optional caption
  • code - with optional language
  • equation - with optional latex
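
As an illustration of how these block types map onto output, the sketch below renders a few of them to markdown from their dict form. The `type`, `level`, and `text` fields follow the JSON example earlier in this README; storing list items under `children` with their own `text` is an assumption. `render_block` and `render_blocks` are hypothetical helpers and are not verso's renderer.

```python
def render_block(block: dict) -> str:
    """Render one block dict to markdown (headings, paragraphs, lists only)."""
    kind = block.get("type")
    if kind == "heading":
        return "#" * block["level"] + " " + block["text"]
    if kind == "list":
        # assumed shape: list_item children, each carrying a "text" field
        return "\n".join("- " + item["text"] for item in block.get("children", []))
    # paragraphs and any type not handled here fall back to their raw text
    return block.get("text", "")


def render_blocks(blocks: list[dict]) -> str:
    """Join rendered blocks with blank lines between them."""
    return "\n\n".join(render_block(b) for b in blocks)


blocks = [
    {"type": "heading", "level": 2, "text": "Introduction"},
    {"type": "paragraph", "text": "This paper presents..."},
    {
        "type": "list",
        "children": [
            {"type": "list_item", "text": "First key point"},
            {"type": "list_item", "text": "Second key point"},
        ],
    },
]

print(render_blocks(blocks))
```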

Development

# clone
git clone https://github.com/example/verso.git
cd verso

# install with uv
uv sync --all-extras

# run tests
uv run pytest

# run tests with coverage
uv run pytest --cov

# lint and format
uv run ruff check src
uv run ruff format src

# type check
uv run mypy src/verso

Using just

just install    # install all dependencies
just test       # run tests
just lint       # run linter
just fmt        # format code
just check      # run all checks
just build      # build package

Architecture

Source → Provider → Pipeline → Structure → Renderer → Output
           ↓           ↓          ↓           ↓
        MuPDF       Analyzer   Headings   Markdown
        pdftext     Router     Lists      JSON
        Office      OCR        Columns    HTML
        Image       Layout     Merge      Chunks

See DESIGN.md for detailed architecture.

Benchmarks

Run benchmarks:

uv run python scripts/benchmark.py
uv run python scripts/benchmark.py --full  # includes marker comparison

License

MIT - see LICENSE

Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests and linting
  5. Submit a pull request

Acknowledgments

  • PyMuPDF - fast PDF parsing
  • pdftext - alternative PDF backend
  • surya - OCR and layout detection
