Convert academic-paper PDFs to clean Markdown with extracted PNG figures — ready for LLM / agent ingestion.
paper.pdf → paper_md/
├── paper.md # full text with image references
├── img_1.png
├── img_2.png
└── …
# Lightweight (PyMuPDF backend only)
pip install "paper2md @ git+https://github.com/expectedparrot/paper2md.git"
# With marker-pdf for academic-layout awareness (recommended)
pip install "paper2md[marker] @ git+https://github.com/expectedparrot/paper2md.git"
# With Reducto cloud API
pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git"

# Auto-selects best available backend (prefers marker)
paper2md paper.pdf
# Explicit backend, custom output dir
paper2md paper.pdf --backend pymupdf --output ./out
# Adjust figure DPI (pymupdf backend only, default 150)
paper2md paper.pdf --backend pymupdf --dpi 200
# Print markdown to stdout
paper2md paper.pdf --print

from paper2md.converter import convert
result = convert("paper.pdf")
result.markdown # full markdown string
result.images # {"img_1.png": Path(...), ...}
result.output_dir # Path to folder with paper.md + PNGs
result.backend_used # "marker" or "pymupdf"

Choose a specific backend:
result = convert("paper.pdf", backend="pymupdf")
result = convert("paper.pdf", backend="marker")
result = convert("paper.pdf", output_dir="/tmp/my_paper")

The Reducto backend is a separate module (not wired into the main convert() function):
pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git"
export REDUCTO_API_KEY=your_key_here

from paper2md.reducto import convert_with_reducto
from pathlib import Path
md, images = convert_with_reducto(Path("paper.pdf"), Path("./out"))

- The chosen backend converts the PDF to Markdown text and extracts any embedded images.
- All extracted figures are renamed to img_1.png, img_2.png, etc. and saved to the output directory.
- Image references in the Markdown are rewritten to match the canonical filenames.
- If the backend produced images that aren't referenced in the text (common with pymupdf), they are appended in an Extracted Figures section so no figure is lost.
- Non-ASCII characters common in academic PDFs (Greek letters, math symbols, special dashes) are transliterated to ASCII equivalents for LaTeX compatibility.
- The final paper.md and all PNGs are written to the output directory.
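
The post-processing steps above can be illustrated with a minimal sketch. This is not paper2md's actual code: the function name `canonicalize`, the `ASCII_MAP` table, the regular expression, and the `##` heading level are all assumptions made for illustration. It only handles names and text, and it does not convert or move any image files.

```python
import re
from pathlib import Path

# Hypothetical mapping for illustration; the table paper2md really uses
# is not documented in this README.
ASCII_MAP = {
    "\u03b1": "alpha",  # Greek small alpha
    "\u03b2": "beta",   # Greek small beta
    "\u2013": "-",      # en dash
    "\u2014": "--",     # em dash
    "\u2212": "-",      # minus sign
    "\u2264": "<=",     # less-than-or-equal
}

# Matches Markdown image references of the form ![alt](target)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def canonicalize(markdown: str, image_names: list[str]) -> tuple[str, dict[str, str]]:
    """Map backend image names to img_N.png and rewrite the Markdown to match."""
    # Canonical name for each backend-produced image, numbered from 1.
    renames = {old: f"img_{i}.png" for i, old in enumerate(image_names, start=1)}

    def rewrite(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        new = renames.get(Path(target).name)
        # Leave references to unknown targets untouched.
        return f"![{alt}]({new})" if new else match.group(0)

    text = IMAGE_REF.sub(rewrite, markdown)

    # Append any image the text never references so no figure is lost.
    referenced = {m.group(2) for m in IMAGE_REF.finditer(text)}
    orphans = [new for new in renames.values() if new not in referenced]
    if orphans:
        text += "\n\n## Extracted Figures\n\n"
        text += "\n\n".join(f"![]({name})" for name in orphans) + "\n"

    # Transliterate the mapped non-ASCII characters to ASCII.
    for char, replacement in ASCII_MAP.items():
        text = text.replace(char, replacement)
    return text, renames
```

Returning the rename map alongside the text lets a caller apply the same mapping when it writes the PNG files to the output directory.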
| Backend | Quality | Speed | Cost | Install |
|---|---|---|---|---|
| marker | Best | Slow | Free | pip install "paper2md[marker] @ git+https://github.com/expectedparrot/paper2md.git" |
| pymupdf | Good | Fast | Free | bundled |
| Reducto | Best | Fast | Paid | pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git" + API key |
marker-pdf is recommended for academic papers — it handles multi-column layouts, LaTeX equations, figure captions, and tables well. It is the default backend when installed.
pymupdf is bundled and works out of the box. Good for simple layouts.
Reducto is a cloud API option for high-volume or high-quality needs.
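
The "prefers marker" auto-selection can be sketched roughly as follows. This is an illustration, not the library's implementation: `pick_backend` and the `"auto"` value are hypothetical names, and the sketch assumes marker-pdf installs a top-level module called `marker`.

```python
import importlib.util


def pick_backend(requested: str = "auto") -> str:
    """Return the requested backend, or the best one that is importable."""
    # An explicit choice is passed through unchanged.
    if requested != "auto":
        return requested
    # Assumption: marker-pdf exposes a top-level module named "marker".
    if importlib.util.find_spec("marker") is not None:
        return "marker"
    # pymupdf is bundled, so it serves as the fallback.
    return "pymupdf"
```

Checking with `find_spec` avoids importing the heavy backend just to learn whether it is installed.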