Local PDF & EPUB to Markdown converter with automatic digital/scanned detection, OCR support, smart splitting, and page-range selection. Converts books to clean, self-contained Markdown files with embedded images using Marker (95.67% accuracy) and Surya OCR (90+ languages). No cloud APIs — runs entirely on your hardware.
- PDF & EPUB support — handles both formats natively
- Automatic PDF classification — detects digital, scanned, or mixed PDFs via PyMuPDF
- High-accuracy conversion — Marker with 95.67% benchmark accuracy
- Built-in OCR — Surya OCR supports 90+ languages (Turkish, English, Arabic, etc.)
- Page/chapter range selection — convert only a specific section of the book (e.g. pages 19-88)
- Smart splitting — split output into N parts, cutting at heading/paragraph boundaries instead of mid-sentence
- Self-contained output — images embedded as base64 directly in Markdown, no separate files
- Split preview — see exactly how parts will be divided before downloading
- Bilingual UI — Turkish / English interface with one-click toggle
- Dark/light theme — lavender-toned design, persistent toggle
- Drag & drop UI — clean single-page web interface
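As an illustration of the self-contained output feature, here is a minimal sketch of how image references in Markdown could be rewritten as base64 data URIs. It is not taken from sahaf's code: the function name `embed_images`, the `images` mapping (file name to raw bytes), and the extension-to-MIME table are all assumptions made for the example.

```python
import base64
import re

# Matches Markdown image syntax: ![alt](target)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

# Hypothetical lookup table; extend it for other formats as needed.
MIME_BY_EXT = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "webp": "image/webp",
}


def embed_images(markdown: str, images: dict[str, bytes]) -> str:
    """Replace references to known image names with base64 data URIs."""

    def to_data_uri(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        data = images.get(target)
        if data is None:
            # Unknown or remote images are left exactly as written.
            return match.group(0)
        ext = target.rsplit(".", 1)[-1].lower()
        mime = MIME_BY_EXT.get(ext, "application/octet-stream")
        encoded = base64.b64encode(data).decode("ascii")
        return f"![{alt}](data:{mime};base64,{encoded})"

    return IMAGE_PATTERN.sub(to_data_uri, markdown)
```

Calling `embed_images("![cover](cover.png)", {"cover.png": png_bytes})` returns the same image tag with the file contents inlined, so the resulting `.md` needs no companion files.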
`pip install sahaf`

Or from source:

`git clone https://github.com/arikusi/sahaf.git`
`cd sahaf`
`pip install -e .`

Marker models (~2-3GB) are downloaded automatically on first conversion.
Run `sahaf`, then open http://localhost:8000 in your browser.
- Upload: drag & drop a PDF or EPUB file
- Classify: PyMuPDF analyzes the PDF type; EPUB chapters are counted
- Select range (optional): pick specific pages or chapters to convert
- Convert: Marker processes PDFs; ebooklib + markdownify handle EPUBs
- Split (optional): choose how many parts to split the output into
- Download: get a single `.md` file or a ZIP with the split parts, all images embedded inline
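The classification step can be sketched roughly as follows. This is an assumption about how a digital/scanned/mixed decision might be made, not sahaf's actual logic: it treats a page as "digital" when PyMuPDF can extract at least `min_chars` characters from it, and both the threshold of 50 and the function names are hypothetical.

```python
def classify_pages(text_lengths: list[int], min_chars: int = 50) -> str:
    """Label a PDF from the number of extractable characters on each page."""
    if not text_lengths:
        return "empty"  # label invented for this sketch
    pages_with_text = sum(1 for n in text_lengths if n >= min_chars)
    if pages_with_text == len(text_lengths):
        return "digital"
    if pages_with_text == 0:
        return "scanned"
    return "mixed"


def classify_pdf(path: str, min_chars: int = 50) -> str:
    """Open a PDF with PyMuPDF and classify it by its text layer."""
    # Imported lazily so classify_pages stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        lengths = [len(page.get_text().strip()) for page in doc]
    return classify_pages(lengths, min_chars)
```

Keeping the decision in a pure function makes the threshold easy to test and tune separately from file handling. A real classifier would probably also look at embedded images per page, which this sketch ignores.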
| Method | Path | Description |
|---|---|---|
| POST | `/api/upload` | Upload PDF/EPUB, returns `task_id` |
| GET | `/api/classify/{task_id}` | Detect PDF type + page count, or EPUB chapter count |
| POST | `/api/convert/{task_id}?page_from=&page_to=` | Start conversion (optional page range) |
| GET | `/api/status/{task_id}` | Poll conversion progress |
| GET | `/api/result/{task_id}` | Get markdown + image list |
| GET | `/api/download/{task_id}` | Download `.md` with embedded images |
| GET | `/api/download/{task_id}/zip?parts=N` | Download ZIP with N split `.md` files |
| GET | `/api/split-preview/{task_id}?parts=N` | Preview split structure before download |
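A script could drive the convert and status endpoints along these lines. Only the paths and the `page_from`/`page_to` query parameters come from the table; the response shape is not documented here, so the `"status"` key and its `"done"`/`"error"` values are assumptions, as are all the helper names. Treat this as a sketch to adapt after inspecting a real response.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # default address of a local sahaf server


def convert_url(task_id, page_from=None, page_to=None, base=BASE_URL):
    """Build the /api/convert URL, adding only the range bounds that are set."""
    params = {}
    if page_from is not None:
        params["page_from"] = page_from
    if page_to is not None:
        params["page_to"] = page_to
    query = "?" + urlencode(params) if params else ""
    return f"{base}/api/convert/{task_id}{query}"


def fetch_json(url, method="GET"):
    """Send a request without a body and decode the JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_until_finished(task_id, fetch=fetch_json, interval=2.0,
                        max_polls=600, base=BASE_URL):
    """Poll /api/status until the task reports a final state.

    ASSUMPTION: the response is a JSON object with a "status" key whose
    final values are "done" or "error". Adjust to the real payload.
    """
    for _ in range(max_polls):
        state = fetch(f"{base}/api/status/{task_id}")
        if state.get("status") in ("done", "error"):
            return state
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

With a `task_id` returned by `/api/upload`, a conversion of pages 19-88 would be started with `fetch_json(convert_url(task_id, 19, 88), method="POST")` and followed by `wait_until_finished(task_id)`. The `max_polls` cap keeps the loop from running forever if the assumed status values turn out to be wrong.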
- Backend: FastAPI + Uvicorn
- PDF Classification: PyMuPDF
- PDF Conversion: Marker (marker-pdf) + Surya OCR
- EPUB Conversion: ebooklib + markdownify
- Smart Splitting: Custom algorithm — heading/HR/paragraph boundary detection
- Frontend: Vanilla HTML/CSS/JS + marked.js
- i18n: TR/EN with client-side toggle
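To make the idea of boundary-aware splitting concrete, here is one possible sketch. It is not sahaf's algorithm: it splits the text into blank-line-separated blocks, aims for N parts of equal length, and nudges each cut toward a nearby heading or horizontal rule. The function name, the 25% "slack" bonus, and the regular expressions are assumptions chosen for the example.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")
RULE = re.compile(r"^(-{3,}|\*{3,}|_{3,})\s*$")


def split_markdown(text: str, parts: int) -> list[str]:
    """Split Markdown into up to `parts` pieces, cutting only between blocks."""
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    if parts <= 1 or len(blocks) <= 1:
        return ["\n\n".join(blocks)]

    # offsets[i] = number of characters that precede block i
    offsets, total = [], 0
    for block in blocks:
        offsets.append(total)
        total += len(block)

    # How far a cut may drift from the ideal position to land on a heading.
    slack = total / parts * 0.25

    def cost(i: int, ideal: float) -> float:
        distance = abs(offsets[i] - ideal)
        if HEADING.match(blocks[i]):
            return distance - slack
        if RULE.match(blocks[i]):
            return distance - slack / 2
        return distance

    cuts, previous = [], 0
    for k in range(1, parts):
        candidates = range(previous + 1, len(blocks))
        if not candidates:
            break  # fewer blocks than requested parts
        ideal = total * k / parts
        previous = min(candidates, key=lambda i: cost(i, ideal))
        cuts.append(previous)

    bounds = [0, *cuts, len(blocks)]
    return ["\n\n".join(blocks[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Because cuts only ever fall between blocks, no part ends mid-sentence. One known gap in this sketch is that fenced code blocks containing blank lines would be treated as several blocks and could be split, so a fuller version would need to track fences.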
- Python 3.10+
- 4-6GB RAM (when Marker models are loaded)
- GPU strongly recommended for PDF — CPU-only is extremely slow (~1 hour for a 27-page mixed PDF on i5 + 40GB RAM). A CUDA-capable GPU converts the same file in minutes.
- EPUB conversion is lightweight — no GPU needed, runs instantly
GPL-3.0