🤝 This Blueprint was a result of an EleutherAI <> mozilla.ai collaboration, as part of their work on Open Datasets for LLM Training.
The tools & methods showcased in this blueprint were also part of EleutherAI's work on the Common Pile 0.1.
Parse and convert Documents with Docling
This blueprint shows you how to convert unstructured documents (PDF, DOCX, HTML, etc.) to Markdown or other formats using the Docling CLI or a locally hosted demo UI, with special attention to OCR capabilities and image handling options.
- Quick-start
- How it Works
- Features & Configuration
- Hardware requirements
- Troubleshooting
- License
- Contributing
Quick-start
We have built a simple graphical demo of Docling to showcase some basic functionality. For the full set of features, see the Local CLI section. You can try the demo in two ways:
You can also run the demo locally. First, clone the repository:
git clone https://github.com/mozilla-ai/document-to-markdown.git
Then, navigate to the directory, create a virtual environment, and install the requirements:
cd document-to-markdown/demo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Finally, run the demo:
python app.py
This will start a local server, and you can access the demo at http://127.0.0.1:7860.
Local CLI
Install Docling using pip:
pip install docling
Basic usage to convert a PDF to Markdown:
# Convert a local file
docling path/to/document.pdf
# Convert from a URL
docling https://arxiv.org/pdf/2408.09869
For advanced OCR with multiple languages:
docling path/to/document.pdf --ocr-lang en,fr,de
How it Works
Docling is a document processing tool that parses various formats and provides a unified representation. The CLI simplifies access to its features:
- Document Parsing: Docling parses your document and extracts text, tables, images, and structure
- Layout Analysis: For PDFs, it analyzes page layout to determine reading order
- OCR Processing: For scanned documents, it applies OCR to extract text
- Markdown Conversion: The parsed document is converted to Markdown format
- Image Handling: Images can be embedded, referenced, or replaced with placeholders
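The same steps can also be driven from Python instead of the CLI. The sketch below assumes Docling's Python API (`DocumentConverter` and `export_to_markdown`) as described in the Docling documentation; the helper name `convert_to_markdown` is hypothetical and only for illustration. It relies on the library's default pipeline settings.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown using Docling's defaults.

    Hypothetical helper, not part of Docling itself.
    """
    # Imported lazily so this module can be loaded even where Docling
    # is not installed (pip install docling).
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    # Parsing, layout analysis, and OCR (where applicable) happen in this call.
    result = converter.convert(source)
    # Serialize the unified document representation as Markdown.
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("path/to/document.pdf")` should return the Markdown as a string, which you can then write to a file of your choice.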
Features & Configuration
Note: These are only a few examples of Docling's features. Visit https://github.com/docling-project/docling for an up-to-date list of all features and configuration options.
Docling supports multiple OCR engines:
# Specify languages
docling path/to/document.pdf --ocr-lang en,fr,de
# Disable OCR entirely
docling path/to/document.pdf --no-ocr
# Use the Tesseract engine
docling path/to/document.pdf --ocr-engine tesseract
# Install RapidOCR first
pip install rapidocr_onnxruntime
# Then use it with Docling
docling path/to/document.pdf --ocr-engine rapidocr
# Install OcrMac first
pip install ocrmac
# Then use it with Docling
docling path/to/document.pdf --ocr-engine ocrmac
Using the VLM pipeline, we can use a Vision Language Model, SmolDocling, to describe images:
docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling
We can also use the EfficientNet-B0 Document Image Classifier to classify images:
docling path/to/document.pdf --enrich-picture-classes
# Enable code enrichment
docling path/to/document.pdf --enrich-code
# Enable formula enrichment
docling path/to/document.pdf --enrich-formula
On Apple Silicon Macs, this automatically uses MLX acceleration for better performance.
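The OCR and enrichment flags in this section can be combined in a single invocation. As a minimal sketch, the hypothetical helper below (`build_docling_command` is not part of Docling) assembles an argument list from the flags shown in this document and passes it to `subprocess`. It assumes the `docling` executable is on your `PATH`.

```python
import subprocess


def build_docling_command(
    source,
    ocr=True,
    ocr_engine=None,
    ocr_lang=None,
    enrich_code=False,
    enrich_formula=False,
    enrich_picture_classes=False,
):
    """Assemble an argument list for the docling CLI (hypothetical helper)."""
    cmd = ["docling", source]
    if not ocr:
        # With OCR disabled, engine and language settings are skipped.
        cmd.append("--no-ocr")
    else:
        if ocr_engine:
            cmd += ["--ocr-engine", ocr_engine]
        if ocr_lang:
            # The CLI takes languages as one comma-separated value.
            cmd += ["--ocr-lang", ",".join(ocr_lang)]
    if enrich_code:
        cmd.append("--enrich-code")
    if enrich_formula:
        cmd.append("--enrich-formula")
    if enrich_picture_classes:
        cmd.append("--enrich-picture-classes")
    return cmd


def run_docling(source, **options):
    """Run the assembled command; raises CalledProcessError on failure."""
    return subprocess.run(build_docling_command(source, **options), check=True)
```

Keeping command construction separate from execution makes it easy to print or log the exact command before running it.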
Control how images appear in your Markdown output:
docling path/to/document.pdf --image-export-mode embedded
Embeds images directly in the Markdown file using Base64 encoding, creating a self-contained document.
docling path/to/document.pdf --image-export-mode referenced
Saves images as separate files and references them using relative paths in the Markdown.
docling path/to/document.pdf --image-export-mode placeholder
Replaces images with placeholder text in the Markdown.
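To make the difference between the three image modes concrete, here is a small illustrative sketch of what each one produces in Markdown. It is not Docling's implementation: the function name, the default filename, the alt text, and the placeholder string are assumptions chosen for the example, and Docling's exact output may differ.

```python
import base64


def image_markdown(image_bytes: bytes, mode: str, filename: str = "image_000.png") -> str:
    """Return an illustrative Markdown snippet for one PNG image."""
    if mode == "embedded":
        # Inline the bytes as a Base64 data URI: self-contained, but the
        # Markdown file grows by roughly 4/3 of the image size.
        payload = base64.b64encode(image_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{payload})"
    if mode == "referenced":
        # Point at a separate file via a relative path; the file itself
        # must be saved alongside the Markdown.
        return f"![Image]({filename})"
    if mode == "placeholder":
        # Only mark where the image was (placeholder text is an assumption).
        return "<!-- image -->"
    raise ValueError(f"unknown image mode: {mode!r}")
```

Embedded output is convenient for sharing a single file, while referenced output keeps the Markdown small and the images reusable.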
Convert multiple files at once:
docling ./documents/ --from pdf --to md --output ./markdown_files
Hardware requirements
- OS: Windows, macOS, or Linux
- Python 3.10 or higher
- Minimum RAM: 8GB
- Disk space: 4GB for models and dependencies
- GPU: optional
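A quick pre-flight check for two of these requirements can be sketched with the standard library alone. The function names and the choice to check the current directory's disk are assumptions for illustration. RAM is not checked here because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # "Python 3.10 or higher"
MIN_FREE_DISK_GB = 4   # "4GB for models and dependencies"


def python_ok(version=None) -> bool:
    """True if the given (or running) Python version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def disk_ok(path=".", free_bytes=None) -> bool:
    """True if at least MIN_FREE_DISK_GB is free on the disk holding `path`."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= MIN_FREE_DISK_GB * 1024**3


if __name__ == "__main__":
    print(f"Python >= 3.10: {python_ok()}")
    print(f"Free disk >= {MIN_FREE_DISK_GB} GB: {disk_ok()}")
```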
Troubleshooting
If you encounter OCR problems:
# Try a different OCR engine
docling path/to/document.pdf --ocr-engine tesseract
# Force OCR on the entire page
docling path/to/document.pdf --force-ocr
License
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributing
Contributions are welcome! To get started, check out the CONTRIBUTING.md file.