Sample Python scripts for converting PDF/DOCX documents to Markdown with enhanced image processing capabilities using Docling.
- Adds a feature to the original Docling library to use Gemini models for describing images.
- PDF to Markdown conversion with image extraction
- Two processing modes:
  - Local VLM (Vision Language Model) processing
  - Remote VLM processing via API
- Automatic image captioning and description generation
- Support for database schema visualization and description
- High-quality image scaling and processing
- CUDA acceleration support for faster processing
- Python 3.x
- CUDA-capable GPU (recommended)
- Required Python packages:
  - docling
  - docling_core
  - flask (for remote processing)
  - requests
  - python-dotenv
  - flash-attn (FlashAttention, required for local VLM processing)
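
Assuming the packages come straight from PyPI (this repository may pin different names or versions), a typical installation looks like:

```bash
pip install docling docling_core flask requests python-dotenv
# FlashAttention usually needs torch and a CUDA toolchain installed first.
pip install flash-attn --no-build-isolation
```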
For remote processing, you need to set the following environment variables in your `.env` file:

```
GEMINI_API_KEY=your_api_key_here
GEMINI_MODEL_NAME=gemini_model_name
```

as well as this default value (also available in `.env.example`):

```
GEMINI_URL=https://generativelanguage.googleapis.com/v1/models/{model_name}:generateContent
```
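
The scripts presumably read these values with python-dotenv; a minimal sketch of how the Gemini endpoint can be resolved from the environment (variable names match the `.env` above, everything else is illustrative):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the working directory

api_key = os.environ["GEMINI_API_KEY"]
model_name = os.environ["GEMINI_MODEL_NAME"]
# GEMINI_URL contains a {model_name} placeholder that is filled in at runtime.
gemini_url = os.environ["GEMINI_URL"].format(model_name=model_name)
```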
- Place your PDF files in the `input/` directory.
- Open `local_vlm_pdf_to_md.py` and modify the `input_doc_paths` list to include your PDF files:

  ```python
  input_doc_paths = [
      Path("input/your-file.pdf"),
      # Add more files as needed
  ]
  ```

- Run the local processing script:

  ```bash
  python local_vlm_pdf_to_md.py
  ```

The processed files will be saved in the `output/` directory.
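
For reference, here is a minimal sketch of the kind of conversion loop such a script can implement with docling's standard API; the pipeline settings shown are illustrative assumptions, not necessarily what `local_vlm_pdf_to_md.py` does (the actual script additionally wires up the local Granite Vision model for image descriptions):

```python
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_paths = [Path("input/your-file.pdf")]

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2.0              # higher scale -> higher-quality extracted images
pipeline_options.generate_picture_images = True  # keep picture crops for embedding and captioning

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

for pdf_path in input_doc_paths:
    result = converter.convert(pdf_path)
    out_path = Path("output") / f"{pdf_path.stem}.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
```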
- Start the proxy server:

  ```bash
  python gemini_proxy.py
  ```

  Note: `gemini_proxy.py` is a Flask application that translates OpenAI-compatible payloads from docling into Gemini-compatible requests.
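
  As an illustration only, the translation the proxy performs looks roughly like the sketch below; the route path, port, and field handling are assumptions, not the actual contents of `gemini_proxy.py`:

  ```python
  import os

  import requests
  from dotenv import load_dotenv
  from flask import Flask, jsonify, request

  load_dotenv()
  app = Flask(__name__)

  @app.route("/v1/chat/completions", methods=["POST"])
  def chat_completions():
      payload = request.get_json()

      # Convert OpenAI-style message parts (text + base64 image_url) into Gemini "parts".
      parts = []
      for message in payload.get("messages", []):
          content = message.get("content", [])
          if isinstance(content, str):
              parts.append({"text": content})
              continue
          for part in content:
              if part.get("type") == "text":
                  parts.append({"text": part["text"]})
              elif part.get("type") == "image_url":
                  # Data URL: "data:image/png;base64,<data>"
                  header, data = part["image_url"]["url"].split(",", 1)
                  mime_type = header.split(";")[0].removeprefix("data:")
                  parts.append({"inline_data": {"mime_type": mime_type, "data": data}})

      url = os.environ["GEMINI_URL"].format(model_name=os.environ["GEMINI_MODEL_NAME"])
      response = requests.post(
          url,
          params={"key": os.environ["GEMINI_API_KEY"]},
          json={"contents": [{"parts": parts}]},
          timeout=120,
      )
      response.raise_for_status()
      text = response.json()["candidates"][0]["content"]["parts"][0]["text"]

      # Wrap the Gemini answer back into an OpenAI-style chat completion.
      return jsonify({"choices": [{"message": {"role": "assistant", "content": text}}]})

  if __name__ == "__main__":
      app.run(port=8000)
  ```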
- Place your PDF files in the `input/` directory.
- Open `remote_vlm_pdf_to_md.py` and modify the `input_doc_paths` list to include your PDF files:

  ```python
  input_doc_paths = [
      Path("input/your-file.pdf"),
      # Add more files as needed
  ]
  ```

- Run the remote processing script:

  ```bash
  python remote_vlm_pdf_to_md.py
  ```

The processed files will be saved in the `output/` directory.
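
Internally, the remote script presumably points docling's picture-description enrichment at the proxy. A sketch along these lines, assuming docling's `PictureDescriptionApiOptions` and a proxy listening on localhost port 8000 (URL, port, prompt file, and parameters are illustrative, not the script's actual values):

```python
from pathlib import Path

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionApiOptions,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.enable_remote_services = True  # allow calls to an external endpoint
pipeline_options.do_picture_description = True
pipeline_options.picture_description_options = PictureDescriptionApiOptions(
    url="http://localhost:8000/v1/chat/completions",             # the gemini_proxy endpoint (assumed)
    prompt=Path("prompt_templates/tmf_images.txt").read_text(),  # VLM instructions from the template
    params={"model": "gemini"},                                  # forwarded inside the OpenAI-style payload
    timeout=120,
)
```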
NOTE: The same process can be applied to convert Word (DOCX) files to Markdown using the `remote_vlm_word_to_md.py` script.
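
Word input goes through the same docling converter interface; a minimal, generic sketch (not the actual contents of `remote_vlm_word_to_md.py`):

```python
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("input/your-file.docx")
print(result.document.export_to_markdown())
```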
The processor generates Markdown files with:
- Extracted text content
- Embedded images
- Image captions
- Detailed image descriptions
- Database schema analysis (when applicable)
```
docling_processor/
├── input/                   # Input PDF files
├── output/                  # Generated Markdown files
├── prompt_templates/        # Prompts used by the VLM
├── __init__.py
├── local_vlm_pdf_to_md.py   # Local processing
├── remote_vlm_pdf_to_md.py  # Remote processing
├── gemini_proxy.py          # Gemini proxy
├── README.md
└── .env
```
- You can modify the VLM instructions (prompts) in the `prompt_templates/tmf_images.txt` file. The default prompt is optimized for TMForum (TMF) documents.
- The local processing mode uses the IBM Granite Vision model.
- The remote processing mode requires a running Gemini proxy server that translates OpenAI-compatible payloads into Gemini-compatible requests.
- Image processing quality can be adjusted via the `images_scale` parameter.
- Processing time may vary based on document size and complexity.
- FlashAttention is required for optimal performance with local VLM processing.