
A Docling extension for superior PDF/DOCX to Markdown conversion, featuring smart image understanding with Gemini VLM.


amirkiarafiei/docling-processor


Document Processor using Docling

Sample Python scripts that use Docling to convert PDF/DOCX documents to Markdown, with enhanced image processing.

  • Adds a feature to the original Docling library to use Gemini models for describing images.

Features

  • PDF to Markdown conversion with image extraction
  • Two processing modes:
    • Local VLM (Vision Language Model) processing
    • Remote VLM processing via API
  • Automatic image captioning and description generation
  • Support for database schema visualization and description
  • High-quality image scaling and processing
  • CUDA acceleration support for faster processing

Requirements

  • Python 3.x
  • CUDA-capable GPU (recommended)
  • Required Python packages:
    • docling
    • docling_core
    • flask (for remote processing)
    • requests
    • python-dotenv
    • flash-attn (FlashAttention; required for local VLM processing)

Configuration

Environment Variables

For remote processing, set the following environment variables in your .env file:

GEMINI_API_KEY=your_api_key_here
GEMINI_MODEL_NAME=gemini_model_name

Also set the following default value (provided in .env.example):

GEMINI_URL=https://generativelanguage.googleapis.com/v1/models/{model_name}:generateContent
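The {model_name} placeholder in GEMINI_URL suggests the URL is filled in from GEMINI_MODEL_NAME at runtime. A minimal sketch of that step, assuming the variables are already in the process environment (for example after calling load_dotenv() from python-dotenv); build_gemini_url is a hypothetical helper, not a function from this repository:

```python
import os

# Default template, copied from the .env.example value shown above.
DEFAULT_GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1/models/"
    "{model_name}:generateContent"
)


def build_gemini_url(env=os.environ):
    """Fill the {model_name} placeholder in GEMINI_URL (hypothetical helper)."""
    model_name = env.get("GEMINI_MODEL_NAME")
    if not model_name:
        raise KeyError("GEMINI_MODEL_NAME is not set")
    if not env.get("GEMINI_API_KEY"):
        raise KeyError("GEMINI_API_KEY is not set")
    # Fall back to the documented default when GEMINI_URL is absent.
    template = env.get("GEMINI_URL", DEFAULT_GEMINI_URL)
    return template.format(model_name=model_name)
```

Failing early on a missing key or model name surfaces configuration mistakes before any document is processed.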

Usage

Local Processing

  1. Place your PDF files in the input/ directory
  2. Open local_vlm_pdf_to_md.py and modify the input_doc_paths list to include your PDF files:
input_doc_paths = [
    Path("input/your-file.pdf"),
    # Add more files as needed
]
  3. Run the local processing script:
python local_vlm_pdf_to_md.py

The processed files will be saved in the output/ directory.
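As an alternative to editing input_doc_paths by hand, the list could be built from the contents of input/. The sketch below is an assumption, not the script's actual behavior: collect_inputs and output_path_for are hypothetical helpers, and naming each output after the input file's stem is a guess at the convention.

```python
from pathlib import Path


def collect_inputs(input_dir="input"):
    """Return every PDF in input_dir, sorted by name (hypothetical helper)."""
    return sorted(Path(input_dir).glob("*.pdf"))


def output_path_for(pdf_path, output_dir="output"):
    """Map input/<name>.pdf to output/<name>.md (assumed naming scheme)."""
    return Path(output_dir) / (Path(pdf_path).stem + ".md")


# Could replace the hand-written list in the script.
input_doc_paths = collect_inputs()
```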

Remote Processing

  1. Start the proxy server:
python gemini_proxy.py

Note: gemini_proxy.py is a Flask application that translates the OpenAI-compatible payloads sent by Docling into the Gemini-compatible format.

  2. Place your PDF files in the input/ directory
  3. Open remote_vlm_pdf_to_md.py and modify the input_doc_paths list to include your PDF files:
input_doc_paths = [
    Path("input/your-file.pdf"),
    # Add more files as needed
]
  4. Run the remote processing script:
python remote_vlm_pdf_to_md.py

The processed files will be saved in the output/ directory.

NOTE: The same process can be used to convert Word (DOCX) files to Markdown with the remote_vlm_word_to_md.py script.
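The translation the proxy performs can be sketched roughly as follows. This is an illustration under assumptions, not the code in gemini_proxy.py: it assumes the incoming request is an OpenAI-style chat payload whose images arrive as base64 data URLs, and it emits the contents/parts shape used by Gemini's generateContent endpoint. The function name openai_to_gemini is hypothetical.

```python
def openai_to_gemini(payload):
    """Convert an OpenAI-style chat payload to a Gemini request body (sketch)."""
    contents = []
    for message in payload.get("messages", []):
        # Gemini uses "model" for assistant turns; other roles map to "user" here.
        role = "model" if message.get("role") == "assistant" else "user"
        content = message.get("content", "")
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        parts = []
        for item in content:
            if item.get("type") == "text":
                parts.append({"text": item["text"]})
            elif item.get("type") == "image_url":
                url = item["image_url"]["url"]
                if not url.startswith("data:"):
                    raise ValueError("only base64 data URLs are handled")
                # Data URL layout: data:<mime>;base64,<payload>
                header, _, data = url.partition(",")
                mime_type = header[len("data:"):].split(";")[0]
                parts.append(
                    {"inline_data": {"mime_type": mime_type, "data": data}}
                )
        contents.append({"role": role, "parts": parts})
    return {"contents": contents}
```

A Flask route would call a function like this on the request JSON, forward the result to GEMINI_URL, and map the response back to the OpenAI-style shape the client expects.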

Output Format

The processor generates Markdown files with:

  • Extracted text content
  • Embedded images
  • Image captions
  • Detailed image descriptions
  • Database schema analysis (when applicable)

Directory Structure

docling_processor/
├── input/          # Input PDF files
├── output/         # Generated Markdown files
├── prompt_templates/ # Prompts used by VLM
├── __init__.py
├── local_vlm_pdf_to_md.py # Local processing
├── remote_vlm_pdf_to_md.py # Remote processing
├── gemini_proxy.py # Gemini proxy
├── README.md
└── .env

Notes

  • You can modify the VLM instructions (prompts) in the prompt_templates/tmf_images.txt file. The default prompt is optimized for TMForum (TMF) documents.
  • The local processing mode uses the IBM Granite Vision model
  • Remote processing mode requires a running Gemini proxy server that translates OpenAI-compatible payloads to Gemini-compatible format
  • Image processing quality can be adjusted via the images_scale parameter
  • Processing time may vary based on document size and complexity
  • FlashAttention is required for optimal performance with local VLM processing
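To make the images_scale and prompt notes concrete, here is a hedged sketch of how a Docling converter might be configured for picture description. It is not taken from the repository's scripts: load_prompt and build_pdf_converter are hypothetical helpers, the proxy URL and model name are placeholders supplied by the caller, and the timeout value is arbitrary. The Docling imports are deferred so the prompt helper works without Docling installed.

```python
from pathlib import Path


def load_prompt(path="prompt_templates/tmf_images.txt",
                default="Describe this image."):
    """Read the VLM prompt from a file, falling back to a default string."""
    prompt_file = Path(path)
    if prompt_file.is_file():
        return prompt_file.read_text(encoding="utf-8").strip()
    return default


def build_pdf_converter(images_scale=2.0, proxy_url=None, model=None):
    """Build a Docling converter with picture description enabled (sketch)."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        PictureDescriptionApiOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.images_scale = images_scale      # higher values render larger images
    options.generate_picture_images = True   # keep images for the Markdown output
    options.do_picture_description = True
    if proxy_url:
        # Remote mode: send images to an OpenAI-compatible endpoint (the proxy).
        options.enable_remote_services = True
        options.picture_description_options = PictureDescriptionApiOptions(
            url=proxy_url,
            params={"model": model},
            prompt=load_prompt(),
            timeout=90,
        )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options)
        }
    )
```

Without proxy_url, this sketch leaves Docling's default picture description model in place, which may differ from the model the local script selects.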
