
opendataloader-project/opendataloader-pdf

OpenDataLoader PDF



Safe, Open, High-Performance — PDF for AI

OpenDataLoader-PDF converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).

It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query. Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets. AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.


🌟 Key Features

  • 🧾 Rich, Structured Output — JSON, Markdown, or HTML
  • 🧩 Layout Reconstruction — Headings, Lists, Tables, Images, Reading Order
  • Fast & Lightweight — Rule-Based Heuristic, High-Throughput, No GPU
  • 🔒 Local-First Privacy — Runs fully on your machine
  • 🏷️ Tagged PDF — Advanced data extraction technology based on Tagged PDF
  • 🛡️ AI-Safety — Auto-filters likely prompt-injection content
  • 📊 Benchmark — Continuously researched to deliver high-quality extraction with low energy use
  • 🖍️ Annotated PDF Visualization — See detected structures overlaid on the original

Annotated PDF Preview


🚀 Upcoming Features

Scheduled for December

  • 🖨️ OCR for scanned PDFs — Extract data from image-only pages.
  • 🧠 Table AI option — Higher accuracy for tables with borderless or merged cells.

Quick Start with Python

Prerequisites

  • Java 11 or higher must be installed and available in your system's PATH.
  • Python 3.9+
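
Before installing, it can help to confirm that a suitable Java runtime is actually reachable from PATH. The following is a minimal sketch, not part of OpenDataLoader itself: the helper names `parse_java_major` and `java_is_ready` are hypothetical, and the check assumes the usual `java -version` banner format (for example `openjdk version "17.0.8"`).

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_ready(minimum=11):
    """Return True if a `java` executable on PATH reports at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # Most JDKs print the version banner to stderr, so both streams are checked.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    print("Java 11+ available:", java_is_ready())
```

If this prints `False`, install a Java 11 or newer runtime or fix PATH before running the converter.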

Installation

pip install -U opendataloader-pdf

Usage

input_path takes a list of paths. Each entry can be either the path to a single document or the path to a folder.

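import opendataloader_pdf

opendataloader_pdf.convert(
    # Each entry is a single PDF or a folder of PDFs.
    input_path=["path/to/document.pdf", "path/to/folder"],
    # Directory where the converted files are written.
    output_dir="path/to/output",
    # Comma-separated list of output formats.
    format="json,html,pdf,markdown",
)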
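
When input paths and formats come from user input or a config file, a thin wrapper can catch mistakes before the converter runs. This is a sketch under stated assumptions: `build_format_arg`, `existing_inputs`, and `convert_existing` are hypothetical helpers, and `KNOWN_FORMATS` only lists the format names that appear in the usage snippet above (the library may accept others). The only library call used is `opendataloader_pdf.convert` with the same keyword arguments as that snippet.

```python
from pathlib import Path

# Format names copied from the usage snippet; this list is an assumption,
# not an authoritative statement of what the library supports.
KNOWN_FORMATS = ("json", "html", "pdf", "markdown")


def build_format_arg(formats):
    """Join format names into the comma-separated string used by convert()."""
    unknown = [name for name in formats if name not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unrecognized format(s): {', '.join(unknown)}")
    return ",".join(formats)


def existing_inputs(paths):
    """Return the given paths that exist on disk, as strings, in order."""
    return [str(p) for p in paths if Path(p).exists()]


def convert_existing(paths, output_dir, formats):
    """Validate the arguments, then hand them to opendataloader_pdf.convert."""
    # Imported lazily so the helpers above can be used without the package.
    import opendataloader_pdf

    inputs = existing_inputs(paths)
    if not inputs:
        raise FileNotFoundError("none of the input paths exist")
    opendataloader_pdf.convert(
        input_path=inputs,
        output_dir=output_dir,
        format=build_format_arg(formats),
    )
```

For example, `convert_existing(["path/to/document.pdf"], "path/to/output", ["json", "markdown"])` would skip missing paths and pass `format="json,markdown"` to the converter.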

Quick Start with more languages & tools


Developing with OpenDataLoader


🤝 Contributing

We believe that great software is built together.

Your contributions are vital to the success of this project.

Please read CONTRIBUTING.md for details on how to contribute.


💖 Community & Support

Have questions or need a little help? We're here for you! 🤗


✨ Our Branding and Trademarks

We love our brand and want to protect it!

This project may contain trademarks, logos, or brand names for our products and services.

To ensure everyone is on the same page, please remember these simple rules:

  • Authorized Use: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
  • No Confusion: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
  • Third-Party Brands: Any use of trademarks or logos from other companies must follow that company’s specific policies.

⚖️ License

This project is licensed under the Mozilla Public License 2.0.

For the full license text, see LICENSE.

For information on third-party libraries and components, see: