pdf-extraction

Here are 378 public repositories matching this topic...

opendataloader-project / opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Updated Jul 23, 2026
Java

xberg-io / xberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

Updated Jul 23, 2026
Rust

Zipstack / unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

ocr data-engineering idp ai-agents structured-output pdf-extraction document-ai llm prompt-engineering generative-ai mcp-server json-extraction

Updated Jul 23, 2026
Python

firecrawl / pdf-inspector

Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.

nodejs python markdown rust pdf text-extraction pdf-parser pdf-extraction ocr-routing pdf-classification

Updated Jul 17, 2026
Rust

24eme / signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Jul 22, 2026
JavaScript

pytr-org / pytr

Use TradeRepublic in the terminal and mass-download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Jul 20, 2026
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated Jul 1, 2026

aiptimizer / TurboOCR

TurboOCR, >200 img/s OmnidocBench. TensorRT FP16, PP-OCRv6, HTTP + gRPC

ocr grpc nvidia text-recognition text-detection inference-server fp16 tensorrt rag fastapi pdf-extraction paddleocr easyocr document-ai document-parsing qwen-vl gpu-ocr

Updated Jul 20, 2026
C++

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Jul 21, 2026
Java

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Feb 26, 2026
Python

iamarunbrahma / pdf-to-markdown

Turn PDFs into clean, structured Markdown

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Jul 2, 2026
Python

appautomaton / document-SKILLs

Claude Code and Codex SKILLs for PDF, Excel, Word, and PowerPoint manipulation — extraction, forms, formulas, tracked changes, adapted from Anthropic skills.

excel docx pptx codex ai-agents document-processing pdf-extraction agent-skills claude-code claude-skills

Updated Jul 1, 2026
Python

jztan / pdf-mcp

An MCP server that lets Claude Code and other AI agents work through large PDFs without overflowing their context — search by meaning or keyword, read only the pages that matter, and cleanly pull out tables, images, and scanned text, even from multi-column and Japanese layouts.

python pdf ocr ai mcp opencode cjk copilot semantic-search table-extraction claude document-processing pymupdf pdf-extraction llm agentic-rag agentic-ai mcp-server codex-cli

Updated Jul 23, 2026
Python

NameetP / pdfmux

Self-healing PDF extraction that flags what it can't read instead of dropping it — and now certifies any extractor's output, catching silently-dropped pages. #2 of all tools, #1 free on opendataloader-bench (0.903). 7-tool MCP. MIT, free.

python pdf ocr mcp self-healing structured-extraction rag pdf-to-json pdf-extraction ai-agent llm document-parsing pdf-to-markdown docling opendataloader

Updated Jul 22, 2026
Python

heleninsights-dot / phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.

python pdf workflow research academic obsidian literature-review pdf-extraction

Updated Jun 30, 2026
Python

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

YounesBensafia / arxiv-reader-mcp

Want to search arXiv papers, fetch metadata, and extract full-text PDFs without leaving your editor? This MCP server connects any MCP-compatible client (Claude Code, etc.) directly to arXiv.

python ai mcp arxiv-api research-papers pdf-extraction arxiv-papers model-context-protocol mcp-server

Updated Jun 7, 2026
Python

wszqkzqk / qt-web-extractor

Multimodal web content extraction engine backed by Qt WebEngine.

mcp chromium web-scraping qtwebengine content-extraction headless-browser pdf-extraction pyside6 open-webui mcp-server

Updated Jul 21, 2026
Python

F2-AI-Inc / docray

X-ray for documents: lossless PDF & PPTX extraction to JSON with bounding boxes, fonts, and colors — CLI, HTTP API, and an in-browser playground. Rust + PDFium.

rust pdf json powerpoint presentations pptx bounding-boxes pdfium rag pdf-extraction llm document-parsing

Updated Jul 23, 2026
HTML

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Jul 15, 2026
Rust
