document-processing

Here are 276 public repositories matching this topic...

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Feb 2, 2026
Python

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

Topdu / OpenOCR

OpenOCR is an open-source toolkit for general OCR research and applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

ocr document-analysis document-processing scene-text-recognition scene-text-detection ocr-pytorch chineseocr document-parsing

Updated Feb 12, 2026
Python

dhlab-epfl / dhSegment

Generic framework for historical document processing

tensorflow python3 segmentation historical-data document-processing

Updated Jul 9, 2021
Python

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents.

document-processing document-data-extraction

Updated Nov 26, 2025
Python

MantisAI / sieves

Plug-and-play document AI with zero-shot models.

nlp machine-learning zero-shot-learning document-processing few-shot-learning llm generative-ai structured-generation

Updated Feb 7, 2026
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

awslabs / rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

multi-modal document-processing generative-ai intelligent-document-processing amazon-bedrock

Updated Feb 11, 2026
Python

Addepto / graph_builder

Open-source toolkit to extract structured knowledge graphs from documents and tables — power analytics, digital twins, and AI-driven assistants.

cad graph-database graph-visualization graph-api semantic-search enterprise-knowledge-graph document-processing digital-twin knowledge-graph-construction fastapi pdf-table-extraction knowledge-graphs graph-extraction intelligent-document-processing intelligent-document-recognition rag-chatbot intelligent-document-processor

Updated Sep 15, 2025
Python

parsee-ai / parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized for PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

structured-data document-processing multimodal llm

Updated Jan 7, 2026
Python

jmanhype / DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

nlp distributed-systems ai query-optimization knowledge-management document-processing vector-search

Updated Nov 4, 2025
Python

vericle / intellyweave

AI-powered platform for OSINT intelligence analysis. Features archive discovery with hypothesis-driven investigation, GLiNER entity extraction, Mapbox geospatial visualization, network analysis, and document processing. Built with FastAPI, Next.js, Weaviate, and DSPy.

Updated Jan 12, 2026
Python

seehiong / pdfusion

A powerful PDF processing engine that deconstructs documents into their core elements—text, images, and tables—and seamlessly reconstructs them into pristine, structured Markdown. Built with a React frontend and a robust Python (PyMuPDF) backend on Appwrite.

react python markdown open-source pdf backend hackathon serverless-functions document-processing pymupdf appwrite

Updated Sep 10, 2025
Python

voelspriet / aiwhisperer

DPG Campus Tool. Shrink massive PDFs to fit AI upload limits. Sanitize before uploading to reduce risk of exposing sensitive data.

nlp pdf privacy ocr osint ai text-extraction gemini ner claude investigation document-processing anonymization pii llm chatgpt notebooklm

Updated Jan 20, 2026
Python

belumume / claude-skills

Personal collection of Claude skills - growing as I discover patterns and solve real-world problems

python claude document-processing anthropic claude-ai claude-skills

Updated Jan 31, 2026
Python

ucbepic / BARGAIN

Low-Cost LLM-Powered Data Processing with Theoretical Guarantees

data ai document-processing llm

Updated Feb 4, 2026
Python

afrozas / proceedings

Semantic extraction from conference proceedings.

semantic conferences spacy document-processing

Updated Jul 26, 2020
Python

martin-papy / qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

python openai developer-tools knowledge-base file-conversion enterprise-ready semantic-search multi-project cli-tool document-processing embbedings git-integration rag jira-integration cursor-ide llm-integration mcp-server confluence-integration

Updated Feb 9, 2026
Python

IBM / docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

ai convert knowledge-graph document-processing docling

Updated Feb 13, 2026
Python

MBAigner / PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

python pdf csv table annotations cluster-analysis document-processing layout-analysis detection-model page-segmentation

Updated Sep 11, 2020
Python
