Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Updated Dec 8, 2025 - Python
Get your documents ready for gen AI
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Extract and convert data from any document (images, PDFs, Word documents, PowerPoint files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
Eden AI simplifies the use and deployment of AI technologies by providing a single API that connects to the best possible AI engines.
A Unified Toolkit for Deep Learning-Based Table Extraction
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF files with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.
Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
The metadata and text content extractor for almost every file type.
LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.
Python scripts to parse and structure invoice data from PDFs using OpenAI, Anthropic and Invofox APIs
A collection of document parsers and hands-on examples for constructing structured data for your RAG applications.
A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.
Opinionated and Sophisticated Document Region Analyzer.
LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.
Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!
Smart OCR application built with Tesseract and Streamlit that extracts structured data from its inputs.
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
Combining OCR for text extraction with LLMs for accurate, efficient document structuring.