GitHub - docuglean-ai/docuglean-ocr: Intelligent document processing. Extract structured data like JSON, Markdown and HTML from documents using AI.

Intelligent document processing using state-of-the-art AI models.

If you find Docuglean helpful, please ⭐ this repository to show your support!

What is Docuglean?

Docuglean is a unified SDK for intelligent document processing using state-of-the-art AI models. It provides multilingual and multimodal capabilities through plug-and-play APIs for document OCR, structured data extraction, annotation, classification, summarization, and translation. It also ships with built-in tools and supports several document types out of the box.

Features

🚀 Easy to Use: Simple, intuitive API with detailed documentation. Just pass in a file and get markdown in response.
🔍 OCR Capabilities: Extract text from images and scanned documents
📊 Structured Data Extraction: Use Zod/Pydantic schemas for type-safe structured data extraction
📄 Multimodal Support: Process PDFs and images with ease
🤖 Multiple AI Providers: Support for OpenAI, Mistral, and Google Gemini, with more coming soon
🔒 Type Safety: Full TypeScript support with comprehensive types
Summarize: Get structured TL;DRs of long documents
Local OCR (PDF): Parse PDFs locally, with bounding box support, without calling external APIs

Available SDKs

📦 Node.js/TypeScript SDK

Package: docuglean-ocr

npm install docuglean-ocr

Repository: node-ocr/

Quick Start:

OCR Function - Pure OCR Processing

Extracts text from documents and images, returning content and metadata like bounding boxes (provider-dependent).

import { ocr, extract } from 'docuglean-ocr';

// Extract raw text from documents (supports URLs and local files)
const ocrResult = await ocr({
  filePath: 'https://arxiv.org/pdf/2302.12854',
  provider: 'openai',
  model: 'gpt-4o-mini',
  apiKey: 'your-api-key'
});

Extract Function - Structured Data Extraction

Extracts structured data from documents using custom schemas. Also handles summarization via custom prompts and a compact schema.

import { z } from 'zod';

// Define schema for structured extraction
const ReceiptSchema = z.object({
  date: z.string(),
  total: z.number(),
  items: z.array(z.object({
    name: z.string(),
    price: z.number()
  }))
});

// Extract structured data from documents
const extractResult = await extract({
  filePath: './receipt.pdf',
  provider: 'mistral',
  model: 'mistral-small-latest',
  apiKey: 'your-api-key',
  responseFormat: ReceiptSchema,
  prompt: 'Extract receipt details including date, total, and items'
});

// Summarization via extract: describe the summary shape with a compact schema
const SummarySchema = z.object({
  title: z.string().optional(),
  summary: z.string().min(50),
  keyPoints: z.array(z.string()).min(3).max(7)
});

const summary = await extract({
  filePath: './long-report.pdf',
  provider: 'openai',
  apiKey: 'your-api-key',
  responseFormat: SummarySchema,
  prompt: 'Provide a concise 3-sentence summary of this document and 3–7 key points.'
});

console.log('Summary:', summary.summary);

Note: you can also use extract with a targeted "search" prompt (e.g., "Find all occurrences of X and return matching passages") to perform semantic search within a document.

🐍 Python SDK

Package: docuglean-ocr

pip install docuglean-ocr

Repository: python-ocr/

Quick Start:

OCR Function - Pure OCR Processing

Extracts text from documents and images, returning content and metadata like bounding boxes (provider-dependent).

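import asyncio

from docuglean import ocr, extract

async def main():
    # Extract raw text from documents (supports URLs and local files)
    ocr_result = await ocr(
        file_path="./test/data/testocr.png",
        provider="gemini",
        model="gemini-2.5-flash",
        api_key="your-api-key"
    )

# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())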
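The snippets here pass the API key as a string literal. A minimal sketch of reading it from the environment instead follows; the helper name `get_api_key` and the environment variable names are assumptions made for illustration and are not part of the Docuglean SDK.

```python
import os

# Hypothetical mapping from provider name to environment variable.
# These variable names are assumptions, not defined by Docuglean.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def get_api_key(provider: str) -> str:
    """Return the API key for `provider` from the environment, or raise."""
    try:
        var = ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    return key
```

The returned value can then be passed as `api_key=get_api_key("gemini")`, which keeps the key out of source control.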

Extract Function - Structured Data Extraction

Extracts structured data from documents using custom schemas. Requires a response format schema and returns parsed data.

import asyncio
from typing import List

from docuglean import extract
from pydantic import BaseModel

# Define schema for structured extraction
class Item(BaseModel):
    name: str
    price: float

class Receipt(BaseModel):
    date: str
    total: float
    items: List[Item]

async def main():
    # Extract structured data from documents
    extract_result = await extract(
        file_path="./receipt.pdf",
        provider="mistral",
        model="mistral-small-latest",
        api_key="your-api-key",
        response_format=Receipt,
        prompt="Extract receipt details including date, total, and items"
    )

# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
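Model output can contain arithmetic slips, so it may be worth sanity-checking extracted values before using them. The sketch below assumes the extracted receipt is available as a plain dict with the same fields as the Receipt schema; how to obtain that dict from `extract_result` is not shown in this README, so that step is left out. `items_match_total` is a hypothetical helper, not an SDK function.

```python
import math


def items_match_total(receipt: dict, tolerance: float = 0.01) -> bool:
    """Check that the item prices add up to the stated total.

    This ignores tax, tips, and discounts, so a mismatch is a hint to
    review the document rather than proof of an extraction error.
    """
    items_sum = sum(item["price"] for item in receipt["items"])
    return math.isclose(items_sum, receipt["total"], abs_tol=tolerance)


# Sample data shaped like the Receipt schema (made up for illustration).
sample = {
    "date": "2024-01-15",
    "total": 7.5,
    "items": [
        {"name": "Coffee", "price": 3.0},
        {"name": "Bagel", "price": 4.5},
    ],
}
print(items_match_total(sample))
```

An absolute tolerance is used rather than exact equality because prices are floats and small rounding differences are expected.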

Coming Soon

🏷️ classify(): Document type classifier (receipt, ID, invoice, etc.)
🤖 More Models and Providers: Integration with Meta's Llama, Together AI, OpenRouter, and more
🌍 Multilingual: Support for multiple languages
🎯 Smart Classification: Automatic document type detection

Provider Options

Currently supported providers and models:

OpenAI: gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-3.5-turbo, o1-mini, o1-preview
Mistral: mistral-ocr-latest, mistral-small-latest, ministral-8b-latest
Google Gemini: gemini-2.5-flash, gemini-2.5-pro, gemini-1.5-flash, gemini-1.5-pro
Hugging Face: Qwen/Qwen2.5-VL-3B-Instruct and other vision-language models (Python only)
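When the provider and model come from configuration or user input, it can help to check them against the list above before making a request. The following is a minimal sketch: the `SUPPORTED_MODELS` table simply copies the models named above, the provider keys follow the strings used in the Quick Start examples, and `is_supported` is a hypothetical helper rather than an SDK function. Hugging Face is left out because its model set is open-ended.

```python
# Models copied from the provider list in this README; not read from the SDK.
SUPPORTED_MODELS = {
    "openai": {
        "gpt-4o-mini", "gpt-4o", "gpt-4-turbo",
        "gpt-3.5-turbo", "o1-mini", "o1-preview",
    },
    "mistral": {
        "mistral-ocr-latest", "mistral-small-latest", "ministral-8b-latest",
    },
    "gemini": {
        "gemini-2.5-flash", "gemini-2.5-pro",
        "gemini-1.5-flash", "gemini-1.5-pro",
    },
}


def is_supported(provider: str, model: str) -> bool:
    """Return True if `model` is listed for `provider` in the table above."""
    return model in SUPPORTED_MODELS.get(provider, set())


print(is_supported("mistral", "mistral-small-latest"))
print(is_supported("openai", "gemini-2.5-flash"))
```

A static table like this will drift as providers add or retire models, so treat it as an early sanity check rather than a source of truth.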

Development

Node.js SDK

cd node-ocr
npm install
npm run build
npm test

Python SDK

cd python-ocr
uv sync
uv run pytest

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

Apache 2.0 - see the LICENSE file for details.

Stay Up to Date

⭐ Star this repo to get notified about new releases and updates!

