
Chunkr | Open Source Document Intelligence API

Production-ready API service for document layout analysis, OCR, and semantic chunking.
Convert PDFs, PPTs, Word docs & images into RAG/LLM-ready chunks.

Layout Analysis | OCR + Bounding Boxes | Structured HTML and Markdown | VLM Processing Controls

Try it out | Report Bug | Contact | Discord | Ask DeepWiki


(Super) Quick Start

  1. Go to chunkr.ai
  2. Make an account and copy your API key
  3. Install our Python SDK:
pip install chunkr-ai
  4. Use the SDK to process your documents:
from chunkr_ai import Chunkr

# Initialize with your API key from chunkr.ai
chunkr = Chunkr(api_key="your_api_key")

# Upload a document (URL or local file path)
url = "https://chunkr-web.s3.us-east-1.amazonaws.com/landing_page/input/science.pdf"
task = chunkr.upload(url)

# Export results in various formats
html = task.html(output_file="output.html")
markdown = task.markdown(output_file="output.md")
content = task.content(output_file="output.txt")
task.json(output_file="output.json")

# Clean up
chunkr.close()
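
Once a task completes, you can also work with the chunks directly, for example to feed them into a RAG pipeline. A minimal sketch, assuming the task output exposes chunks and segments as described in the docs:

from chunkr_ai import Chunkr

chunkr = Chunkr(api_key="your_api_key")

# Local file paths work as well as URLs
task = chunkr.upload("path/to/document.pdf")

# Iterate over the RAG-ready chunks (attribute names assumed from the docs)
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.segment_type, segment.content)

chunkr.close()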

Documentation

Visit our docs for more information and examples.

Open Source vs Commercial API vs Enterprise

| Feature           | Open Source            | Commercial API                | Enterprise                               |
|-------------------|------------------------|-------------------------------|------------------------------------------|
| Perfect for       | Development & testing  | Production applications       | Large-scale/high-security deployments    |
| Layout Analysis   | Basic models           | Advanced models               | Advanced + custom-tuned                  |
| OCR Accuracy      | Standard models        | Premium models                | Premium + domain-tuned                   |
| VLM Processing    | Basic vision models    | Enhanced VLM models           | Enhanced + custom fine-tunes             |
| Excel Support     | ❌                     | ✅ Native parser               | ✅ Native parser                          |
| Document Types    | PDF, PPT, Word, Images | PDF, PPT, Word, Images, Excel | PDF, PPT, Word, Images, Excel            |
| Infrastructure    | Self-hosted            | Fully managed                 | Fully managed (on-prem or Chunkr-hosted) |
| Support           | Discord community      | Priority email + community    | 24/7 dedicated founding team support     |
| Migration Support | Community resources    | Documentation + email         | Dedicated migration team                 |

Quick Start with Docker Compose

  1. Prerequisites: Docker and Docker Compose (plus an NVIDIA GPU with the NVIDIA Container Toolkit if you want GPU deployment)

  2. Clone the repo:

git clone https://github.com/lumina-ai-inc/chunkr
cd chunkr
  3. Set up environment variables:
# Copy the example environment file
cp .env.example .env

# Configure your LLM models
cp models.example.yaml models.yaml

For more information on how to set up LLMs, see the LLM Configuration section below.

  4. Start the services:
# For GPU deployment:
docker compose up -d

# For CPU-only deployment:
docker compose -f compose.yaml -f compose.cpu.yaml up -d

# For Mac ARM architecture (M1, M2, M3, etc.):
docker compose -f compose.yaml -f compose.cpu.yaml -f compose.mac.yaml up -d
  5. Access the services (to point the Python SDK at your self-hosted API, see the snippet after this list):

    • Web UI: http://localhost:5173
    • API: http://localhost:8000
  6. Stop the services when done:

# For GPU deployment:
docker compose down

# For CPU-only deployment:
docker compose -f compose.yaml -f compose.cpu.yaml down

# For Mac ARM architecture (M1, M2, M3, etc.):
docker compose -f compose.yaml -f compose.cpu.yaml -f compose.mac.yaml down
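
To verify that the services came up cleanly, you can inspect container status and logs with standard Docker Compose commands:

# Check container status
docker compose ps

# Follow logs from all services
docker compose logs -f

Once the API is up, you can point the Python SDK at your self-hosted instance instead of chunkr.ai. A minimal sketch, assuming the client accepts a base URL override (check the SDK docs for the exact parameter name):

from chunkr_ai import Chunkr

# "url" is assumed here; self-hosted deployments may not require a real API key
chunkr = Chunkr(url="http://localhost:8000", api_key="your_api_key")
task = chunkr.upload("path/to/document.pdf")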

LLM Configuration

Chunkr supports two ways to configure LLMs:

  1. models.yaml file: Advanced configuration for multiple LLMs with additional options
  2. Environment variables: Simple configuration for a single LLM

Using models.yaml (Recommended)

For more flexible configuration with multiple models, default/fallback options, and rate limits:

  1. Copy the example file to create your configuration:
cp models.example.yaml models.yaml
  2. Edit the models.yaml file with your configuration. Example:
models:
  - id: gpt-4o
    model: gpt-4o
    provider_url: https://api.openai.com/v1/chat/completions
    api_key: "your_openai_api_key_here"
    default: true
    rate-limit: 200 # requests per minute - optional

Benefits of using models.yaml:

  • Configure multiple LLM providers simultaneously
  • Set default and fallback models
  • Add distributed rate limits per model
  • Reference models by ID in API requests (see docs for more info)

Read the models.example.yaml file for more information on the available options.
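
For example, a configuration with a default model plus a fallback might look like the sketch below. The second entry and its fallback key are illustrative assumptions; check models.example.yaml for the exact key names supported:

models:
  - id: gpt-4o
    model: gpt-4o
    provider_url: https://api.openai.com/v1/chat/completions
    api_key: "your_openai_api_key_here"
    default: true
    rate-limit: 200 # requests per minute - optional
  - id: gemini-flash
    model: gemini-2.0-flash
    provider_url: https://generativelanguage.googleapis.com/v1beta/openai/chat/completions
    api_key: "your_google_api_key_here"
    fallback: true # key name assumed; see models.example.yaml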

Using environment variables (Basic)

You can use any OpenAI API compatible endpoint by setting the following variables in your .env file:

LLM__KEY:
LLM__MODEL:
LLM__URL:
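
For example, to use OpenAI's gpt-4o (illustrative values; substitute your own key):

LLM__KEY=your_openai_api_key_here
LLM__MODEL=gpt-4o
LLM__URL=https://api.openai.com/v1/chat/completions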

Common LLM API Providers

Below is a table of common LLM providers and their configuration details to get you started:

| Provider         | API URL                                                                  | Documentation     |
|------------------|--------------------------------------------------------------------------|-------------------|
| OpenAI           | https://api.openai.com/v1/chat/completions                               | OpenAI Docs       |
| Google AI Studio | https://generativelanguage.googleapis.com/v1beta/openai/chat/completions | Google AI Docs    |
| OpenRouter       | https://openrouter.ai/api/v1/chat/completions                            | OpenRouter Models |
| Self-Hosted      | http://localhost:8000/v1                                                 | vLLM or Ollama    |

Licensing

The core of this project is dual-licensed:

  1. GNU Affero General Public License v3.0 (AGPL-3.0)
  2. Commercial License

To use Chunkr without complying with the AGPL-3.0 license terms, you can contact us or visit our website.

Connect With Us