Document Ingester API

AI-powered document ingestion and knowledge base API using DeepSeek-OCR.

Features

📄 Document Upload: Support for PDF and Word documents with validation
🤖 AI-Powered OCR: DeepSeek-OCR model for high-quality text extraction
🧠 Knowledge Base: Automatic structuring of OCR results into searchable entries
⚡ Background Processing: Async task processing with progress tracking
🔍 Search & Retrieval: Full-text search across knowledge base entries
📊 Progress Monitoring: Real-time task status and progress updates
🛡️ Error Handling: Comprehensive error handling and logging
📈 Scalable Architecture: FastAPI + Tortoise ORM + PostgreSQL
🔧 Modern Tooling: uv package management, ruff linting, structured logging

Quick Start

Install uv (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and set up:

git clone <repository-url>
cd docs-ingester-api
uv sync

Configure the application:

# Create development configuration
cp config.yaml config.dev.yaml
# Edit config.dev.yaml with your development settings

# Or for production
cp config.yaml config.prod.yaml
# Edit config.prod.yaml with your production settings

Run the API:

# Development (uses config.dev.yaml)
uv run python main.py --env dev

# Production (uses config.prod.yaml)
uv run python main.py --env prod

Access the API:
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/api/v1/health

API Endpoints

Documents

POST /api/v1/documents/upload - Upload a document for processing
POST /api/v1/documents/{id}/process - Start OCR processing (async/background)
GET /api/v1/documents/ - List all documents with pagination
GET /api/v1/documents/{id} - Get detailed document information
GET /api/v1/documents/{id}/download - Download the original document file
DELETE /api/v1/documents/{id} - Delete a document and its data

Knowledge Base

GET /api/v1/documents/{id}/knowledge-base - Get knowledge base entries for a document
GET /api/v1/documents/knowledge-base/search - Search knowledge base entries

Background Tasks

GET /api/v1/documents/tasks/{task_id} - Check processing task status
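
Because processing runs in the background, a client typically polls the task endpoint until the task finishes. The following is a minimal sketch of such a loop, not the project's own client. The response fields `status` and `progress`, the status names `completed` and `failed`, and the helper names `make_fetcher` and `wait_for_task` are all assumptions made for illustration; check the interactive docs at /docs for the actual response schema.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen

# Assumed terminal status names; the real API may use different values.
TERMINAL_STATES = {"completed", "failed"}


def make_fetcher(base_url: str, task_id: str) -> Callable[[], dict]:
    """Build a function that fetches the task status as a dict (hypothetical helper)."""
    url = f"{base_url}/api/v1/documents/tasks/{task_id}"

    def fetch() -> dict:
        with urlopen(url) as response:
            return json.load(response)

    return fetch


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)
```

A caller would pass something like `make_fetcher("http://localhost:8000", task_id)` as `fetch_status`. Injecting the fetch function keeps the loop testable without a running server.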

Health & Monitoring

GET /health - Basic health check
GET /api/v1/health/ - Detailed health check

Configuration

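The application supports configuration through YAML files, environment variables, and command-line arguments. When the same setting is defined in more than one place, the source with the highest priority wins. Sources are listed here from highest to lowest priority: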

Command-line arguments
Environment variables
Environment-specific YAML file (config.{env}.yaml)
Base YAML file (config.yaml)
Default values
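
One way to picture this priority order is as a series of dictionaries merged from lowest to highest priority, so that later layers override earlier ones. The sketch below assumes nested settings are merged key by key rather than replaced wholesale; the function names `deep_merge` and `resolve_config` and the sample layer values are hypothetical, and the application's actual loader may behave differently.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them outright.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers passed from lowest to highest priority."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers, lowest priority first.
defaults = {"server": {"host": "127.0.0.1", "port": 8000}, "logging": {"level": "INFO"}}
base_yaml = {"server": {"host": "0.0.0.0"}}
env_yaml = {"server": {"reload": True}}
env_vars = {"logging": {"level": "WARNING"}}
cli_args = {"server": {"port": 8080}}

config = resolve_config(defaults, base_yaml, env_yaml, env_vars, cli_args)
print(config)
```

In this example the port comes from the command line, the log level from the environment, and the host from the base YAML file, while `reload` is contributed by the environment-specific file.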

YAML Configuration Files

config.yaml: Base configuration template
config.dev.yaml: Development-specific overrides (copy from config.yaml)
config.prod.yaml: Production-specific overrides (copy from config.yaml)

Example YAML Structure

project:
  name: "Document Ingester API"
  description: "AI-powered document ingestion and knowledge base API"
  version: "0.1.0"
  api_v1_str: "/api/v1"

server:
  host: "0.0.0.0"
  port: 8000
  reload: true

database:
  url: "postgresql://user:CHANGE_ME@localhost:5432/docs_ingester"

ocr:
  deepseek_model_path: "deepseek-ai/DeepSeek-OCR"
  vllm_max_concurrency: 10
  vllm_gpu_memory_utilization: 0.9
  dpi: 144
  prompt: "<image>\n<|grounding|>Convert the document to markdown."

file_upload:
  max_size: 52428800  # 50MB in bytes
  allowed_extensions:
    - ".pdf"
    - ".doc"
    - ".docx"

background_processing:
  max_workers: 4

security:
  # Use "${SECRET_KEY}" to read from environment variable in production
  secret_key: "CHANGE_THIS_IN_PRODUCTION"
  access_token_expire_minutes: 40320  # 28 days

logging:
  level: "INFO"
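
To illustrate how the file_upload settings above might be applied, here is a small validation sketch. It assumes the extension check is case-insensitive and that the size limit is inclusive; the function name `validate_upload` is hypothetical and is not taken from the project's code.

```python
from pathlib import Path

# Values copied from the example file_upload section.
MAX_SIZE = 52_428_800  # 50 MB in bytes (50 * 1024 * 1024)
ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx"}


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation errors; an empty list means the file is acceptable."""
    errors: list[str] = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"extension {extension!r} is not allowed")
    if size_bytes > MAX_SIZE:
        errors.append(f"file is {size_bytes} bytes, limit is {MAX_SIZE}")
    return errors
```

Returning a list of errors rather than raising on the first problem lets an API report every issue with an upload in a single response.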

Environment Variable Substitution

YAML configuration files can reference environment variables using ${VARIABLE_NAME} syntax. For example:

database:
  url: "${DATABASE_URL}"

security:
  secret_key: "${SECRET_KEY}"
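
The substitution step can be sketched as a recursive walk over the parsed YAML that replaces each ${VARIABLE_NAME} placeholder in string values. This is an illustrative sketch only: the function name `substitute_env` is hypothetical, and raising an error for an unset variable is an assumption, since the source does not say how missing variables are handled.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Replace ${NAME} placeholders in strings, recursing into dicts and lists."""
    env = os.environ if env is None else env

    if isinstance(value, str):
        def replace(match: re.Match) -> str:
            name = match.group(1)
            if name not in env:
                # Assumed behaviour: fail loudly instead of leaving the placeholder.
                raise KeyError(f"environment variable {name!r} is not set")
            return env[name]

        return _ENV_PATTERN.sub(replace, value)
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Passing an explicit `env` mapping makes the function easy to exercise in tests without touching the real process environment.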

Command-line Arguments

You can override configuration settings with command-line arguments:

# Run with production config
uv run python main.py --env prod

# Override specific settings
uv run python main.py --env dev --port 8080 --log-level DEBUG
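
For readers curious how such flags can be turned into configuration overrides, here is a minimal argparse sketch. Only the flag names --env, --port, and --log-level come from the commands above; the default of "dev", the mapping onto the server and logging sections, and the function names are assumptions, and main.py may be implemented differently.

```python
import argparse
from typing import Any, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Run the Document Ingester API")
    parser.add_argument("--env", default="dev", help="selects config.{env}.yaml")
    parser.add_argument("--port", type=int, default=None, help="override server.port")
    parser.add_argument("--log-level", default=None, help="override logging.level")
    return parser


def cli_overrides(argv: Optional[Sequence[str]] = None) -> tuple[str, dict[str, Any]]:
    """Return the environment name and only the settings given explicitly."""
    args = build_parser().parse_args(argv)
    overrides: dict[str, Any] = {}
    if args.port is not None:
        overrides.setdefault("server", {})["port"] = args.port
    if args.log_level is not None:
        overrides.setdefault("logging", {})["level"] = args.log_level
    return args.env, overrides
```

Leaving unset flags out of the overrides dictionary means they cannot accidentally mask values that come from environment variables or YAML files.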

Traditional Environment Variables

The application still supports traditional environment variables for compatibility.

Development

Running Tests

uv run pytest

Code Formatting

uv run ruff format .
uv run ruff check . --fix

Database Migrations

The application uses Tortoise ORM's built-in schema generation for development.

Architecture

FastAPI: Web framework for API endpoints
Tortoise ORM: Async ORM for PostgreSQL
DeepSeek-OCR: AI model for document OCR
vLLM: High-throughput inference engine
uv: Modern Python project management

License

[Add your license here]
