dvictor357/docs-ingester

Document Ingester API

AI-powered document ingestion and knowledge base API using DeepSeek-OCR.

Features

  • 📄 Document Upload: Support for PDF and Word documents with validation
  • 🤖 AI-Powered OCR: DeepSeek-OCR model for high-quality text extraction
  • 🧠 Knowledge Base: Automatic structuring of OCR results into searchable entries
  • ⚡ Background Processing: Async task processing with progress tracking
  • 🔍 Search & Retrieval: Full-text search across knowledge base entries
  • 📊 Progress Monitoring: Real-time task status and progress updates
  • 🛡️ Error Handling: Comprehensive error handling and logging
  • 📈 Scalable Architecture: FastAPI + Tortoise ORM + PostgreSQL
  • 🔧 Modern Tooling: uv package management, ruff linting, structured logging

Quick Start

  1. Install uv (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone and setup:

    git clone <repository-url>
    cd docs-ingester
    uv sync
  3. Configure the application:

    # Create development configuration
    cp config.yaml config.dev.yaml
    # Edit config.dev.yaml with your development settings
    
    # Or for production
    cp config.yaml config.prod.yaml
    # Edit config.prod.yaml with your production settings
  4. Run the API:

    # Development (uses config.dev.yaml)
    uv run python main.py --env dev
    
    # Production (uses config.prod.yaml)
    uv run python main.py --env prod
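  5. Access the API on the host and port set in the server section of your configuration (port 8000 in the example configuration), for example GET /health for a basic health check.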

API Endpoints

Documents

  • POST /api/v1/documents/upload - Upload a document for processing
  • POST /api/v1/documents/{id}/process - Start OCR processing (async/background)
  • GET /api/v1/documents/ - List all documents with pagination
  • GET /api/v1/documents/{id} - Get detailed document information
  • GET /api/v1/documents/{id}/download - Download the original document file
  • DELETE /api/v1/documents/{id} - Delete a document and its data
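
As a rough illustration of how a client might address the document routes listed above, the following sketch builds their URLs with the standard library. The helper name document_url and the limit/offset query parameters are hypothetical; the README does not say which pagination parameters the API accepts.

```python
from typing import Optional, Union
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v1"  # matches api_v1_str in the example configuration


def document_url(
    base_url: str,
    document_id: Optional[Union[int, str]] = None,
    action: Optional[str] = None,
    **query: Union[int, str],
) -> str:
    """Build a URL for one of the documented /documents routes."""
    parts = [base_url.rstrip("/") + API_PREFIX, "documents"]
    if document_id is not None:
        parts.append(quote(str(document_id), safe=""))
    if action is not None:
        parts.append(action)
    url = "/".join(parts)
    # The list route is documented with a trailing slash.
    if document_id is None and action is None:
        url += "/"
    if query:
        url += "?" + urlencode(query)
    return url


base = "http://localhost:8000"  # assumed local server address
print(document_url(base, action="upload"))
print(document_url(base, 42, "process"))
print(document_url(base, limit=10, offset=20))  # hypothetical pagination params
```

A typical flow would POST a file to the upload URL, POST to the process URL for the returned id, and then poll the task status route.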

Knowledge Base

  • GET /api/v1/documents/{id}/knowledge-base - Get knowledge base entries for a document
  • GET /api/v1/documents/knowledge-base/search - Search knowledge base entries

Background Tasks

  • GET /api/v1/documents/tasks/{task_id} - Check processing task status
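
Because processing runs in the background, a client usually polls the task route until the task finishes. The sketch below shows one way to do that. It is not the project's client code: the response fields ("status", "progress") and the status values are assumptions, and fetch_status stands in for whatever function performs the GET request.

```python
import time
from typing import Any, Callable, Mapping

# Assumed terminal status names; check the real API responses before relying on them.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(
    fetch_status: Callable[[str], Mapping[str, Any]],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Poll fetch_status(task_id) until a terminal status or the timeout is reached."""
    deadline = clock() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout} seconds")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
_responses = iter(
    [
        {"status": "pending", "progress": 0},
        {"status": "processing", "progress": 50},
        {"status": "completed", "progress": 100},
    ]
)
result = wait_for_task(lambda task_id: next(_responses), "example-task-id", sleep=lambda s: None)
print(result)
```

In a real client, fetch_status would request GET /api/v1/documents/tasks/{task_id} and return the decoded JSON body.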

Health & Monitoring

  • GET /health - Basic health check
  • GET /api/v1/health/ - Detailed health check

Configuration

The application is configured through YAML files, environment variables, and command-line arguments. When the same setting appears in more than one source, the higher-priority source wins (highest to lowest):

  1. Command-line arguments
  2. Environment variables
  3. Environment-specific YAML file (config.{env}.yaml)
  4. Base YAML file (config.yaml)
  5. Default values
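
The priority order above amounts to merging layers from lowest to highest priority. The following is a minimal sketch of that idea, not the application's actual loader; the function names and the sample values in each layer are illustrative assumptions.

```python
from typing import Any, Dict, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which values from override win, recursing into nested mappings."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge layers given from lowest to highest priority."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers, lowest priority first.
defaults = {"server": {"host": "127.0.0.1", "port": 8000}, "logging": {"level": "INFO"}}
base_yaml = {"server": {"host": "0.0.0.0"}}      # config.yaml
env_yaml = {"server": {"reload": True}}          # config.{env}.yaml
env_vars = {"logging": {"level": "WARNING"}}     # environment variables
cli_args = {"server": {"port": 8080}}            # command-line arguments

config = resolve_config(defaults, base_yaml, env_yaml, env_vars, cli_args)
print(config)
```

Merging nested sections key by key, rather than replacing a whole section, lets an override file change server.port without restating server.host.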

YAML Configuration Files

  1. config.yaml: Base configuration template
  2. config.dev.yaml: Development-specific overrides (copy from config.yaml)
  3. config.prod.yaml: Production-specific overrides (copy from config.yaml)

Example YAML Structure

project:
  name: "Document Ingester API"
  description: "AI-powered document ingestion and knowledge base API"
  version: "0.1.0"
  api_v1_str: "/api/v1"

server:
  host: "0.0.0.0"
  port: 8000
  reload: true

database:
  url: "postgresql://user:CHANGE_ME@localhost:5432/docs_ingester"

ocr:
  deepseek_model_path: "deepseek-ai/DeepSeek-OCR"
  vllm_max_concurrency: 10
  vllm_gpu_memory_utilization: 0.9
  dpi: 144
  prompt: "<image>\n<|grounding|>Convert the document to markdown."

file_upload:
  max_size: 52428800  # 50MB in bytes
  allowed_extensions:
    - ".pdf"
    - ".doc"
    - ".docx"

background_processing:
  max_workers: 4

security:
  # Use "${SECRET_KEY}" to read from environment variable in production
  secret_key: "CHANGE_THIS_IN_PRODUCTION"
  access_token_expire_minutes: 40320  # 28 days

logging:
  level: "INFO"
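
To show how the file_upload settings above could be applied, here is a small validation sketch. It is an assumption about the behaviour, not the project's code: the function name is hypothetical, and whether the real check is case-insensitive or treats max_size as an inclusive limit is not stated in this README.

```python
from pathlib import PurePath
from typing import AbstractSet, List

MAX_SIZE = 52428800  # 50 * 1024 * 1024 bytes, as in the example configuration
ALLOWED_EXTENSIONS = frozenset({".pdf", ".doc", ".docx"})


def validate_upload(
    filename: str,
    size_bytes: int,
    max_size: int = MAX_SIZE,
    allowed_extensions: AbstractSet[str] = ALLOWED_EXTENSIONS,
) -> List[str]:
    """Return a list of validation errors; an empty list means the upload is acceptable."""
    errors: List[str] = []
    extension = PurePath(filename).suffix.lower()  # assumed case-insensitive match
    if extension not in allowed_extensions:
        errors.append(f"unsupported file extension: {extension or '(none)'}")
    if size_bytes > max_size:  # assumed inclusive limit
        errors.append(f"file is {size_bytes} bytes, limit is {max_size} bytes")
    return errors


print(validate_upload("report.PDF", 1024))
print(validate_upload("notes.txt", 60 * 1024 * 1024))
```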

Environment Variable Substitution

YAML configuration files can reference environment variables using ${VARIABLE_NAME} syntax. For example:

database:
  url: "${DATABASE_URL}"

security:
  secret_key: "${SECRET_KEY}"
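
The substitution described above can be sketched as a recursive walk over the parsed YAML data. This is an illustration only: the function name is hypothetical, and raising an error for an unset variable is an assumption, since the README does not say how the application treats missing variables.

```python
import os
import re
from typing import Any, Mapping, Optional

_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Replace ${NAME} placeholders in strings, recursing into dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):

        def replace(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                # Assumed behaviour: fail loudly instead of keeping the placeholder.
                raise KeyError(f"environment variable {name!r} is not set")
            return env[name]

        return _VAR_PATTERN.sub(replace, value)
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


raw = {
    "database": {"url": "${DATABASE_URL}"},
    "security": {"secret_key": "${SECRET_KEY}"},
    "server": {"port": 8000},
}
fake_env = {
    "DATABASE_URL": "postgresql://user:pw@localhost:5432/docs_ingester",
    "SECRET_KEY": "example-secret",
}
resolved = substitute_env(raw, fake_env)
print(resolved)
```

Passing an explicit mapping instead of os.environ keeps the function easy to test without touching the real environment.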

Command-line Arguments

You can override configuration settings with command-line arguments:

# Run with production config
uv run python main.py --env prod

# Override specific settings
uv run python main.py --env dev --port 8080 --log-level DEBUG
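
A parser for the flags shown above might look like the sketch below. Only --env, --port, and --log-level come from this README. The default environment, the restriction to dev and prod, and the mapping onto the server and logging sections are assumptions made for illustration.

```python
import argparse
from typing import Any, Dict, List, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Document Ingester API (illustrative parser)")
    # Assumed: only dev and prod exist, with dev as the default.
    parser.add_argument("--env", choices=["dev", "prod"], default="dev")
    parser.add_argument("--port", type=int, default=None)
    parser.add_argument("--log-level", default=None)
    return parser


def cli_overrides(argv: List[str]) -> Tuple[str, Dict[str, Any]]:
    """Parse argv and return the environment name plus only the explicitly set overrides."""
    args = build_parser().parse_args(argv)
    overrides: Dict[str, Any] = {}
    if args.port is not None:
        overrides.setdefault("server", {})["port"] = args.port
    if args.log_level is not None:
        overrides.setdefault("logging", {})["level"] = args.log_level
    return args.env, overrides


env_name, overrides = cli_overrides(["--env", "dev", "--port", "8080", "--log-level", "DEBUG"])
print(env_name, overrides)
```

Leaving unset flags out of the overrides dict means they do not mask values from the YAML files or environment variables.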

Traditional Environment Variables

The application still supports traditional environment variables for compatibility. Per the priority order above, they override values from the YAML files.

Development

Running Tests

uv run pytest

Code Formatting

uv run ruff format .
uv run ruff check . --fix

Database Migrations

The application uses Tortoise ORM's built-in schema generation for development.

Architecture

  • FastAPI: Web framework for API endpoints
  • Tortoise ORM: Async ORM for PostgreSQL
  • DeepSeek-OCR: AI model for document OCR
  • vLLM: High-throughput inference engine
  • uv: Modern Python project management

License

[Add your license here]
