AI-powered document ingestion and knowledge base API using DeepSeek-OCR.
- 📄 Document Upload: Support for PDF and Word documents with validation
- 🤖 AI-Powered OCR: DeepSeek-OCR model for high-quality text extraction
- 🧠 Knowledge Base: Automatic structuring of OCR results into searchable entries
- ⚡ Background Processing: Async task processing with progress tracking
- 🔍 Search & Retrieval: Full-text search across knowledge base entries
- 📊 Progress Monitoring: Real-time task status and progress updates
- 🛡️ Error Handling: Comprehensive error handling and logging
- 📈 Scalable Architecture: FastAPI + Tortoise ORM + PostgreSQL
- 🔧 Modern Tooling: uv package management, ruff linting, structured logging
Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone and set up:
git clone <repository-url>
cd docs-ingester-api
uv sync
Configure the application:
# Create development configuration
cp config.yaml config.dev.yaml
# Edit config.dev.yaml with your development settings
# Or for production
cp config.yaml config.prod.yaml
# Edit config.prod.yaml with your production settings
Run the API:
# Development (uses config.dev.yaml)
uv run python main.py --env dev
# Production (uses config.prod.yaml)
uv run python main.py --env prod
Access the API:
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/api/v1/health
- POST /api/v1/documents/upload - Upload a document for processing
- POST /api/v1/documents/{id}/process - Start OCR processing (async/background)
- GET /api/v1/documents/ - List all documents with pagination
- GET /api/v1/documents/{id} - Get detailed document information
- GET /api/v1/documents/{id}/download - Download the original document file
- DELETE /api/v1/documents/{id} - Delete a document and its data

- GET /api/v1/documents/{id}/knowledge-base - Get knowledge base entries for a document
- GET /api/v1/documents/knowledge-base/search - Search knowledge base entries

- GET /api/v1/documents/tasks/{task_id} - Check processing task status

- GET /health - Basic health check
- GET /api/v1/health/ - Detailed health check
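As an illustration of how the process and task-status endpoints might be used together, here is a minimal polling sketch using only the Python standard library. The response shape is an assumption: the `status` field and the `completed` / `failed` values are hypothetical and not confirmed by this README, so check the interactive docs at /docs for the actual schema. The helper names (`process_url`, `task_url`, `fetch_json`, `wait_for_task`) are likewise made up for this example.

```python
import json
import time
import urllib.request
from typing import Any, Callable

# Base URL taken from the "Access the API" section; adjust for your deployment.
BASE_URL = "http://localhost:8000/api/v1"


def process_url(document_id: Any) -> str:
    """URL for POST /api/v1/documents/{id}/process."""
    return f"{BASE_URL}/documents/{document_id}/process"


def task_url(task_id: Any) -> str:
    """URL for GET /api/v1/documents/tasks/{task_id}."""
    return f"{BASE_URL}/documents/tasks/{task_id}"


def fetch_json(url: str, method: str = "GET") -> dict:
    """Send a request without a body and decode the JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_task(
    task_id: Any,
    fetch: Callable[[str], dict] = fetch_json,
    interval: float = 2.0,
    max_polls: int = 30,
    done_states: tuple = ("completed", "failed"),  # assumed status values
) -> dict:
    """Poll the task endpoint until it reports an assumed terminal status."""
    for _ in range(max_polls):
        payload = fetch(task_url(task_id))
        # "status" is a hypothetical field name; adapt it to the real schema.
        if payload.get("status") in done_states:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Against a running server, one would typically call `fetch_json(process_url(document_id), method="POST")` to start processing, read the task identifier from the response (its field name is not documented here), and pass it to `wait_for_task`. The `fetch` parameter exists so the polling loop can be exercised without a server.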
The application supports configuration through YAML files and environment variables. Configuration is loaded in this priority order (highest to lowest):
- Command-line arguments
- Environment variables
- Environment-specific YAML file (config.{env}.yaml)
- Base YAML file (config.yaml)
- Default values
- config.yaml: Base configuration template
- config.dev.yaml: Development-specific overrides (copy from config.yaml)
- config.prod.yaml: Production-specific overrides (copy from config.yaml)
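To make the priority order above concrete, the following sketch merges configuration layers from lowest to highest priority so that later layers override earlier ones. It is not the application's actual loader: `deep_merge`, `resolve_config`, and the sample layer values are hypothetical, and the real code may combine sources differently.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers passed in order from lowest to highest priority."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers, listed lowest priority first.
defaults = {"server": {"host": "127.0.0.1", "port": 8000}, "logging": {"level": "INFO"}}
base_yaml = {"server": {"host": "0.0.0.0"}}            # config.yaml
env_yaml = {"server": {"reload": True}}                # config.dev.yaml
env_vars = {"logging": {"level": "WARNING"}}           # environment variables
cli_args = {"server": {"port": 8080}, "logging": {"level": "DEBUG"}}  # --port, --log-level

config = resolve_config(defaults, base_yaml, env_yaml, env_vars, cli_args)
print(config)
```

In this example the command-line layer wins for `server.port` and `logging.level`, while `server.host` comes from the base YAML layer and `server.reload` from the environment-specific one.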
project:
  name: "Document Ingester API"
  description: "AI-powered document ingestion and knowledge base API"
  version: "0.1.0"
  api_v1_str: "/api/v1"
server:
  host: "0.0.0.0"
  port: 8000
  reload: true
database:
  url: "postgresql://user:CHANGE_ME@localhost:5432/docs_ingester"
ocr:
  deepseek_model_path: "deepseek-ai/DeepSeek-OCR"
  vllm_max_concurrency: 10
  vllm_gpu_memory_utilization: 0.9
  dpi: 144
  prompt: "<image>\n<|grounding|>Convert the document to markdown."
file_upload:
  max_size: 52428800 # 50MB in bytes
  allowed_extensions:
    - ".pdf"
    - ".doc"
    - ".docx"
background_processing:
  max_workers: 4
security:
  # Use "${SECRET_KEY}" to read from environment variable in production
  secret_key: "CHANGE_THIS_IN_PRODUCTION"
  access_token_expire_minutes: 40320 # 28 days
logging:
  level: "INFO"

YAML configuration files can reference environment variables using ${VARIABLE_NAME} syntax. For example:
database:
  url: "${DATABASE_URL}"
security:
  secret_key: "${SECRET_KEY}"

You can override configuration settings with command-line arguments:
# Run with production config
uv run python main.py --env prod
# Override specific settings
uv run python main.py --env dev --port 8080 --log-level DEBUG

The application still supports traditional environment variables for compatibility.
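The ${VARIABLE_NAME} references in the YAML files could be resolved along the following lines once a file has been parsed into nested dictionaries. This is a sketch under assumptions, not the application's implementation: `expand_env` is a hypothetical helper, and raising an error for an unset variable is a choice made for this example (the real loader might fall back to a default or leave the reference untouched).

```python
import os
import re
from collections.abc import Mapping
from typing import Any, Optional

# Matches references such as ${DATABASE_URL}.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${NAME} references in strings with values from `env`."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, str):
        def replace(match: re.Match) -> str:
            name = match.group(1)
            if name not in env:
                # Assumption: fail loudly instead of silently keeping "${NAME}".
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_REF.sub(replace, value)
    # Numbers, booleans and None pass through unchanged.
    return value


# Example with an explicit mapping instead of the real process environment.
raw = {
    "database": {"url": "${DATABASE_URL}"},
    "security": {"secret_key": "${SECRET_KEY}"},
    "server": {"port": 8000},
}
fake_env = {
    "DATABASE_URL": "postgresql://user:CHANGE_ME@db:5432/docs_ingester",
    "SECRET_KEY": "example-secret",
}
resolved = expand_env(raw, fake_env)
print(resolved["database"]["url"])
```

Passing the environment as a parameter keeps the function easy to test; calling `expand_env(raw)` without a second argument reads from `os.environ`.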
uv run pytest
uv run ruff format .
uv run ruff check . --fix

The application uses Tortoise ORM's built-in schema generation for development.
- FastAPI: Web framework for API endpoints
- Tortoise ORM: Async ORM for PostgreSQL
- DeepSeek-OCR: AI model for document OCR
- vLLM: High-throughput inference engine
- uv: Modern Python project management
[Add your license here]