Enterprise-grade MinerU document parsing service with asynchronous queue processing based on Celery and a fully decoupled API/Worker architecture.

wzdavid/mineru-api

Enterprise-grade document parsing service with asynchronous queue processing based on Celery, featuring a fully decoupled API/Worker architecture.

Features

  • 🚀 Asynchronous Processing: Distributed task queue based on Celery
  • 📄 Multi-format Support: PDF, Office, images, and various document formats
  • 🔄 High Availability: Supports task retry and fault recovery
  • 📊 Real-time Monitoring: Task status tracking and queue statistics
  • 🎯 Priority Queue: Supports task priority scheduling
  • 🔧 Easy to Extend: Modular design, easy to add new parsing engines

Quick Start

Prerequisites

  • Docker and Docker Compose
  • (Optional) NVIDIA GPU for GPU worker

Start Services

  1. Copy environment configuration:

    cp .env.example .env
  2. Start Redis and API (run from the repository root):

    (cd docker && docker compose up -d redis mineru-api)
  3. Start Worker (choose CPU or GPU):

    # CPU Worker (recommended for development)
    (cd docker && docker compose --profile mineru-cpu up -d)
    
    # GPU Worker (requires NVIDIA GPU)
    (cd docker && docker compose --profile mineru-gpu up -d)
  4. Verify services:

    curl http://localhost:8000/api/v1/health

That's it! The API is now running at http://localhost:8000.
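
Containers can take a few seconds to accept connections, so a script that drives the service may want to wait for the health endpoint instead of calling it once. The following is a minimal sketch using only the Python standard library. It assumes that a healthy service answers `GET /api/v1/health` with HTTP 200; the function names `is_healthy` and `wait_until_healthy` are illustrative and not part of this project.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=5.0):
    """Return True if GET {base_url}/api/v1/health answers with HTTP 200.

    Assumption: the service signals health with a 200 status code.
    """
    url = base_url.rstrip("/") + "/api/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_healthy(base_url, attempts=30, delay=2.0, probe=is_healthy):
    """Call the probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8000")` after `docker compose up -d` returns `True` once the API responds, or `False` after roughly a minute with the defaults shown.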

API Usage

MinerU-API provides two interfaces to suit different use cases:

1. Official MinerU API (Synchronous)

The /file_parse endpoint is compatible with the official MinerU API format. It submits tasks to the worker and waits for completion, returning results directly in the response.

Reference: MinerU Official API

curl -X POST "http://localhost:8000/file_parse" \
  -F "files=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang_list=ch" \
  -F "parse_method=auto" \
  -F "return_md=true"

Use cases: Simple integration, immediate results needed, compatible with existing MinerU clients.
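
The same request can be issued from Python. The sketch below mirrors the curl call above with the standard library only, so it builds the `multipart/form-data` body by hand. The form field names and values are taken from the curl example; the helper names `encode_multipart` and `parse_file_sync`, the 600-second timeout, and the assumption that the response body is JSON are illustrative and should be checked against the interactive documentation at /docs.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {file_type}\r\n\r\n'.encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def parse_file_sync(path, base_url="http://localhost:8000", timeout=600):
    """POST a document to /file_parse and block until the service answers."""
    fields = {  # same values as the curl example
        "backend": "pipeline",
        "lang_list": "ch",
        "parse_method": "auto",
        "return_md": "true",
    }
    doc = Path(path)
    body, content_type = encode_multipart(fields, "files", doc.name, doc.read_bytes())
    request = urllib.request.Request(
        base_url.rstrip("/") + "/file_parse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # Assumption: the endpoint returns a JSON document.
        return json.loads(resp.read().decode("utf-8"))
```

Because the call blocks until parsing finishes, the client timeout has to cover the longest document you expect to send.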

2. Async Queue API (Asynchronous)

The /api/v1/tasks/submit and /api/v1/tasks/{task_id} endpoints provide an asynchronous queue-based API, compatible with the mineru-tianshu project format.

Reference: mineru-tianshu API

Submit a Task:

curl -X POST "http://localhost:8000/api/v1/tasks/submit" \
  -F "file=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang=ch"

Query Task Status (replace {task_id} with the ID of the submitted task):

curl "http://localhost:8000/api/v1/tasks/{task_id}"

Use cases: Production deployments, batch processing, long-running tasks, better scalability.
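
With the asynchronous interface the client is responsible for polling. The sketch below shows one way to wait for a task once its ID is known. It assumes that the status endpoint returns JSON with a `status` field and that `completed` and `failed` are terminal values; both are assumptions for illustration, so confirm the real field names and values at /docs. The names `fetch_task` and `wait_for_task` are hypothetical.

```python
import json
import time
import urllib.request


def fetch_task(task_id, base_url="http://localhost:8000"):
    """GET /api/v1/tasks/{task_id} and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/api/v1/tasks/{task_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_task(task_id, fetch=fetch_task, interval=2.0, timeout=600.0,
                  done_states=("completed", "failed")):
    """Poll until the task reaches a terminal state or the timeout elapses.

    Assumption: the response carries a "status" field whose terminal
    values are listed in `done_states`.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id)
        if task.get("status") in done_states:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout} seconds")
        time.sleep(interval)
```

Passing `fetch` as a parameter keeps the loop independent of the HTTP layer, which makes it easy to exercise without a running service.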

View API Documentation

Visit http://localhost:8000/docs for interactive API documentation with full parameter details.

Basic Configuration

Environment Variables

The most important configuration options (see .env.example for all options):

# Redis Configuration
REDIS_URL=redis://redis:6379/0

# Storage Type: local or s3
MINERU_STORAGE_TYPE=local

# For S3 storage (distributed deployment)
MINERU_S3_ENDPOINT=http://minio:9000
MINERU_S3_ACCESS_KEY=minioadmin
MINERU_S3_SECRET_KEY=minioadmin

# CORS Configuration (production)
CORS_ALLOWED_ORIGINS=http://localhost:3000
ENVIRONMENT=production

# File Upload Limits
MAX_FILE_SIZE=104857600  # 100MB
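
To make the meaning of these variables concrete, here is a small sketch of how a Python process could read and validate them. It is not the project's actual configuration loader: the `Settings` class, the `load_settings` function, the fallback defaults, and the comma-separated format for `CORS_ALLOWED_ORIGINS` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    redis_url: str
    storage_type: str
    max_file_size: int          # bytes
    cors_allowed_origins: tuple


def load_settings(env=None):
    """Read the variables shown above from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    storage_type = env.get("MINERU_STORAGE_TYPE", "local")
    if storage_type not in ("local", "s3"):
        raise ValueError(f"MINERU_STORAGE_TYPE must be 'local' or 's3', got {storage_type!r}")

    if storage_type == "s3":
        required = ("MINERU_S3_ENDPOINT", "MINERU_S3_ACCESS_KEY", "MINERU_S3_SECRET_KEY")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise ValueError("S3 storage needs: " + ", ".join(missing))

    # Assumption: several origins would be separated by commas.
    origins = tuple(
        origin.strip()
        for origin in env.get("CORS_ALLOWED_ORIGINS", "").split(",")
        if origin.strip()
    )

    return Settings(
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        storage_type=storage_type,
        max_file_size=int(env.get("MAX_FILE_SIZE", 100 * 1024 * 1024)),
        cors_allowed_origins=origins,
    )
```

The value 104857600 is 100 × 1024 × 1024 bytes, which is where the 100MB comment comes from.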

Documentation

Architecture

  • API Service: Handles task submission and status queries (api/app.py)
  • Worker Service: Processes documents using MinerU/MarkItDown (worker/tasks.py)
  • Redis: Message queue and result storage
  • Shared Config: Unified configuration in shared/celeryconfig.py
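
The division of labour between these components can be illustrated without any infrastructure. The sketch below is an in-process stand-in and does not use Celery or Redis: a `queue.PriorityQueue` plays the role of the broker and a plain dictionary plays the role of the result store. All names (`submit`, `run_worker`, `stop_worker`), the status strings, and the convention that a lower number means higher priority are assumptions of this sketch, not the project's API.

```python
import itertools
import queue

task_queue = queue.PriorityQueue()  # stand-in for the Redis-backed broker
results = {}                        # stand-in for the result store
_seq = itertools.count()            # tie-breaker: equal priorities stay FIFO
_STOP = object()                    # sentinel that tells the worker to exit


def submit(task_id, payload, priority=5):
    """API side: record the task as pending, enqueue it, return at once."""
    results[task_id] = {"status": "pending"}
    task_queue.put((priority, next(_seq), task_id, payload))
    return task_id


def stop_worker():
    """Enqueue the sentinel behind all real tasks."""
    task_queue.put((float("inf"), next(_seq), _STOP, None))


def run_worker(parse):
    """Worker side: take tasks in priority order and store each outcome."""
    while True:
        _, _, task_id, payload = task_queue.get()
        if task_id is _STOP:
            break
        try:
            results[task_id] = {"status": "completed", "result": parse(payload)}
        except Exception as exc:
            # A failing document must not take the worker down.
            results[task_id] = {"status": "failed", "error": str(exc)}
```

The API side never calls the parser directly and the worker never handles HTTP. That separation is what allows the two services to be deployed and scaled independently.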

Development

# Install dependencies
pip install -r api/requirements.txt
pip install -r worker/requirements.txt
pip install -r cleanup/requirements.txt

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

This project is built on top of the following excellent open-source projects:

  • MinerU - The core document parsing engine that powers this service
  • mineru-tianshu - Inspiration and reference for the API architecture

We are grateful to the developers and contributors of these projects for their valuable work.

License

MIT License - see LICENSE file for details.

Third-Party Licenses

This project uses the following open-source libraries:

  • MinerU - Licensed under AGPL-3.0
  • MarkItDown - Licensed under MIT

MinerU is used as an external library and its source code is not included in this repository.
