Multi-Modal Agentic Document Understanding System
MADUS is a production-oriented multi-agent framework for document question answering over complex PDFs containing text, figures, and tables. It adapts the research architecture of MDocAgent (Han et al., 2025), adding a hierarchical orchestrator, into a scalable, deployable system built on LangGraph, FastAPI, and Docker.
MADUS implements Hierarchical Orchestration over five specialized agents. A single General Orchestrator reasons over the full task using the ReAct loop (Yao et al., 2022), decomposes it into modality-specific subtasks, and dispatches them in parallel. Results converge at a Critic agent that implements the Reflexion pattern (Shinn et al., 2023) before a final Summarizer produces the answer.
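The decompose-then-dispatch step described above can be sketched in plain Python with `asyncio`. This is a minimal illustration under stated assumptions, not the MADUS implementation (which is built on LangGraph): the agent functions, `decompose`, and `dispatch` are hypothetical names, and each agent returns a canned string instead of calling a model.

```python
import asyncio

# Hypothetical stand-ins for the modality agents; real agents would call an LLM.
async def text_agent(subtask: str) -> str:
    return f"text: {subtask}"

async def image_agent(subtask: str) -> str:
    return f"image: {subtask}"

async def table_agent(subtask: str) -> str:
    return f"table: {subtask}"

AGENTS = {"text": text_agent, "image": image_agent, "table": table_agent}

def decompose(question: str) -> dict[str, str]:
    """Hypothetical orchestrator step: map each needed modality to a subtask."""
    return {name: f"{question} [{name} evidence]" for name in AGENTS}

async def dispatch(question: str) -> dict[str, str]:
    """Run every modality subtask concurrently and collect results by agent name."""
    subtasks = decompose(question)
    outputs = await asyncio.gather(*(AGENTS[m](t) for m, t in subtasks.items()))
    return dict(zip(subtasks, outputs))

results = asyncio.run(dispatch("What is the main finding?"))
```

In this sketch the collected `results` dictionary is what would be handed to the Critic; a real orchestrator would also decide which modalities to skip for a given question.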
In a flat multi-agent system every agent communicates with every other agent, producing
The orchestrator is not a static router. It implements ReAct, meaning its routing decision at step
On a retry triggered by the critic,
The Critic implements verbal reinforcement: rather than updating model weights, it produces a natural-language reflection
The retry loop is bounded at two retries. If the answer is still insufficient after three total attempts (the initial pass plus two retries), the Summarizer surfaces the critique alongside the best available answer.
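The bounded retry behaviour can be sketched as a small loop. This is an illustration only: `Verdict`, `answer_with_retries`, and the callback signatures are hypothetical, and the sketch treats the last attempt as the "best available" answer, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_RETRIES = 2  # two retries after the first attempt, so three attempts in total

@dataclass
class Verdict:
    sufficient: bool
    critique: str  # natural-language reflection fed back into the next attempt

def answer_with_retries(
    question: str,
    attempt: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Verdict],
) -> tuple[str, Optional[str]]:
    """Return (answer, critique); critique is None when the critic accepted the answer."""
    feedback: Optional[str] = None
    answer = ""
    for _ in range(1 + MAX_RETRIES):
        answer = attempt(question, feedback)
        verdict = critic(question, answer)
        if verdict.sufficient:
            return answer, None
        feedback = verdict.critique
    # Retries exhausted: surface the critique alongside the last answer
    return answer, feedback

# Usage with a critic that never accepts, to show the bound
calls = []

def always_weak(question, feedback):
    calls.append(feedback)
    return f"draft {len(calls)}"

def strict_critic(question, answer):
    return Verdict(False, f"{answer} lacks evidence")

final_answer, final_critique = answer_with_retries("q", always_weak, strict_critic)
```

No weights change between attempts; the only thing carried forward is the critique string passed as `feedback`.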
Text retrieval uses hybrid Reciprocal Rank Fusion over BM25 and dense semantic search:
BM25 catches exact keyword matches that embeddings miss; semantic search catches paraphrase matches that BM25 misses. RRF fuses both ranked lists without requiring score calibration between rankers.
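As a rough sketch of the fusion step, each document's fused score is the sum of `1 / (k + rank)` over the rankers that returned it. The function name `rrf_fuse`, the chunk ids, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions here; the value MADUS actually uses is not stated.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["chunk_7", "chunk_2", "chunk_9"]   # hypothetical BM25 order
dense_ranking = ["chunk_2", "chunk_5", "chunk_7"]  # hypothetical embedding order
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Only ranks enter the formula, which is why the raw BM25 and cosine scores never need to be put on a common scale. Chunks found by both rankers (`chunk_2`, `chunk_7`) end up ahead of chunks found by only one.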
The table below lists approximate costs for a single cold run: one question against a typical 20-page academic PDF. A cache hit on a repeated document costs $0.00.
| Agent | Model | Approx. tokens | Approx. cost |
|---|---|---|---|
| Orchestrator | gpt-4o-mini | 600 in / 100 out | $0.001 |
| Text agent | gpt-4o-mini | 2,000 in / 300 out | $0.003 |
| Image agent (2 images, low detail) | gpt-4o | 800 in / 300 out | $0.020 |
| Image agent (4 images, high detail) | gpt-4o | 3,000 in / 300 out | $0.060 |
| Table agent | gpt-4o-mini | 1,000 in / 300 out | $0.002 |
| Critic | gpt-4o-mini | 1,500 in / 200 out | $0.002 |
| Summarizer | gpt-4o-mini | 2,000 in / 400 out | $0.003 |
| Embeddings (index + query) | text-embedding-3-small | ~15,000 tokens | $0.001 |
| Single cold run, low detail | | | ~$0.032 |
| Single cold run, high detail | | | ~$0.072 |
| Retry triggered (one extra pass) | | | +$0.025 |

Image agent cost dominates everything else. With `detail: "high"`, each image is tiled into 512x512 crops at 170 tokens per tile, so four dense figures can exceed the combined cost of all other agents. Use `detail: "low"` (fixed 85 tokens per image) unless the question specifically requires reading fine-grained chart content.
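To see how the tile count drives the token cost, here is a small estimator. The per-tile and low-detail figures come from the paragraph above; the resizing steps (fit within 2048x2048, then shrink so the shortest side is at most 768 px) and the 85-token base added in high-detail mode follow OpenAI's published sizing rule as understood here and should be treated as assumptions. `estimate_image_tokens` is a hypothetical helper, not part of MADUS.

```python
import math

BASE_TOKENS = 85        # fixed cost per image at detail "low"
TOKENS_PER_TILE = 170   # cost per 512x512 tile at detail "high"

def estimate_image_tokens(width: int, height: int, detail: str = "low") -> int:
    """Estimate input tokens for one image of the given pixel size."""
    if detail == "low":
        return BASE_TOKENS
    # Assumed step 1: downscale to fit within a 2048x2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Assumed step 2: downscale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    # Assumed: high detail pays the 85-token base on top of the tiles
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

square = estimate_image_tokens(1024, 1024, detail="high")  # 4 tiles after resizing
thumb = estimate_image_tokens(1024, 1024, detail="low")
```

Under these assumptions a 1024x1024 figure costs 765 tokens in high detail versus 85 in low detail, so four such figures land close to the 3,000 input tokens shown in the table.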
To run at zero API cost, set these two environment variables:

    LLM_BACKEND=local        # routes all agents through Ollama
    EMBEDDING_BACKEND=local  # uses BAAI/bge-small-en-v1.5 via transformers, no API key needed

This requires pulling one text-generation model with Ollama (for example `ollama pull llama3.2` or `ollama pull qwen2.5:1.5b`) and one image-text-to-text model (for example `ollama pull qwen2-vl:7b` or `ollama pull moondream`). Which models you use depends on what you have available and is configured in core/config.py. Tested on an RTX 4060 8GB (each model fits in VRAM; Ollama loads them sequentially).

    git clone https://github.com/<youruser>/madus && cd madus
    cp .env.example .env   # add OPENAI_API_KEY or set LLM_BACKEND=local
    docker compose up -d   # starts Redis, ChromaDB, n8n, API

    curl -X POST http://localhost:8000/api/analyze \
      -F "file=@your_document.pdf" \
      -F "question=What is the main finding?"

Tip: For n8n integration, check out the n8n workflow.

    madus/
    ├── services/
    │   ├── api/           FastAPI routes
    │   ├── reasoning/     LangGraph graph, nodes, tools
    │   └── extraction/    OCR, layout detection, table parsing
    ├── core/
    │   ├── models.py      DocumentState schema (system contract)
    │   ├── embeddings.py  OpenAI and local embedding backends
    │   ├── cache.py       Redis SHA-256 content cache
    │   └── config.py      LLM factory, env-based backend switching
    ├── configs/prompts/   Versioned prompt templates
    ├── tests/
    │   ├── unit/          Extraction tests, no LLM
    │   └── integration/   Full graph tests on real PDFs
    └── docker-compose.yml

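The layout lists core/cache.py as a Redis SHA-256 content cache. One way such a cache can work is sketched below, assuming the key is the SHA-256 digest of the raw PDF bytes. The function names and the `madus:doc:` key prefix are hypothetical, and a plain dict stands in for Redis so the example runs without a server.

```python
import hashlib
from typing import Callable

def content_key(pdf_bytes: bytes, prefix: str = "madus:doc:") -> str:
    """Derive a cache key from the file content, so renamed copies still hit."""
    return prefix + hashlib.sha256(pdf_bytes).hexdigest()

def cached_analysis(pdf_bytes: bytes, store: dict, analyze: Callable[[bytes], str]) -> str:
    """Return the stored result for identical content, else compute and store it."""
    key = content_key(pdf_bytes)
    if key in store:
        return store[key]  # cache hit: no model calls are made
    result = analyze(pdf_bytes)
    store[key] = result
    return result

# Usage: the second call with identical bytes never reaches the analyzer
store: dict = {}
analyzer_calls = []

def fake_analyze(data: bytes) -> str:
    analyzer_calls.append(len(data))
    return f"{len(data)} bytes analyzed"

first = cached_analysis(b"%PDF-1.7 example", store, fake_analyze)
second = cached_analysis(b"%PDF-1.7 example", store, fake_analyze)
```

Hashing content rather than the filename is what makes a repeated upload of the same document cost nothing, whatever it is called on disk.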
MADUS adapts the architecture of MDocAgent (2025) into a production system. The orchestrator's dynamic routing relies on the ReAct (2022) paradigm, while the Critic agent implements the Reflexion (2023) pattern for verbal self-correction. Text retrieval fuses keyword and semantic rankings via Reciprocal Rank Fusion (2009) over vector indices powered by HNSW (2018). Visual processing concepts are inspired by ColPali (2024).
