karimluna/madus

Madus

Multi-Modal Agentic Document Understanding System

License: MIT Python

MADUS is a production-oriented multi-agent framework for document question answering over complex PDFs containing text, figures, and tables. It adapts the research architecture of MDocAgent (Han et al., 2025), adds a hierarchical orchestrator, and packages the result as a scalable, deployable system built on LangGraph, FastAPI, and Docker.

Architecture

MADUS implements Hierarchical Orchestration over five specialized agents. A single General Orchestrator reasons over the full task using the ReAct loop (Yao et al., 2022), decomposes it into modality-specific subtasks, and dispatches them in parallel. Results converge at a Critic agent that implements the Reflexion pattern (Shinn et al., 2023) before a final Summarizer produces the answer.

Architecture diagram

Why Hierarchical Orchestration

In a flat multi-agent system every agent communicates with every other agent, producing $O(n^2)$ message complexity and fragile coordination. The hierarchical pattern reduces this to $O(n)$: the orchestrator is the single coordination point, agents are stateless workers, and the shared LangGraph state is the only communication channel. Adding a new modality agent means writing one node and one edge, not rethinking the coordination protocol.
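To make the complexity claim concrete, the short sketch below counts communication channels in each topology. It assumes one bidirectional channel per pair of agents in the flat case and one channel per worker in the hierarchical case; the function names are illustrative and not part of MADUS.

```python
def flat_channels(n_agents: int) -> int:
    """Channels when every agent talks to every other agent: n(n-1)/2."""
    return n_agents * (n_agents - 1) // 2


def hierarchical_channels(n_workers: int) -> int:
    """Channels when each worker talks only to the orchestrator: n."""
    return n_workers


# Compare how the two topologies grow as agents are added.
for n in (3, 5, 10, 20):
    print(f"{n:>2} agents: flat={flat_channels(n):>3}  hierarchical={hierarchical_channels(n):>2}")
```

Under these assumptions, going from 5 to 10 agents adds 35 channels in the flat layout and 5 in the hierarchical one.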

The orchestrator is not a static router. It implements ReAct, meaning its routing decision at step $t$ conditions on the full trajectory:

$$c_t = \bigl(q,\; \tau_1, a_1, o_1,\; \tau_2, a_2, o_2,\; \ldots,\; \tau_t\bigr)$$

On a retry triggered by the critic, $c_t$ includes the verbal feedback $v_k$, so the orchestrator can selectively re-dispatch only the agent that failed rather than all three.
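One way to picture the trajectory and the selective re-dispatch is the sketch below. It reads $\tau_i$, $a_i$, $o_i$ as thought, action, and observation, following the usual ReAct convention. The `Step`, `Trajectory`, and `agents_to_retry` names, and the per-agent pass/fail verdict format, are hypothetical and not taken from the MADUS code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    thought: str       # tau_i: the orchestrator's reasoning at step i
    action: str        # a_i: which agent(s) were dispatched
    observation: str   # o_i: what the agents returned


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)

    def context(self, next_thought: str, feedback: Optional[str] = None) -> str:
        """Serialize c_t = (q, tau_1, a_1, o_1, ..., tau_t), plus critic feedback if any."""
        lines = [f"Question: {self.question}"]
        for i, s in enumerate(self.steps, start=1):
            lines.append(f"Thought {i}: {s.thought}")
            lines.append(f"Action {i}: {s.action}")
            lines.append(f"Observation {i}: {s.observation}")
        if feedback is not None:
            lines.append(f"Critic feedback: {feedback}")
        lines.append(f"Thought {len(self.steps) + 1}: {next_thought}")
        return "\n".join(lines)


def agents_to_retry(verdicts: Dict[str, bool]) -> List[str]:
    """Hypothetical helper: keep only the agents the critic marked as failing."""
    return [name for name, ok in verdicts.items() if not ok]


traj = Trajectory("What does Figure 3 show?")
traj.steps.append(Step("Needs the figure and its caption",
                       "dispatch(image, text)",
                       "image: axis values missing; text: ok"))
print(traj.context("Retry the image agent only",
                   feedback="Image answer did not cite axis values"))
print(agents_to_retry({"text": True, "image": False, "table": True}))
```

Because the feedback is part of the serialized context, the next routing decision can see which modality fell short without any extra state.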

Reflexion as a feedback gate

The Critic implements verbal reinforcement: rather than updating model weights, it produces a natural-language reflection $v_k$ that conditions the next attempt. For agent output $y_k$ at attempt $k$:

$$v_k = \text{Critic}(q,\; y_k^{\text{text}},\; y_k^{\text{image}},\; y_k^{\text{table}})$$

$$y_{k+1} = \text{Agent}(q,\; v_k,\; \text{context})$$

The retry loop is bounded at two retries, for three attempts in total. If the answer is still insufficient after the third attempt, the Summarizer surfaces the critique alongside the best available answer.
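The bounded loop can be sketched as follows. This is a simplified illustration, not the MADUS graph: it collapses the three modality outputs into a single answer string, returns the last answer rather than selecting a best one, and uses stub callables in place of LLM-backed agents. All names here are assumptions.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 2  # two retries after the first attempt, three attempts in total


def answer_with_reflexion(
    question: str,
    agent: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Tuple[bool, str]],
) -> Tuple[str, Optional[str], int]:
    """Return (answer, unresolved_critique, attempts_used)."""
    feedback: Optional[str] = None
    answer = ""
    for attempt in range(1, MAX_RETRIES + 2):
        answer = agent(question, feedback)             # y_k, conditioned on v_{k-1}
        accepted, feedback = critic(question, answer)  # v_k
        if accepted:
            return answer, None, attempt
    # Budget exhausted: hand back the last answer together with the critique.
    return answer, feedback, MAX_RETRIES + 1


def stub_agent(question: str, feedback: Optional[str]) -> str:
    # Improves only once it has received feedback.
    return "42" if feedback else "unsure"


def stub_critic(question: str, answer: str) -> Tuple[bool, str]:
    return answer == "42", "State the number explicitly."


print(answer_with_reflexion("What is the answer?", stub_agent, stub_critic))
```

No weights change between attempts; the only thing carried forward is the critic's text, which is what makes the pattern cheap to bound.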

Retrieval

Text retrieval uses hybrid Reciprocal Rank Fusion over BM25 and dense semantic search:

$$\text{RRF}(d) = \sum_{r \in \{\text{BM25},\, \text{semantic}\}} \frac{1}{60 + \text{rank}_r(d)}$$

BM25 catches exact keyword matches that embeddings miss; semantic search catches paraphrase matches that BM25 misses. RRF fuses both ranked lists without requiring score calibration between rankers.
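The formula above is small enough to sketch directly. The version below assumes 1-based ranks and treats a document missing from one list as contributing nothing from that ranker, which is a common convention but not confirmed by the MADUS source. The function name and chunk IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["c1", "c2", "c3"]
semantic_ranking = ["c3", "c1", "c4"]

for doc_id, score in reciprocal_rank_fusion([bm25_ranking, semantic_ranking]):
    print(f"{doc_id}: {score:.5f}")
```

Here `c1` and `c3` appear in both lists and rise above `c2` and `c4`, even though neither ranker's raw scores were ever compared.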

API cost estimate (OpenAI backend)

The figures below estimate the cost of one cold run on a typical 20-page academic PDF with a single question. A cache hit on a repeated document costs $0.00.

| Agent | Model | Approx. tokens | Approx. cost |
|---|---|---|---|
| Orchestrator | gpt-4o-mini | 600 in / 100 out | $0.001 |
| Text agent | gpt-4o-mini | 2,000 in / 300 out | $0.003 |
| Image agent (2 images, low detail) | gpt-4o | 800 in / 300 out | $0.020 |
| Image agent (4 images, high detail) | gpt-4o | 3,000 in / 300 out | $0.060 |
| Table agent | gpt-4o-mini | 1,000 in / 300 out | $0.002 |
| Critic | gpt-4o-mini | 1,500 in / 200 out | $0.002 |
| Summarizer | gpt-4o-mini | 2,000 in / 400 out | $0.003 |
| Embeddings (index + query) | text-embedding-3-small | ~15,000 tokens | $0.001 |
| Single cold run, low detail | | | ~$0.032 |
| Single cold run, high detail | | | ~$0.072 |
| Retry triggered (one extra pass) | | | +$0.025 |

Image agent cost dominates everything else. Setting detail: "high" tiles each image into 512x512 crops at 170 tokens per tile, so four dense figures can exceed the combined cost of all other agents. Use detail: "low" (fixed 85 tokens per image) unless the question specifically requires reading fine-grained chart content.
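A rough estimator shows where the image tokens come from. The 85 and 170 figures are the ones quoted above. The resizing steps (fit within 2048x2048, then shrink so the short side is at most 768 px) and the 85-token base added in high-detail mode follow OpenAI's published description for gpt-4o as best understood here, so treat them as assumptions and check the current vision pricing documentation before relying on the numbers.

```python
import math

BASE_TOKENS = 85        # flat cost, and the whole cost in low-detail mode
TOKENS_PER_TILE = 170   # cost of each 512x512 tile in high-detail mode


def image_tokens(width: int, height: int, detail: str = "low") -> int:
    """Estimate input tokens for one image (assumed resizing rules, see lead-in)."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1 (assumed): downscale to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2 (assumed): downscale so the shorter side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


figures = [(1024, 1024)] * 4  # four square figures, sizes chosen for illustration
print("low :", sum(image_tokens(w, h, "low") for w, h in figures))
print("high:", sum(image_tokens(w, h, "high") for w, h in figures))
```

For four 1024x1024 figures this gives 340 tokens in low detail and 3,060 in high detail, in line with the roughly 3,000 input tokens listed for the high-detail image agent.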

To run at zero API cost, set these two environment variables:

LLM_BACKEND=local        # routes all agents through Ollama
EMBEDDING_BACKEND=local  # uses BAAI/bge-small-en-v1.5 via transformers, no API key needed

Local mode requires one text-generation model and one vision-language model pulled into Ollama, for example ollama pull llama3.2 (or qwen2.5:1.5b) for text and ollama pull qwen2-vl:7b (or moondream) for images. Any other model of the same type that you already have loaded also works; set the model names in core/config.py. Tested on an RTX 4060 8GB (each model fits in VRAM, and Ollama loads them sequentially).
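Env-based switching of this kind can be sketched in a few lines. The snippet is not the contents of core/config.py: the `Backends` class, the `resolve_backends` function, and the "openai" default and value are assumptions made for illustration. Only the two variable names and the "local" value come from the settings above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

ALLOWED = {"openai", "local"}  # "openai" is an assumed name for the hosted backend


@dataclass(frozen=True)
class Backends:
    llm: str
    embedding: str


def resolve_backends(env: Mapping[str, str] = os.environ) -> Backends:
    """Read LLM_BACKEND and EMBEDDING_BACKEND, falling back to an assumed default."""
    llm = env.get("LLM_BACKEND", "openai").strip().lower()
    embedding = env.get("EMBEDDING_BACKEND", "openai").strip().lower()
    for name, value in (("LLM_BACKEND", llm), ("EMBEDDING_BACKEND", embedding)):
        if value not in ALLOWED:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED)}, got {value!r}")
    return Backends(llm=llm, embedding=embedding)


print(resolve_backends({"LLM_BACKEND": "local", "EMBEDDING_BACKEND": "local"}))
```

Validating the values at startup means a typo in .env fails immediately instead of silently routing requests to a paid API.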

Quickstart

git clone https://github.com/<youruser>/madus && cd madus

cp .env.example .env          # add OPENAI_API_KEY or set LLM_BACKEND=local

docker compose up -d          # starts Redis, ChromaDB, n8n, API

curl -X POST http://localhost:8000/api/analyze \
  -F "file=@your_document.pdf" \
  -F "question=What is the main finding?"

Tip

For the n8n integration, see the n8n workflow.

Project structure

madus/
├── services/
│   ├── api/              FastAPI routes
│   ├── reasoning/        LangGraph graph, nodes, tools
│   └── extraction/       OCR, layout detection, table parsing
├── core/
│   ├── models.py         DocumentState schema (system contract)
│   ├── embeddings.py     OpenAI and local embedding backends
│   ├── cache.py          Redis SHA-256 content cache
│   └── config.py         LLM factory, env-based backend switching
├── configs/prompts/      Versioned prompt templates
├── tests/
│   ├── unit/             Extraction tests, no LLM
│   └── integration/      Full graph tests on real PDFs
└── docker-compose.yml
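The SHA-256 content cache listed under core/ can be illustrated with a small sketch. It is not the code in cache.py: the key layout, the choice to hash the question together with the PDF bytes, and the in-memory stand-in for Redis are all assumptions made so the example runs without a server.

```python
import hashlib
from typing import Dict, Optional


def cache_key(pdf_bytes: bytes, question: str) -> str:
    """Derive a key from the file content, so renaming the file does not matter."""
    digest = hashlib.sha256()
    digest.update(pdf_bytes)
    digest.update(b"\x00")  # separator so content and question cannot run together
    digest.update(question.encode("utf-8"))
    return "madus:" + digest.hexdigest()


class InMemoryCache:
    """Stand-in for a Redis client, exposing only get and set."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


cache = InMemoryCache()
key = cache_key(b"%PDF-1.7 example bytes", "What is the main finding?")
if cache.get(key) is None:
    cache.set(key, "answer computed by the agents")  # cold run: pay for the LLM calls
print(cache.get(key))  # warm run: served from the cache
```

Hashing content rather than filenames is what lets a repeated upload of the same document skip the agent pipeline entirely.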

Acknowledgements

MADUS adapts the architecture of MDocAgent (2025) into a production system. The orchestrator's dynamic routing relies on the ReAct (2022) paradigm, while the Critic agent implements the Reflexion (2023) pattern for verbal self-correction. Text retrieval fuses keyword and semantic rankings via Reciprocal Rank Fusion (2009) over vector indices powered by HNSW (2018). Visual processing concepts are inspired by ColPali (2024).
