
🧠 Advanced Deep Research

Advanced Deep Research is an autonomous multi-agent research framework designed to simulate a human-level deep researcher. It breaks down complex queries into actionable sub-questions, performs real-time searches across multiple sources (web, papers, and local vector DB), and synthesizes the most relevant information into clear, didactic summaries.


🚀 Features

  • 🔍 Sub-question generation using a local LLM (Qwen 2.5)
  • 🌐 Web search via Brave Search, Google, or SerpAPI
  • 📄 Advanced content extraction from HTML and PDFs (with pymupdf4llm)
  • ✍️ Chunked summarization using facebook/bart-large-cnn (fine-tuned)
  • 🎯 Relevance filtering via jina-reranker-v2-base-multilingual (threshold: 0.5)
  • 🗂 Knowledge storage in a local vector DB (Qdrant)
  • 🤖 Reflective agent to determine when to stop searching
  • 📘 Final summarizer agent for clear, didactic answers
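The chunk-then-filter step above can be sketched in a few lines. This is an illustrative sketch, not the project's actual code: the real pipeline summarizes with facebook/bart-large-cnn and scores with the Jina re-ranker, which are stubbed out here as pre-scored chunks.

```python
# Sketch of chunked summarization + threshold filtering (illustrative helpers;
# the real project uses facebook/bart-large-cnn and jina-reranker-v2).

def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split text into word-bounded chunks that fit the summarizer's context."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def filter_by_relevance(scored_chunks: list[tuple[str, float]], threshold: float = 0.5) -> list[str]:
    """Keep only chunks whose re-ranker score clears the threshold (0.5 by default)."""
    return [chunk for chunk, score in scored_chunks if score >= threshold]

# Example: three scored chunks, two of which pass the 0.5 cutoff.
scored = [("relevant A", 0.91), ("noise", 0.12), ("relevant B", 0.55)]
print(filter_by_relevance(scored))  # ['relevant A', 'relevant B']
```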

⚙️ Tech Stack

| Component | Technology / Model |
| --- | --- |
| LLM (main) | Qwen 2.5 via vLLM (OpenAI-compatible API) |
| Embeddings | jinaai/jina-embeddings-v3 |
| Summarization | facebook/bart-large-cnn |
| Re-ranker | jinaai/jina-reranker-v2-base-multilingual |
| Vector storage | Qdrant |
| PDF parsing | pymupdf4llm |
| Web search | Brave API, Google (local), SerpAPI, Tavily |
| Backend | FastAPI + Transformers (Hugging Face) |
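Because vLLM exposes an OpenAI-compatible API, a chat request to the main LLM is just a JSON POST to `/v1/chat/completions`. A minimal sketch of building that request body; the model name and system prompt here are assumptions, not the project's actual values:

```python
# Sketch: request body for vLLM's OpenAI-compatible chat endpoint.
# Model name and prompt are assumptions; adjust to your deployment.
import json

def build_chat_request(question: str, model: str = "Qwen/Qwen2.5-7B-Instruct") -> dict:
    """Build the JSON body for POST {base_url}/v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Break the question into research sub-questions."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.3,
    }

payload = build_chat_request("How do transformers handle long context?")
print(json.dumps(payload, indent=2))
```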

📂 Project Structure

resumidor/
├── cache/                   # Caching utilities
├── config/                  # Configuration and environment handling
├── databases/               # DB integrations (e.g., Qdrant)
├── deep_searcher/           # Core loop for deep search
├── dockers/                 # Docker configurations
├── factory/                 # Model and service instantiation
├── llm/                     # LLM interaction logic (Qwen, etc.) and tools
├── management/              # Process managers / controllers
├── models/                  # Model loading and handling
├── parsers/                 # Web & PDF content parsers
├── prompt_engineering/      # Prompt templates
├── researchers/             # Research engines
├── schemas/                 # Pydantic schemas
├── server/                  # FastAPI server logic
├── tests/                   # Unit and integration tests

🧰 Installation

1. Clone the repository

git clone https://github.com/prodesk98/advanced-deep-research.git
cd advanced-deep-research

2. Set up environment variables

Copy .env.example to .env and set your keys:

cp .env.example .env

Fill in your credentials:

GOOGLE_SEARCH_ENGINE=local,brave,serpapi
CRAWLER_ENGINE=local,firecrawl
BRAVE_API_KEY=your_key
SERPAPI_KEY=your_key
FIRECRAWL_API_KEY=your_key
HF_TOKEN=your_huggingface_token
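`GOOGLE_SEARCH_ENGINE` and `CRAWLER_ENGINE` take comma-separated lists, so the app can fall back across providers. A minimal sketch of how such a value can be parsed; the variable names match `.env.example`, but the helper itself is illustrative, not the project's config loader:

```python
# Sketch: reading the comma-separated engine settings shown above.
# The parsing helper is illustrative, not the project's actual config code.
import os

def parse_engines(value: str) -> list[str]:
    """Split 'local,brave,serpapi' into a clean list, ignoring blanks."""
    return [e.strip() for e in value.split(",") if e.strip()]

search_engines = parse_engines(os.environ.get("GOOGLE_SEARCH_ENGINE", "local,brave,serpapi"))
print(search_engines)  # e.g. ['local', 'brave', 'serpapi']
```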

3. Install dependencies

Using Poetry:

pip install poetry
poetry install

4. Download models

poetry run python download_cli.py

🐳 Docker Deployment

docker compose up -d

App runs at: http://localhost:8501


🧠 Research Pipeline (Simplified)

```mermaid
graph TD
    UI[User Interface] --> Q[User Question] --> SQ[Sub-questions]
    SQ --> WS[Search: Brave / Google / ArXiv]
    WS --> XT[Extract Content]
    XT --> SM[Summarize]
    SM --> RK[Re-rank Relevant Info]
    RK --> RF[Reflect: Is Answer Complete?]
    RF -- No --> SQ
    RF -- Yes --> DS[Didactic Final Summary]
    DS --> DB[Store in Vector DB]
```
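The reflect-and-repeat loop in the diagram can be sketched as plain Python. All function names below are illustrative stubs, not the project's actual API; they only show the control flow between the search, summarize, re-rank, and reflect stages:

```python
# Sketch of the reflect-and-repeat loop from the diagram, with stub agents.
# Function names are illustrative, not the project's actual API.

def research(question: str, max_rounds: int = 3) -> str:
    notes: list[str] = []
    sub_questions = [question]                       # SQ: sub-question generation
    for _ in range(max_rounds):
        for sq in sub_questions:
            results = search(sq)                     # WS: web / paper search
            notes.extend(rerank(summarize(results))) # SM + RK
        if is_answer_complete(question, notes):      # RF: reflective agent
            break
        sub_questions = refine(question, notes)      # loop back to SQ
    return final_summary(question, notes)            # DS: didactic final summary

# Stub implementations so the sketch runs end to end.
def search(q): return [f"result for {q}"]
def summarize(rs): return rs
def rerank(rs): return rs
def is_answer_complete(q, notes): return len(notes) >= 2
def refine(q, notes): return [q + " (refined)"]
def final_summary(q, notes): return f"{q}: " + "; ".join(notes)

print(research("What is retrieval-augmented generation?"))
```

In the real pipeline, the reflective agent decides when to stop, and the accumulated notes are stored in Qdrant after the final summary.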

📌 Roadmap

  • Sub-questioning + multi-source search
  • ArXiv PDF extraction
  • Chunked summarization with BART
  • Reranker filtering (threshold-based)
  • Reflective agent for iterative research
  • Final summarizer for clarity
  • CLI / Web Interface
  • Export to Markdown / PDF
  • Chrome/Firefox extension for contextual search

📜 License

MIT License


🤝 Contributing

Open issues, submit pull requests, or suggest improvements!
All contributions are welcome.
