RAG-Crawler: Fast, Grounded Web Content Q&A

Overview

A practical Retrieval-Augmented Generation (RAG) system: given a website URL, crawls in-domain pages, indexes content, and answers questions only from crawled context, with citations and robust refusal support. Built for accuracy, speed, and clarity.


Setup & Quick Start

  1. Requirements

    • Python 3.9+
    • PostgreSQL with pgvector (CREATE EXTENSION IF NOT EXISTS vector;)
    • Mistral-7B-Instruct (or MiniMistral/Phi-2 as backup)
    • pip: pip install -r requirements.txt
  2. DB Init

    • psql <db_params> -f db_schema.sql
  3. Run API

    • uvicorn src.api:app --reload
    • FastAPI docs: http://127.0.0.1:8000/docs

Architecture & Tradeoffs

  1. Crawl: BFS, obeys robots.txt & crawl-delay, fetches main text only, in-domain.
  2. Politeness: Crawl delay set per robots.txt; defaults if absent; avoids overloading hosts.
  3. Chunking: 800 characters with 100-character overlap; balances retrieval granularity against LLM prompt size (rationale documented in code and README).
  4. Embeddings: open-source MiniLM (sentence-transformers); near state-of-the-art chunk retrieval, fast and hardware-light.
  5. Vector DB: PostgreSQL with pgvector; scalable, familiar, simple to deploy.
  6. Ask: query is embedded, top-k chunks retrieved by cosine similarity, context passed to the LLM with citations.
  7. Prompt Guardrails: LLM forced to answer from context only; clear refusal phrasing.
  8. Failover: If Mistral-7B exceeds the latency threshold, the backup LLM (MiniMistral/Phi-2) engages automatically; the model used is always logged.
  9. Observability: Logs timings, model picks, refusals, per-query stats; can report p50/p95 latencies from queries table.
  10. Source Trace: Snippet + URL for every answer; transparent refusal returns closest context.
  11. Simple API: FastAPI, no custom UI; leverages OpenAPI docs for interactive calls.
  12. CLI Example: All API endpoints usable via curl or CLI runner (see below).
  13. Reproducibility: Everything open source; all models, embeddings, and prompt logic listed here.
  14. Evaluation: Includes 2 example asks (1 grounded, 1 refusal); metrics easy to export for further eval.
  15. Limitations: HTML only (no JavaScript rendering); 50-page crawl cap; accepts 1 or 2 primary URLs (see /crawl param).
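The chunking scheme from bullet 3 can be sketched in a few lines. This is a minimal illustration of 800-character chunks with a 100-character overlap; the project's actual implementation lives in the source tree and may differ in details.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with a sliding overlap.

    Each chunk repeats the last `overlap` characters of its predecessor,
    so a sentence cut at one boundary still appears whole in a neighbor.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

The overlap trades a little index size for robustness: facts straddling a chunk boundary remain retrievable from at least one chunk.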

Tooling & Prompts Disclosure

  • Embeddings: sentence-transformers/all-MiniLM-L6-v2
  • LLM: Mistral-7B-Instruct (local API or vLLM/LM Studio); backup: MiniMistral, Phi-2
  • Prompt: See src/llm.py, strictly context-limited, refusal logic enforced.
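Retrieval ranks stored chunk embeddings by cosine similarity to the query embedding. In deployment pgvector does this inside SQL; the pure-Python equivalent below is for illustration only, and the helper names are hypothetical rather than taken from the codebase.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```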

API Usage Examples

1. Crawl

curl -X POST http://localhost:8000/crawl -H 'Content-Type: application/json' -d '{"start_url": "https://docs.python.org/3/", "max_pages": 10}'
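The politeness behavior described in the architecture section (obey robots.txt crawl-delay, fall back to a default when absent) can be read straight from robots.txt with Python's standard library. A sketch, assuming a 1-second default delay and an illustrative user-agent string; the crawler's real values may differ.

```python
from urllib import robotparser

def crawl_delay_for(robots_txt: str, agent: str = "rag-crawler",
                    default: float = 1.0) -> float:
    """Return the Crawl-delay declared for `agent`, or `default` if none.

    `agent` and `default` are illustrative, not the project's actual
    user-agent string or configured fallback.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(agent)
    return float(delay) if delay is not None else default
```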

2. Index

curl -X POST http://localhost:8000/index -H 'Content-Type: application/json' -d '{}'

3. Ask - answerable

curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "How do you define a function in Python?", "top_k": 3}'

Sample Response:

{
  "answer": "In Python, you define a function using the 'def' keyword...",
  "sources": [
    { "url": "https://docs.python.org/3/tutorial/controlflow.html", "snippet": "Functions are defined using the def keyword..." }
  ],
  "timings": { "retrieval_ms": 35, "generation_ms": 900, "total_ms": 950 },
  "model_used": "mistral-7b",
  "refused": false
}

4. Ask - refusal

curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "What is the airspeed velocity of an unladen swallow?", "top_k": 3}'

Sample Response:

{
  "answer": "Not found in crawled content. Closest context: ...",
  "sources": [ ... ],
  "timings": { "retrieval_ms": 31, "generation_ms": 420, "total_ms": 463 },
  "model_used": "mistral-7b",
  "refused": true
}

(model_used reports the backup model if failover occurred.)
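A caller can branch on the `refused` flag to decide whether to show citations. A minimal sketch of consuming the response shape above; the formatting choices here are illustrative, not part of the API.

```python
def format_answer(payload: dict) -> str:
    """Render an /ask response: cite source URLs for grounded answers,
    pass refusal text through unchanged."""
    if payload.get("refused"):
        return payload["answer"]
    cites = "; ".join(src["url"] for src in payload.get("sources", []))
    return f"{payload['answer']} [sources: {cites}]"
```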

Metrics Walk-through

  • Run SELECT percentile_cont(ARRAY[0.5,0.95]) WITHIN GROUP (ORDER BY total_ms) FROM queries; to get p50/p95.
  • Each query logs answer type (grounded/refusal), model used, and timings.
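The same p50/p95 figures can be computed application-side from logged timings, using the linear-interpolation rule that PostgreSQL's percentile_cont implements:

```python
def percentile(values: list[float], q: float) -> float:
    """Continuous (linearly interpolated) percentile, matching
    PostgreSQL's percentile_cont semantics."""
    if not values:
        raise ValueError("empty input")
    xs = sorted(values)
    pos = q * (len(xs) - 1)       # fractional rank into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)
```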

Maintainer/Evaluator Notes

  • No proprietary code. All OSS, all prompt logic is auditable.
  • Attributions: code includes explicit references in model/tool usage sections.
