A practical Retrieval-Augmented Generation (RAG) system: given a website URL, crawls in-domain pages, indexes content, and answers questions only from crawled context, with citations and robust refusal support. Built for accuracy, speed, and clarity.
## Requirements
- Python 3.9+
- PostgreSQL with pgvector (`CREATE EXTENSION IF NOT EXISTS vector;`)
- Mistral-7B-Instruct (or MiniMistral/Phi-2 as backup)
- pip: `pip install -r requirements.txt`
## DB Init

```bash
psql <db_params> -f db_schema.sql
```
## Run API

```bash
uvicorn src.api:app --reload
```

- FastAPI docs: http://127.0.0.1:8000/docs
- Crawl: BFS over in-domain links; obeys robots.txt and crawl-delay; fetches main page text only (see the crawler sketch below).
- Politeness: crawl delay is taken from robots.txt, with a sensible default when absent; avoids overloading hosts.
- Chunking: 800 characters with 100-character overlap, balancing retrieval granularity against LLM prompt size (rationale documented in code and README; sketch below).
- Embeddings: open-source MiniLM (sentence-transformers) for near state-of-the-art chunk retrieval; fast and light on hardware.
- Vector DB: PostgreSQL with pgvector: scalable, familiar, and simple to deploy.
- Ask: the query is embedded, the top-k chunks are retrieved by cosine similarity, and the context is passed to the LLM with citations (retrieval sketch below).
- Prompt Guardrails: the LLM is forced to answer from the retrieved context only, with clear refusal phrasing (prompt sketch below).
- Failover: if Mistral-7B exceeds the latency threshold, the backup LLM (MiniMistral/Phi-2) engages automatically; the model used is always logged (failover sketch below).
- Observability: logs timings, model choice, refusals, and per-query stats; p50/p95 latencies can be reported from the queries table.
- Source Trace: every answer carries a snippet and URL; a refusal transparently returns the closest context.
- Simple API: FastAPI with no custom UI; the OpenAPI docs support interactive calls.
- CLI Example: all API endpoints are usable via curl or the CLI runner (see below).
- Reproducibility: everything is open source; all models, embeddings, and prompt logic are listed here.
- Evaluation: includes two example questions (one grounded, one refusal); metrics are easy to export for further evaluation.
- Limitations: pure HTML only (no JavaScript rendering); 50-page crawl cap; accepts one or two primary URLs (see the /crawl parameters).
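A minimal sketch of the polite BFS crawl described above, assuming `requests` and `beautifulsoup4`; function and variable names are illustrative, not the repository's actual API:

```python
import time
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DEFAULT_DELAY = 1.0  # seconds; used when robots.txt sets no crawl-delay

def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """BFS over in-domain links, honoring robots.txt and crawl-delay."""
    parts = urlparse(start_url)
    robots = robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    delay = robots.crawl_delay("*") or DEFAULT_DELAY

    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # respect disallow rules
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep the main text only; drop scripts, styles, and page chrome.
        for tag in soup(["script", "style", "nav", "footer", "header"]):
            tag.decompose()
        pages[url] = soup.get_text(" ", strip=True)
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"]).split("#")[0]
            if urlparse(nxt).netloc == parts.netloc and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
        time.sleep(delay)  # politeness: per robots.txt, or the default
    return pages
```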
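The chunker itself is small; a minimal sketch of the 800/100 character scheme (the repository's actual implementation may differ):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows with overlap, so a sentence cut
    at one boundary still appears intact in the adjacent chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```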
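A sketch of the top-k retrieval step, assuming `psycopg2` and a `chunks(url, content, embedding vector(384))` table (the table and column names are assumptions, not necessarily those in `db_schema.sql`):

```python
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, top_k: int = 3) -> list[tuple[str, str]]:
    """Embed the question, then fetch the top-k chunks by cosine distance."""
    vec = "[" + ",".join(str(x) for x in model.encode(question)) + "]"
    with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT url, content
            FROM chunks
            ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
            LIMIT %s
            """,
            (vec, top_k),
        )
        return cur.fetchall()
```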
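The guardrail lives in the prompt itself; the exact wording is in `src/llm.py`, but a template of this shape illustrates the idea (the phrasing below is illustrative):

```python
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"Not found in crawled content."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # Cite each chunk by its source URL so the answer can be traced.
    context = "\n\n".join(f"[{url}]\n{content}" for url, content in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```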
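A sketch of the latency-based failover; `call_mistral` and `call_backup` are hypothetical helpers, and the threshold value is illustrative:

```python
import logging
import time

logger = logging.getLogger("rag")
LATENCY_THRESHOLD_S = 5.0  # illustrative cutoff, not the repo's actual value

def generate_with_failover(prompt: str) -> tuple[str, str]:
    """Try the primary model; fall back to the backup on timeout or error.
    Always log which model produced the answer."""
    start = time.monotonic()
    try:
        # call_mistral: hypothetical helper wrapping the local Mistral API.
        answer, model = call_mistral(prompt, timeout=LATENCY_THRESHOLD_S), "mistral-7b"
    except Exception:
        # call_backup: hypothetical helper for MiniMistral/Phi-2.
        answer, model = call_backup(prompt), "phi-2"
    logger.info("model=%s latency_ms=%.0f", model, (time.monotonic() - start) * 1000)
    return answer, model
```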
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`
- LLM: `Mistral-7B-Instruct` (local API or vLLM/LM Studio); backups: `MiniMistral`, `Phi-2`
- Prompt: see `src/llm.py`; strictly context-limited, with refusal logic enforced.
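For reference, embedding with the standard sentence-transformers API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(["Functions are defined using the def keyword..."])
print(vecs.shape)  # (1, 384): all-MiniLM-L6-v2 produces 384-dim embeddings
```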
```bash
curl -X POST http://localhost:8000/crawl -H 'Content-Type: application/json' -d '{"start_url": "https://docs.python.org/3/", "max_pages": 10}'
curl -X POST http://localhost:8000/index -H 'Content-Type: application/json' -d '{}'
curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "How do you define a function in Python?", "top_k": 3}'
```
```json
{
  "answer": "In Python, you define a function using the 'def' keyword...",
  "sources": [
    { "url": "https://docs.python.org/3/tutorial/controlflow.html", "snippet": "Functions are defined using the def keyword..." }
  ],
  "timings": { "retrieval_ms": 35, "generation_ms": 900, "total_ms": 950 },
  "model_used": "mistral-7b",
  "refused": false
}
```
```bash
curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "What is the airspeed velocity of an unladen swallow?", "top_k": 3}'
```
```jsonc
{
  "answer": "Not found in crawled content. Closest context: ...",
  "sources": [ ... ],
  "timings": { "retrieval_ms": 31, "generation_ms": 420, "total_ms": 463 },
  "model_used": "mistral-7b", // or the backup if failover occurred
  "refused": true
}
```
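The same endpoints can be called from Python; a minimal client using `requests`:

```python
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "How do you define a function in Python?", "top_k": 3},
    timeout=60,
)
data = resp.json()
print(data["answer"])
for src in data["sources"]:
    print("-", src["url"])
```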
- Run `SELECT percentile_cont(ARRAY[0.5, 0.95]) WITHIN GROUP (ORDER BY total_ms) FROM queries;` to get p50/p95 latencies.
- Each query logs its answer type (grounded/refusal), model used, and timings (see the logging sketch below).
- No proprietary code: everything is open source, and all prompt logic is auditable.
- Attributions: the code includes explicit references in its model/tool usage sections.
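A sketch of the per-query logging that feeds the queries table; the column names are assumptions based on the response schema above, not necessarily those in `db_schema.sql`:

```python
import psycopg2

def log_query(question: str, model_used: str, refused: bool,
              retrieval_ms: int, generation_ms: int) -> None:
    """Record per-query stats for the p50/p95 report above."""
    with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO queries (question, model_used, refused,"
            " retrieval_ms, generation_ms, total_ms)"
            " VALUES (%s, %s, %s, %s, %s, %s)",
            (question, model_used, refused, retrieval_ms, generation_ms,
             retrieval_ms + generation_ms),
        )
```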