A practical Retrieval-Augmented Generation (RAG) system: given a website URL, crawls in-domain pages, indexes content, and answers questions only from crawled context, with citations and robust refusal support. Built for accuracy, speed, and clarity.
## Requirements
- Python 3.9+
- PostgreSQL with pgvector (`CREATE EXTENSION IF NOT EXISTS vector;`)
- Mistral-7B-Instruct (or MiniMistral/Phi-2 as backup)
- pip: `pip install -r requirements.txt`
## DB Init

```bash
psql <db_params> -f db_schema.sql
```
## Run API

```bash
uvicorn src.api:app --reload
```

- FastAPI docs: http://127.0.0.1:8000/docs
- Crawl: BFS over in-domain links; obeys robots.txt and crawl-delay; fetches main page text only (see the crawler sketch below).
- Politeness: crawl delay is taken from robots.txt, with a sensible default when absent; avoids overloading hosts.
- Chunking: 800 characters with 100-character overlap, balancing retrieval granularity against LLM prompt size (rationale documented in code and README; sketch below).
- Embeddings: open-source MiniLM (sentence-transformers) for near state-of-the-art chunk retrieval; fast and light on hardware.
- Vector DB: PostgreSQL with pgvector: scalable, familiar, and simple to deploy.
- Ask: the query is embedded, the top-k chunks are retrieved by cosine similarity, and the context is passed to the LLM with citations (retrieval sketch below).
- Prompt Guardrails: the LLM is forced to answer from the retrieved context only, with clear refusal phrasing (prompt sketch below).
- Failover: if Mistral-7B exceeds the latency threshold, the backup LLM (MiniMistral/Phi-2) engages automatically; the model used is always logged (failover sketch below).
- Observability: logs timings, model choice, refusals, and per-query stats; p50/p95 latencies can be reported from the queries table.
- Source Trace: every answer carries a snippet and URL; a refusal transparently returns the closest context.
- Simple API: FastAPI with no custom UI; the OpenAPI docs support interactive calls.
- CLI Example: all API endpoints are usable via curl or the CLI runner (see below).
- Reproducibility: everything is open source; all models, embeddings, and prompt logic are listed here.
- Evaluation: includes two example questions (one grounded, one refusal); metrics are easy to export for further evaluation.
- Limitations: pure HTML only (no JavaScript rendering); 50-page crawl cap; accepts one or two primary URLs (see the /crawl parameters).
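A minimal sketch of the polite BFS crawl described above, assuming `requests` and `beautifulsoup4`; function and variable names are illustrative, not the repository's actual API:

```python
import time
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DEFAULT_DELAY = 1.0  # seconds; used when robots.txt sets no crawl-delay

def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """BFS over in-domain links, honoring robots.txt and crawl-delay."""
    parts = urlparse(start_url)
    robots = robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    delay = robots.crawl_delay("*") or DEFAULT_DELAY

    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # respect disallow rules
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep the main text only; drop scripts, styles, and page chrome.
        for tag in soup(["script", "style", "nav", "footer", "header"]):
            tag.decompose()
        pages[url] = soup.get_text(" ", strip=True)
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"]).split("#")[0]
            if urlparse(nxt).netloc == parts.netloc and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
        time.sleep(delay)  # politeness: per robots.txt, or the default
    return pages
```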
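The chunker itself is small; a minimal sketch of the 800/100 character scheme (the repository's actual implementation may differ):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows with overlap, so a sentence cut
    at one boundary still appears intact in the adjacent chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```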
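A sketch of the top-k retrieval step, assuming `psycopg2` and a `chunks(url, content, embedding vector(384))` table (the table and column names are assumptions, not necessarily those in `db_schema.sql`):

```python
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, top_k: int = 3) -> list[tuple[str, str]]:
    """Embed the question, then fetch the top-k chunks by cosine distance."""
    vec = "[" + ",".join(str(x) for x in model.encode(question)) + "]"
    with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT url, content
            FROM chunks
            ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
            LIMIT %s
            """,
            (vec, top_k),
        )
        return cur.fetchall()
```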
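The guardrail lives in the prompt itself; the exact wording is in `src/llm.py`, but a template of this shape illustrates the idea (the phrasing below is illustrative):

```python
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"Not found in crawled content."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # Cite each chunk by its source URL so the answer can be traced.
    context = "\n\n".join(f"[{url}]\n{content}" for url, content in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```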
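A sketch of the latency-based failover; `call_mistral` and `call_backup` are hypothetical helpers, and the threshold value is illustrative:

```python
import logging
import time

logger = logging.getLogger("rag")
LATENCY_THRESHOLD_S = 5.0  # illustrative cutoff, not the repo's actual value

def generate_with_failover(prompt: str) -> tuple[str, str]:
    """Try the primary model; fall back to the backup on timeout or error.
    Always log which model produced the answer."""
    start = time.monotonic()
    try:
        # call_mistral: hypothetical helper wrapping the local Mistral API.
        answer, model = call_mistral(prompt, timeout=LATENCY_THRESHOLD_S), "mistral-7b"
    except Exception:
        # call_backup: hypothetical helper for MiniMistral/Phi-2.
        answer, model = call_backup(prompt), "phi-2"
    logger.info("model=%s latency_ms=%.0f", model, (time.monotonic() - start) * 1000)
    return answer, model
```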
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`
- LLM: `Mistral-7B-Instruct` (local API or vLLM/LM Studio); backups: `MiniMistral`, `Phi-2`
- Prompt: see `src/llm.py`; strictly context-limited, with refusal logic enforced.
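For reference, embedding with the standard sentence-transformers API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(["Functions are defined using the def keyword..."])
print(vecs.shape)  # (1, 384): all-MiniLM-L6-v2 produces 384-dim embeddings
```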
```bash
curl -X POST http://localhost:8000/crawl -H 'Content-Type: application/json' -d '{"start_url": "https://docs.python.org/3/", "max_pages": 10}'
curl -X POST http://localhost:8000/index -H 'Content-Type: application/json' -d '{}'
curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "How do you define a function in Python?", "top_k": 3}'
```
```json
{
  "answer": "In Python, you define a function using the 'def' keyword...",
  "sources": [
    { "url": "https://docs.python.org/3/tutorial/controlflow.html", "snippet": "Functions are defined using the def keyword..." }
  ],
  "timings": { "retrieval_ms": 35, "generation_ms": 900, "total_ms": 950 },
  "model_used": "mistral-7b",
  "refused": false
}
```
```bash
curl -X POST http://localhost:8000/ask -H 'Content-Type: application/json' -d '{"question": "What is the airspeed velocity of an unladen swallow?", "top_k": 3}'
```
```jsonc
{
  "answer": "Not found in crawled content. Closest context: ...",
  "sources": [ ... ],
  "timings": { "retrieval_ms": 31, "generation_ms": 420, "total_ms": 463 },
  "model_used": "mistral-7b", // or the backup if failover occurred
  "refused": true
}
```
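The same endpoints can be called from Python; a minimal client using `requests`:

```python
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "How do you define a function in Python?", "top_k": 3},
    timeout=60,
)
data = resp.json()
print(data["answer"])
for src in data["sources"]:
    print("-", src["url"])
```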
- Run `SELECT percentile_cont(ARRAY[0.5, 0.95]) WITHIN GROUP (ORDER BY total_ms) FROM queries;` to get p50/p95 latencies.
- Each query logs its answer type (grounded/refusal), model used, and timings (see the logging sketch below).
- No proprietary code: everything is open source, and all prompt logic is auditable.
- Attributions: the code includes explicit references in its model/tool usage sections.
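A sketch of the per-query logging that feeds the queries table; the column names are assumptions based on the response schema above, not necessarily those in `db_schema.sql`:

```python
import psycopg2

def log_query(question: str, model_used: str, refused: bool,
              retrieval_ms: int, generation_ms: int) -> None:
    """Record per-query stats for the p50/p95 report above."""
    with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO queries (question, model_used, refused,"
            " retrieval_ms, generation_ms, total_ms)"
            " VALUES (%s, %s, %s, %s, %s, %s)",
            (question, model_used, refused, retrieval_ms, generation_ms,
             retrieval_ms + generation_ms),
        )
```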