See also: docs/REQUIREMENTS.md for a friendly overview, architecture diagram, and quick commands.
Architecture details: docs/ARCHITECTURE.md
A minimal, production-ready skeleton for a company policy QA system using:
- Chroma as the vector store
- Snowflake Arctic embeddings via Ollama (`snowflake-arctic-embed:335m`)
- A local LLM via Ollama for orchestration
- PII masking before any LLM calls (minimal sketch below)
- Guard + Orchestrator agents
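Masking happens before text reaches either the embedding model or the chat model. The actual masking lives in the PII package under `src/rag`; the snippet below is only a minimal sketch of the idea, with hypothetical regex patterns and a hypothetical `mask_pii` helper name.

```python
import re

# Hypothetical, minimal PII masker: the real package may use NER or more patterns.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace recognisable PII with typed placeholders before any LLM call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    print(mask_pii("Contact Jane at jane.doe@acme.com or +44 20 7946 0958."))
    # -> "Contact Jane at [EMAIL] or [PHONE]."
```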
Repository layout:
- `src/rag/`: packages for agents, chunking, embeddings, vector store, LLM client, PII, prompts (vector-store sketch below)
- `data/`: input PDFs (already present)
- `app.py`: CLI entry point
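For orientation, the vector-store package wraps Chroma's persistent client. The sketch below shows the general shape under assumed names (the `policies` collection and the `source` metadata field are illustrative, not the skeleton's actual schema); embeddings are passed in explicitly because the skeleton uses Ollama's Arctic model rather than Chroma's built-in embedder.

```python
import chromadb

# Minimal sketch of the vector-store wrapper; names are assumptions.
client = chromadb.PersistentClient(path=".chroma")  # matches the CHROMA_DIR default
collection = client.get_or_create_collection(name="policies")

def upsert_chunks(ids, texts, embeddings, sources):
    """Store chunk texts with embeddings produced by the Ollama embedding model."""
    collection.upsert(
        ids=ids,
        documents=texts,
        embeddings=embeddings,
        metadatas=[{"source": s} for s in sources],
    )

def search(query_embedding, k=4):
    """Return the top-k chunks with their source filenames and distances."""
    return collection.query(query_embeddings=[query_embedding], n_results=k)
```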
Environment (PowerShell):
python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt
cp .env.example .env
Install and run Ollama, then pull models:
winget install Ollama.Ollama
ollama pull snowflake-arctic-embed:335m
ollama pull llama3.1:8b
Optional configuration:
$Env:OLLAMA_HOST="http://127.0.0.1:11434"
$Env:OLLAMA_EMBED_MODEL="snowflake-arctic-embed:335m"
$Env:OLLAMA_CHAT_MODEL="llama3.1:8b"
$Env:CHROMA_DIR=".chroma"
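A sketch of how these settings are typically consumed, assuming plain `os.getenv` lookups with the defaults shown above (the skeleton's own config code may differ):

```python
import os

# Defaults mirror the values shown above; the real config module may differ.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434")
if not OLLAMA_HOST.startswith(("http://", "https://")):
    OLLAMA_HOST = "http://" + OLLAMA_HOST  # the scheme is optional in the env var

EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "snowflake-arctic-embed:335m")
CHAT_MODEL = os.getenv("OLLAMA_CHAT_MODEL", "llama3.1:8b")
CHROMA_DIR = os.getenv("CHROMA_DIR", ".chroma")
```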
Run Ollama on a custom port (if 11434 is slow/blocked):
# Example: run on 11435
$Env:OLLAMA_HOST = "127.0.0.1:11435" # scheme optional; defaults to http://
ollama serve   # binds to the address set in OLLAMA_HOST
# In another terminal (set OLLAMA_HOST again if it is a new window)
ollama pull snowflake-arctic-embed:335m
ollama pull llama3.1:8b
The app reads OLLAMA_HOST for both embeddings and chat.
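Both kinds of calls go through Ollama's HTTP API at that address. A minimal sketch, assuming the `requests` package and the models pulled earlier:

```python
import os
import requests

host = os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434")
if not host.startswith("http"):
    host = "http://" + host

# Embed a chunk or query via Ollama's /api/embeddings endpoint.
emb = requests.post(
    f"{host}/api/embeddings",
    json={"model": "snowflake-arctic-embed:335m", "prompt": "sickness absence policy"},
    timeout=60,
).json()["embedding"]

# One non-streaming chat turn via Ollama's /api/chat endpoint.
chat = requests.post(
    f"{host}/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Summarise the retrieved policy text."}],
        "stream": False,
    },
    timeout=120,
).json()["message"]["content"]

print(len(emb), chat[:80])
```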
python app.py index
python app.py query "What is our sickness absence policy?"
Outputs an answer and a Sources list with filenames and scores.
- PII masking is applied to user inputs and LLM payloads.
- Chunking: default HYBRID_TOK (heading-aware splitting plus token-aware sentence packing). Configure via env (see the chunking sketch after this list):
  $Env:CHUNK_MODE="HYBRID_TOK"   (or HYBRID, HEADING)
  $Env:CHUNK_MAX_TOKENS="500"
  $Env:CHUNK_OVERLAP_TOKENS="60"
  $Env:CHUNK_MAX_CHARS="1200"
  $Env:CHUNK_OVERLAP="150"
  These tunables are read by `IndexPipeline`.
- Prompts live under `src/rag/prompts/`.
- Embeddings: `snowflake-arctic-embed:335m`
- Chat: any local model; the default is `llama3.1:8b`
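The chunking sketch referenced above illustrates the token-aware packing behind HYBRID_TOK using a crude whitespace token count; the real chunker in `src/rag` also handles headings and may use a proper tokenizer, so treat this only as a sketch of the packing and overlap logic.

```python
import re

def sentences(text: str) -> list[str]:
    # Naive sentence split; the real chunker may be smarter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def n_tokens(text: str) -> int:
    # Crude estimate; a real implementation would use the model's tokenizer.
    return len(text.split())

def pack(text: str, max_tokens: int = 500, overlap_tokens: int = 60) -> list[str]:
    """Pack sentences into chunks up to max_tokens, carrying a token overlap forward."""
    chunks, current, size = [], [], 0
    for sent in sentences(text):
        if current and size + n_tokens(sent) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one as overlap.
            tail, tail_size = [], 0
            for prev in reversed(current):
                if tail_size + n_tokens(prev) > overlap_tokens:
                    break
                tail.insert(0, prev)
                tail_size += n_tokens(prev)
            current, size = tail, tail_size
        current.append(sent)
        size += n_tokens(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the defaults above this yields chunks of roughly 500 whitespace tokens, each sharing about 60 tokens with its predecessor.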
uvicorn src.rag.api:app --host 0.0.0.0 --port 8000 --reload
- POST /index → { chunks_indexed }
- POST /query { query, k?, temperature? } → { answer, citations, guardrail_flags }
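Example client calls, assuming the uvicorn server above is running locally and `requests` is installed:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Build (or rebuild) the index from the PDFs under data/.
print(requests.post(f"{BASE}/index", timeout=600).json())  # e.g. {"chunks_indexed": ...}

# Ask a question; k and temperature are optional per the contract above.
resp = requests.post(
    f"{BASE}/query",
    json={"query": "What is our sickness absence policy?", "k": 4, "temperature": 0.1},
    timeout=120,
).json()
print(resp["answer"])
print(resp["citations"])
print(resp["guardrail_flags"])
```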