Local RAG pipeline for Claude Code — index PDFs and websites, search with hybrid retrieval.
- Hybrid Search — dense vectors + sparse keywords + cross-encoder reranking, fused with RRF
- PDF→RAG Pipeline — MinerU extraction → LLM cleanup → chunking → indexing
- Web→RAG Pipeline — crawled markdown → cleanup → chunking → indexing
- Auto-Start — GPU embedding and reranker servers start on demand, stop after 15 minutes idle
- Multi-Collection — multiple document collections, searchable independently or together
/plugin marketplace add brunowinter8192/claude-plugins
/plugin install rag
# Restart session
After install: configure plugin .env
The plugin cache contains only code. GPU binaries, models, and indexed data live in your local RAG clone. Tell the plugin where to find them:
# Point the plugin to your RAG project directory
echo "RAG_PROJECT_ROOT=/path/to/your/RAG/clone" \
>> ~/.claude/plugins/cache/brunowinter-plugins/rag/1.0.0/.env

Replace /path/to/your/RAG/clone with the absolute path where you cloned this repo, built llama.cpp, and downloaded your models. This file survives plugin updates.
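A wrong or relative RAG_PROJECT_ROOT is easy to miss, so it can help to sanity-check the value before restarting the session. The sketch below is not part of the plugin; `read_env` and `check_project_root` are hypothetical helper names, and the parser only handles plain KEY=VALUE lines.

```python
from pathlib import Path


def read_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_project_root(env: dict[str, str]) -> str:
    """Return 'ok' or a short description of what is wrong with RAG_PROJECT_ROOT."""
    root = env.get("RAG_PROJECT_ROOT")
    if not root:
        return "RAG_PROJECT_ROOT is not set"
    path = Path(root)
    if not path.is_absolute():
        return "RAG_PROJECT_ROOT must be an absolute path"
    if not path.is_dir():
        return f"{root} does not exist or is not a directory"
    return "ok"
```

To use it, read the plugin's .env file with `Path(...).read_text()`, pass the text to `read_env`, and print the result of `check_project_root`.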
# Index a PDF:
/rag:pdf-convert /path/to/document.pdf
# Search is handled by the rag-search agent automatically
- Docker (for PostgreSQL + pgvector)
- llama.cpp built with GPU support (Metal/CUDA)
- Python 3.11+
You choose your own models — any llama-server-compatible GGUF works for embedding and reranking. SPLADE uses a fixed HuggingFace model that auto-downloads. All model paths and ports are configured in .env.
start.sh starts PostgreSQL and all GPU servers. Manual setup is only needed for first-time installation.
1. Clone + venv
git clone https://github.com/brunowinter8192/RAG.git
cd RAG
python -m venv venv
./venv/bin/pip install -r requirements.txt
cp .env.example .env

2. Configure .env
Edit .env to set your model paths and ports. See .env.example for all available options. Key settings:
| Variable | What it does |
|---|---|
| RAG_PROJECT_ROOT | Absolute path to this repo (required for GPU server auto-start) |
| EMBEDDING_MODEL_PATH | Path to your embedding GGUF model |
| RERANKER_MODEL_PATH | Path to your reranker GGUF model |
| LLAMA_SERVER_PATH | Path to your llama-server binary |
| VECTOR_DIMENSION | Must match your embedding model's output dimension |
Defaults point to ./models/ and ./llama.cpp/build/bin/llama-server. Ports default to 8081 (embedding), 8082 (reranker), 8083 (SPLADE).
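Because VECTOR_DIMENSION has to match the embedding model, one way to find the right value is to request a single embedding and measure its length. The sketch below assumes the embedding server on port 8081 exposes llama-server's OpenAI-style `/v1/embeddings` endpoint; the function names are hypothetical and this is not how the plugin itself checks the setting.

```python
import json
import urllib.request


def embedding_dimension(response: dict) -> int:
    """Length of the first embedding in an OpenAI-style embeddings response."""
    return len(response["data"][0]["embedding"])


def fetch_embedding_response(port: int = 8081, text: str = "dimension probe") -> dict:
    """POST one input to the embedding server (assumed endpoint: /v1/embeddings)."""
    request = urllib.request.Request(
        f"http://localhost:{port}/v1/embeddings",
        data=json.dumps({"input": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return json.load(reply)


def dimension_matches(configured: int, response: dict) -> bool:
    """True if the configured VECTOR_DIMENSION equals the model's output size."""
    return embedding_dimension(response) == configured
```

With the embedding server running, `embedding_dimension(fetch_embedding_response())` gives the number to put in VECTOR_DIMENSION.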
3. Start PostgreSQL
docker compose up -d postgres

4. Build llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON # or -DGGML_CUDA=ON for NVIDIA
cmake --build build --config Release -j --target llama-server
cd ..

5. Download models
Choose any llama-server-compatible GGUF models. Example (our setup):
# Embedding (4096 dimensions)
huggingface-cli download Qwen/Qwen3-Embedding-8B-GGUF \
Qwen3-Embedding-8B-Q8_0.gguf --local-dir ./models/
# Reranker (auto-downloads on first use if using Qwen3-Reranker-0.6B)

Update EMBEDDING_MODEL_PATH and RERANKER_MODEL_PATH in .env to match your downloaded files.
The agent-rag-search Skill calls these automatically via the rag-cli wrapper. You can also run them directly from the terminal.
| Subcommand | What it does | When to use |
|---|---|---|
| search_hybrid | Hybrid semantic + keyword search with RRF fusion and cross-encoder reranking | Default choice for any collection |
| search | Pure semantic vector search | Conceptual questions, no exact terms needed |
| search_keyword | Exact term matching with stemming | Technical terms, function names, identifiers |
| read_document | Read continuous chunks from a document | After search: expand context around a hit |
| list_collections | Show all indexed collections with chunk counts | Discover what's available |
| list_documents | Show documents in a collection | Inspect a collection before filtering |
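For readers unfamiliar with the RRF fusion that search_hybrid mentions, here is a minimal sketch of reciprocal rank fusion over two ranked result lists. It illustrates the general technique only: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions, not values taken from this project's code.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
sparse_hits = ["b", "d", "a"]  # hypothetical keyword-search ranking
fused = rrf_fuse([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])
```

Documents that rank well in both lists rise to the top, which is why fusion helps when a query mixes conceptual wording with exact identifiers. In a pipeline like the one described here, a cross-encoder reranker would then rescore the fused candidates.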
- agent-rag-search Skill: Autonomous search agent. Dispatched automatically for RAG queries. Handles collection discovery, multi-query strategy, and deep reading via read_document.
- /rag:pdf-convert: Full PDF→RAG pipeline. Extracts PDF with MinerU, cleans markdown with LLM agent, chunks and indexes into PostgreSQL. Runs in phases with stop points for verification.
- /rag:web-md-index: Website markdown→RAG pipeline. Cleans crawled markdown (removes navigation, footers, UI chrome), then chunks and indexes.
PDF → RAG
- PDF extraction (MinerU) → raw markdown
- LLM cleanup (md-cleanup-master agent) → clean markdown
- Chunking + dense/sparse embedding → PostgreSQL with pgvector
- Ready to search via the agent-rag-search Skill
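The chunking step above is not specified further in this README. As an illustration of what such a step can look like, the sketch below greedily packs blank-line-separated paragraphs of cleaned markdown into chunks under a character budget. The function name, the budget, and the strategy itself are assumptions; the plugin's actual chunker may work differently.

```python
def chunk_markdown(text: str, max_chars: int = 1200) -> list[str]:
    """Pack paragraphs into chunks, aiming for at most max_chars per chunk."""
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        # A single oversized paragraph is kept whole rather than split mid-sentence.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact tends to give the embedding model coherent units of text, at the cost of uneven chunk sizes.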
Website → RAG
- Crawl website (e.g. via SearXNG /crawl-site) → markdown files
- LLM cleanup (web-md-cleanup agent) → clean markdown
- Chunking + dense/sparse embedding → PostgreSQL
- Ready to search via the agent-rag-search Skill
PostgreSQL connection refused
The database container is not running.
docker compose up -d postgres

Verify: docker ps --filter name=rag-postgres
Embedding server not responding (port 8081)
GPU servers auto-start on demand when the Skill runs a search. For manual start:
./venv/bin/python workflow.py server start

Check status:
./venv/bin/python workflow.py server status
curl -s localhost:8081/health

Reranker slow on first search
The reranker model (Qwen3-Reranker-0.6B, ~600MB) downloads on first use. Subsequent calls are fast. This is expected behavior — no action needed.
Search returns empty results
- Verify the collection exists: rag-cli list_collections (no GPU servers needed)
- Check that GPU servers are running: ./venv/bin/python workflow.py server status
- If servers are down: ./venv/bin/python workflow.py server start
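After a manual server start, the model can take a while to load, so a single health check may fail even though nothing is wrong. The sketch below polls the same /health URL used in the curl check above until it answers with HTTP 200. The function names, retry count, and delay are hypothetical, and whether the reranker and SPLADE ports expose the same endpoint is an assumption you would need to verify.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return reply.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str = "http://localhost:8081/health",
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll the health URL until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `workflow.py server start` tells you when the embedding server is ready for a search. The `probe` parameter exists so the loop can be tested without a running server.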
MIT