RAG

Local RAG pipeline for Claude Code — index PDFs and websites, search with hybrid retrieval.

Features

  • Hybrid Search — dense vectors + sparse keywords fused with RRF, then cross-encoder reranking
  • PDF→RAG Pipeline — MinerU extraction → LLM cleanup → chunking → indexing
  • Web→RAG Pipeline — crawled markdown → cleanup → chunking → indexing
  • Auto-Start — GPU embedding and reranker servers start on demand, stop after 15 minutes idle
  • Multi-Collection — multiple document collections, searchable independently or together
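To illustrate the fusion step named under Hybrid Search, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It is not taken from this repo's code: the function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every ID it contains.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c3", "c1", "c7"]   # hypothetical dense-vector ranking
sparse = ["c1", "c9", "c3"]  # hypothetical sparse-keyword ranking
fused = rrf_fuse([dense, sparse])
print(fused)
```

Chunks that appear near the top of both lists rise to the front. In a pipeline like the one described here, the fused list would then be passed to the cross-encoder reranker.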

Quick Start

/plugin marketplace add brunowinter8192/claude-plugins
/plugin install rag
# Restart session

After install: configure plugin .env

The plugin cache contains only code. GPU binaries, models, and indexed data live in your local RAG clone. Tell the plugin where to find them:

# Point the plugin to your RAG project directory
echo "RAG_PROJECT_ROOT=/path/to/your/RAG/clone" \
  >> ~/.claude/plugins/cache/brunowinter-plugins/rag/1.0.0/.env

Replace /path/to/your/RAG/clone with the absolute path where you cloned this repo, built llama.cpp, and downloaded your models. This file survives plugin updates.
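If the plugin cannot find your clone, a quick check of the .env file can help. The sketch below is an illustration, not the plugin's own loader: `read_env` and `check_project_root` are hypothetical helpers, and the simple KEY=VALUE parsing is an assumption about the file format.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Later lines overwrite earlier ones, which matches appending with >>.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_project_root(env_path):
    """Return the configured RAG_PROJECT_ROOT, or raise if it is unusable."""
    root = read_env(env_path).get("RAG_PROJECT_ROOT")
    if not root:
        raise ValueError(f"RAG_PROJECT_ROOT is not set in {env_path}")
    if not Path(root).is_absolute() or not Path(root).is_dir():
        raise ValueError(f"RAG_PROJECT_ROOT is not an existing absolute directory: {root}")
    return Path(root)
```

Call `check_project_root` with the path of the plugin .env file you appended to above. It raises a `ValueError` with the reason if the entry is missing or points nowhere.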

# Index a PDF:
/rag:pdf-convert /path/to/document.pdf

# Search is handled automatically by the agent-rag-search Skill

Prerequisites

  • Docker (for PostgreSQL + pgvector)
  • llama.cpp built with GPU support (Metal/CUDA)
  • Python 3.11+

You choose your own models — any llama-server-compatible GGUF works for embedding and reranking. SPLADE uses a fixed HuggingFace model that auto-downloads. All model paths and ports are configured in .env.

start.sh starts PostgreSQL and all GPU servers. Manual setup is only needed for first-time installation.

Setup

1. Clone + venv

git clone https://github.com/brunowinter8192/RAG.git
cd RAG
python -m venv venv
./venv/bin/pip install -r requirements.txt
cp .env.example .env

2. Configure .env

Edit .env to set your model paths and ports. See .env.example for all available options. Key settings:

| Variable | What it does |
| --- | --- |
| RAG_PROJECT_ROOT | Absolute path to this repo (required for GPU server auto-start) |
| EMBEDDING_MODEL_PATH | Path to your embedding GGUF model |
| RERANKER_MODEL_PATH | Path to your reranker GGUF model |
| LLAMA_SERVER_PATH | Path to your llama-server binary |
| VECTOR_DIMENSION | Must match your embedding model's output dimension |

Defaults point to ./models/ and ./llama.cpp/build/bin/llama-server. Ports default to 8081 (embedding), 8082 (reranker), 8083 (SPLADE).
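A mismatch between VECTOR_DIMENSION and the model's real output size is easy to introduce when swapping models. The sketch below shows one way to compare the two. It assumes the embedding server returns an OpenAI-style embeddings response (`data[0].embedding`), which llama-server can produce in embedding mode. Whether this project uses that endpoint is an assumption, and both function names are hypothetical.

```python
import json


def embedding_dimension(response_body):
    """Return the vector length from an OpenAI-style embeddings response."""
    payload = json.loads(response_body)
    return len(payload["data"][0]["embedding"])


def check_vector_dimension(response_body, configured):
    """Raise if the model's output size differs from VECTOR_DIMENSION."""
    actual = embedding_dimension(response_body)
    if actual != int(configured):
        raise ValueError(
            f"VECTOR_DIMENSION={configured} but the model returned {actual} values"
        )
    return actual


# Hypothetical response body; a real one would come from the embedding server.
sample = json.dumps({"data": [{"index": 0, "embedding": [0.1, 0.2, 0.3, 0.4]}]})
check_vector_dimension(sample, "4")
```

With the example model from the download step, the configured value would be 4096.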

3. Start PostgreSQL

docker compose up -d postgres

4. Build llama.cpp

cd llama.cpp
cmake -B build -DGGML_METAL=ON    # or -DGGML_CUDA=ON for NVIDIA
cmake --build build --config Release -j --target llama-server
cd ..

5. Download models

Choose any llama-server-compatible GGUF models. Example (our setup):

# Embedding (4096 dimensions)
huggingface-cli download Qwen/Qwen3-Embedding-8B-GGUF \
  Qwen3-Embedding-8B-Q8_0.gguf --local-dir ./models/

# Reranker (auto-downloads on first use if using Qwen3-Reranker-0.6B)

Update EMBEDDING_MODEL_PATH and RERANKER_MODEL_PATH in .env to match your downloaded files.

Usage

CLI Subcommands (via rag-cli)

The agent-rag-search Skill calls these automatically via the rag-cli wrapper. You can also run them directly from the terminal.

| Subcommand | What it does | When to use |
| --- | --- | --- |
| search_hybrid | Hybrid semantic + keyword search with RRF fusion and cross-encoder reranking | Default choice for any collection |
| search | Pure semantic vector search | Conceptual questions, no exact terms needed |
| search_keyword | Exact term matching with stemming | Technical terms, function names, identifiers |
| read_document | Read continuous chunks from a document | After search: expand context around a hit |
| list_collections | Show all indexed collections with chunk counts | Discover what's available |
| list_documents | Show documents in a collection | Inspect a collection before filtering |

Skills & Commands

  • agent-rag-search Skill — Autonomous search agent. Dispatched automatically for RAG queries. Handles collection discovery, multi-query strategy, and deep reading via read_document.
  • /rag:pdf-convert — Full PDF→RAG pipeline. Extracts PDF with MinerU, cleans markdown with LLM agent, chunks and indexes into PostgreSQL. Runs in phases with stop points for verification.
  • /rag:web-md-index — Website markdown→RAG pipeline. Cleans crawled markdown (removes navigation, footers, UI chrome), then chunks and indexes.

Workflows

PDF → RAG

  1. PDF extraction (MinerU) → raw markdown
  2. LLM cleanup (md-cleanup-master agent) → clean markdown
  3. Chunking + dense/sparse embedding → PostgreSQL with pgvector
  4. Ready to search via agent-rag-search Skill
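The chunking step in this workflow is not specified in detail here. As a rough illustration of what chunking means before embedding, the sketch below splits cleaned markdown into overlapping word windows. The strategy, the function name `chunk_words`, and the default sizes are all assumptions, not this repo's actual chunker.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample_text = " ".join(str(i) for i in range(10))
sample_chunks = chunk_words(sample_text, size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which is also why read_document is useful for expanding context around a hit.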

Website → RAG

  1. Crawl website (e.g. via SearXNG /crawl-site) → markdown files
  2. LLM cleanup (web-md-cleanup agent) → clean markdown
  3. Chunking + dense/sparse embedding → PostgreSQL
  4. Ready to search via agent-rag-search Skill

Troubleshooting

PostgreSQL connection refused

The database container is not running.

docker compose up -d postgres

Verify: docker ps --filter name=rag-postgres

Embedding server not responding (port 8081)

GPU servers auto-start on demand when the Skill runs a search. For manual start:

./venv/bin/python workflow.py server start

Check status:

./venv/bin/python workflow.py server status
curl -s localhost:8081/health

Reranker slow on first search

The reranker model (Qwen3-Reranker-0.6B, ~600MB) downloads on first use. Subsequent calls are fast. This is expected behavior — no action needed.

Search returns empty results

  1. Verify the collection exists: rag-cli list_collections (no GPU servers needed)
  2. Check that GPU servers are running: ./venv/bin/python workflow.py server status
  3. If servers are down: ./venv/bin/python workflow.py server start
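The curl health check used in this section can also be scripted. The sketch below uses only the Python standard library. The /health path on port 8081 comes from the check above; that the reranker (8082) and SPLADE (8083) servers expose the same path is an assumption, and both function names are hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if GET `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as down.
        return False


def server_report(ports, host="localhost"):
    """Map each port to the result of its /health check."""
    return {port: is_healthy(f"http://{host}:{port}/health") for port in ports}


# Example call (assumes the default ports): server_report([8081, 8082, 8083])
```

A `False` entry means that server is down or still loading its model. In that case, start the servers as in step 3.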

License

MIT
