A practical implementation of REFRAG for building more token-efficient RAG systems.
Traditional RAG systems use large chunks (512-1024 tokens) and send everything to the LLM. REFRAG optimizes this with:
- Micro-chunking: 16-32 token chunks for fine-grained retrieval
- Fast indexing: Direct encoding (NO LLM calls during indexing)
- Query-time compression: Dynamic policy decides RAW vs COMPRESSED chunks
- Mixed context: High-priority chunks get full detail, others compressed to keywords
Key benefits:
- Blazing fast indexing: No LLM overhead during indexing (seconds vs minutes)
- Fine-grained retrieval: Micro-chunks enable precise information extraction
- Smaller context windows: Query-time compression reduces token usage
- Better quality: Keep full detail for relevant chunks, compress the rest
Based on concepts from REFRAG research (arXiv:2509.01092). This implementation focuses on the core REFRAG approach: micro-chunking with query-time compression.
[Query]: "How does the transformer attention mechanism work?"
Standard RAG Context (Expensive):
┌─────────────────────────────────────────────────────────────┐
│ [Chunk 1: 512 tokens] │
│ ...full text about RNNs and sequential processing... │
│ (Irrelevant but you still pay for 512 tokens) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ [Chunk 2: 512 tokens] │
│ ...The attention mechanism computes a weighted sum... │
│ (Relevant - you need this!) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ [Chunk 3: 512 tokens] │
│ ...full text about CNNs and image processing... │
│ (Irrelevant but you still pay for 512 tokens) │
└─────────────────────────────────────────────────────────────┘
Total: ~1,536 tokens
REFRAG Context (Efficient):
┌─────────────────────────────────────────────────────────────┐
│ [COMPRESSED] RNNs sequential vanishing gradient LSTM │
│ (30 tokens - just keywords) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ [RAW] The attention mechanism computes a weighted sum of │
│ values based on query-key similarity, enabling the model... │
│ (512 tokens - full detail preserved) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ [COMPRESSED] CNNs convolution pooling image classification │
│ (25 tokens - just keywords) │
└─────────────────────────────────────────────────────────────┘
Total: ~567 tokens
Result: 63% fewer tokens, same answer quality ✨
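A quick back-of-the-envelope check of the numbers above (the token counts in the diagram are illustrative):

# Token arithmetic for the illustrative example above.
standard_tokens = 512 * 3          # three full 512-token chunks
refrag_tokens = 30 + 512 + 25      # two compressed chunks + one RAW chunk
savings = 1 - refrag_tokens / standard_tokens
print(f"{standard_tokens} -> {refrag_tokens} tokens ({savings:.0%} fewer)")
# 1536 -> 567 tokens (63% fewer)
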
When to use REFRAG:
- Large document collections (1000+ docs)
- Token cost is a concern
- Need precise retrieval (not just chunks)
- Fast indexing required
When traditional RAG is fine:
- Small collections (< 100 docs)
- Context window not a bottleneck
- Simplicity over optimization
from refrag import REFRAGRetriever, MicroChunker
# 1. Micro-chunk your documents
chunker = MicroChunker(chunk_size=32) # 32 tokens per chunk
chunks = chunker.chunk_documents(documents)
# 2. Fast indexing (NO LLM!)
retriever = REFRAGRetriever()
retriever.index(chunks) # Fast! Just encoder embeddings
# 3. Retrieve with query-time compression
result = retriever.retrieve_with_compression(
query="Tell me about machine learning",
top_k=10
)
# Result contains:
# - Mixed RAW + COMPRESSED context
# - Top 30% chunks: Full detail
# - Bottom 70% chunks: Keywords only

from refrag import MicroChunker
chunker = MicroChunker(chunk_size=32)
chunks = chunker.chunk_text("Your document here...")
# Creates small, precise chunks for better retrieval

retriever = REFRAGRetriever()
retriever.index(chunks) # Direct encoding only!
# 100x+ faster than LLM-based approaches

result = retriever.retrieve_with_compression(query, top_k=10)
# Automatically decides: RAW vs COMPRESSED per chunk
# Based on relevance scores

Example mixed context (tagged format):

[RAW]Python is a programming language created by Guido van Rossum.[/RAW]
[COMPRESSED]machine learning AI neural networks[/COMPRESSED]
[RAW]JavaScript runs in web browsers for interactive sites.[/RAW]
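If you need to post-process the mixed context (for logging or token accounting, say), the tagged format is easy to parse. A minimal sketch, assuming the tagged format shown above:

import re

context = (
    "[RAW]Python is a programming language created by Guido van Rossum.[/RAW]\n"
    "[COMPRESSED]machine learning AI neural networks[/COMPRESSED]"
)
raw_chunks = re.findall(r"\[RAW\](.*?)\[/RAW\]", context, flags=re.DOTALL)
compressed_chunks = re.findall(r"\[COMPRESSED\](.*?)\[/COMPRESSED\]", context, flags=re.DOTALL)
print(len(raw_chunks), "raw,", len(compressed_chunks), "compressed")  # 1 raw, 1 compressed
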
REFRAG is model-agnostic. It prepares the context before it reaches the LLM.
| Component | Support |
|---|---|
| LLMs | ✅ OpenAI (GPT-4, GPT-4o) ✅ Anthropic (Claude 3, Claude 3.5) ✅ Google (Gemini) ✅ Open-source (Llama 3, Mistral) ✅ Any LLM API that accepts text input |
| Embeddings | ✅ Any HuggingFace sentence-transformers model ✅ OpenAI Embeddings ✅ Custom embedding models |
| Vector DBs | 🔜 FAISS (planned) 🔜 Qdrant (planned) 🔜 Weaviate (planned) |
| Frameworks | ✅ Standalone ✅ LangChain (easy integration) 🔜 LlamaIndex (planned) |
REFRAG sits between retrieval and LLM generation:
# 1. REFRAG prepares optimized context
result = retriever.retrieve_with_compression(query)
context = result['context'] # Mixed RAW + COMPRESSED
# 2. Send to ANY LLM
# OpenAI
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}]
)
# Anthropic
response = anthropic.messages.create(
model="claude-3-5-sonnet-20241022",
messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}]
)
# Llama (via Ollama/HuggingFace)
# Works the same way!

Installation:

pip install refrag

Or from source:

git clone https://github.com/Shaivpidadi/refrag.git
cd refrag
pip install -r requirements.txt

Requirements:
- Python 3.8+
- sentence-transformers >= 2.2.0
- transformers >= 4.30.0
- torch >= 2.0.0
- numpy >= 1.21.0
Note: No OpenAI/Anthropic API keys are needed for indexing or retrieval; this implementation never calls an LLM to prepare context. You only need an LLM key when you send the finished context to a model for generation.
REFRAG consists of 5 core components:
┌─────────────────────────────────────────────────────────┐
│ 1. MicroChunker: 16-32 token chunks │
│ - Token-based (not character-based) │
│ - Fine-grained retrieval │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastEncoder: Direct embedding (NO LLM!) │
│ - sentence-transformers model │
│ - Seconds to index, not minutes │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 3. CompressionPolicy: Decide RAW vs COMPRESSED │
│ - Query-time decisions (not pre-compression) │
│ - Based on similarity scores │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 4. ChunkCompressor: Extract keywords │
│ - For low-priority chunks │
│ - Fast heuristic-based │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 5. MixedContextDecoder: Build final context │
│ - Combines RAW + COMPRESSED │
│ - Ready for LLM input │
└─────────────────────────────────────────────────────────┘
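The five components can be wired together explicitly. The sketch below mirrors the constructor arguments shown in the customization section later in this README, and assumes chunk_documents accepts a list of plain strings; the FastEncoder is created internally from the embedding_model argument:

from refrag import (
    REFRAGRetriever, MicroChunker, CompressionPolicy, ChunkCompressor, MixedContextDecoder
)

documents = ["Attention computes a weighted sum of values based on query-key similarity."]

# 1. MicroChunker -> 2. FastEncoder (inside the retriever) -> 3. policy -> 4. compressor -> 5. decoder
chunks = MicroChunker(chunk_size=32).chunk_documents(documents)
retriever = REFRAGRetriever(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    compression_policy=CompressionPolicy(raw_percentage=0.3, min_raw_chunks=2),
    compressor=ChunkCompressor(compression_method="keywords", max_keywords=5),
    decoder=MixedContextDecoder(format_style="tagged"),
)
retriever.index(chunks)
result = retriever.retrieve_with_compression("How does attention work?", top_k=10)
print(result["context"])
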
We benchmarked REFRAG against standard RAG on the HotpotQA dataset (Wikipedia question-answering) with 49,691 real documents.
REFRAG and standard RAG both use fast direct encoding (no LLM during indexing). The difference is in how they use context.
Example benchmark results (HotpotQA dataset, 49,691 documents):
| Metric | Standard RAG | REFRAG | Improvement |
|---|---|---|---|
| Chunk Strategy | Large chunks (512 tokens) | Micro-chunks (32 tokens) | Fine-grained retrieval |
| Compression | None | Query-time adaptive | Smart context |
| Indexing Speed | Fast (direct encoding) | Fast (direct encoding) | Same |
| Avg Tokens to LLM | ~177 tokens/query | ~83 tokens/query | 53% reduction |
| Retrieval Speed | 62.4ms/query | 22.5ms/query | 2.8x faster |
| LLM API Cost | Baseline | 53% lower | $$ Savings |
| Retrieval Quality | Good | Good | Same |
Indexing (Both Fast):
- Standard RAG: ~60s to index
- REFRAG: ~58s to index
- Both use direct encoding (no LLM calls)
Token Efficiency (REFRAG's Strength):
- Standard RAG: ~177 tokens/query sent to LLM
- REFRAG: ~83 tokens/query sent to LLM
- 53% fewer tokens = 53% cost savings on every query
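Per-query spend tracks the input-token reduction directly; plug in your own provider's rate (the price below is a placeholder, not a real quote):

# Per-query input cost comparison using the measured averages above.
price_per_1k_input_tokens = 0.01   # placeholder: substitute your model's rate (USD)
standard_cost = 177 / 1000 * price_per_1k_input_tokens
refrag_cost = 83 / 1000 * price_per_1k_input_tokens
print(f"{1 - refrag_cost / standard_cost:.0%} lower input cost per query")  # ~53%
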
Benchmark configuration: all-MiniLM-L6-v2 encoder, HotpotQA dataset with 5,000 samples. Your results may vary based on hardware, dataset, and configuration.
Run your own benchmark to see actual performance on your data:
pip install datasets
PYTHONPATH=. python examples/compare_with_vanilla_rag.py

The benchmark script calculates and reports actual measured values (not hardcoded claims). Results will vary based on your dataset and hardware.
| Feature | Standard RAG | REFRAG |
|---|---|---|
| Chunk size | 512-1024 tokens | 16-32 tokens (micro) |
| Indexing | Direct encoding | Direct encoding |
| Indexing speed | Fast | Same (fast) |
| Compression | None | Query-time adaptive |
| Context format | All chunks same | Mixed RAW/COMPRESSED |
| Tokens to LLM | ~177/query | ~83/query |
| Token efficiency | Baseline | 53% better |
| LLM API cost | Baseline | 53% lower |
| Retrieval precision | Chunk-level | Token-level |
| Tested on | - | 49,691 real docs |
You can customize the underlying components to fit your specific needs:
from refrag import REFRAGRetriever, MicroChunker, CompressionPolicy, ChunkCompressor
# 1. Change Chunk Size (Standard is 16-32)
chunker = MicroChunker(chunk_size=64) # Larger chunks for longer contexts
chunks = chunker.chunk_documents(documents)
# 2. Change Encoder Model (Supports any HuggingFace sentence-transformers model)
# Use a larger, more accurate model if needed
retriever = REFRAGRetriever(
embedding_model="BAAI/bge-small-en-v1.5" # Or "all-mpnet-base-v2", etc.
)
# Or with GPU support:
retriever = REFRAGRetriever(
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
device="cuda" # or "cpu", "mps" for Apple Silicon
)
# 3. Adjust Compression Aggressiveness
# raw_percentage: 0.0-1.0 (Higher = more chunks kept as RAW)
from refrag import CompressionPolicy
policy = CompressionPolicy(
raw_percentage=0.4, # Keep top 40% as RAW (default: 0.3)
min_raw_chunks=3, # Always keep at least 3 RAW (default: 2)
similarity_threshold=0.6 # Minimum score for RAW consideration
)
retriever = REFRAGRetriever(compression_policy=policy)
# 4. Custom Compression Method
from refrag import ChunkCompressor
compressor = ChunkCompressor(
compression_method="keywords", # "keywords", "entities", or "first_n"
max_keywords=10 # More keywords = better context (default: 5)
)
retriever = REFRAGRetriever(compressor=compressor)
# 5. Custom Context Format
from refrag import MixedContextDecoder
decoder = MixedContextDecoder(
format_style="separated" # "tagged", "separated", or "inline"
)
retriever = REFRAGRetriever(decoder=decoder)
# 6. Combine Everything
retriever = REFRAGRetriever(
embedding_model="BAAI/bge-large-en-v1.5",
compression_policy=CompressionPolicy(raw_percentage=0.5),
compressor=ChunkCompressor(compression_method="entities", max_keywords=8),
decoder=MixedContextDecoder(format_style="inline")
)

# For speed (smaller model, more compression)
retriever = REFRAGRetriever(
embedding_model="sentence-transformers/all-MiniLM-L6-v2", # Fast
compression_policy=CompressionPolicy(raw_percentage=0.2), # Compress 80%
device="cuda" # GPU acceleration
)
# For quality (larger model, less compression)
retriever = REFRAGRetriever(
embedding_model="BAAI/bge-large-en-v1.5", # High accuracy
compression_policy=CompressionPolicy(raw_percentage=0.5), # Keep 50% RAW
)
# For token efficiency (maximum compression)
retriever = REFRAGRetriever(
compression_policy=CompressionPolicy(raw_percentage=0.1), # Keep only 10% RAW
compressor=ChunkCompressor(max_keywords=3) # Minimal keywords
)

See examples/basic_usage.py for a complete working example.
Run the comprehensive benchmark using real-world data:
pip install datasets # Install HuggingFace datasets
PYTHONPATH=. python examples/compare_with_vanilla_rag.py

Results on 49,691 Wikipedia documents (HotpotQA):
- Dataset: 5,000 samples → 49,691 documents → 208,081 micro-chunks
- Indexing: Same speed as standard RAG (~58s, both use direct encoding)
- Token reduction: 53% fewer tokens sent to LLM per query
- Cost savings: 53% reduction in LLM API costs
- Retrieval speed: 2.8x faster with compression
- Quality: Same accuracy as standard RAG, better context efficiency
See HOTPOTQA_BENCHMARK_RESULTS.md for detailed analysis.
Indexing phase:
- Split documents into 16-32 token micro-chunks
- Encode directly with sentence-transformers
- Store embeddings + original chunks
- No LLM calls = blazing fast!
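Conceptually, the indexing steps above amount to the sketch below. This is a standalone illustration using sentence-transformers directly, not the library's internal code:

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
micro_chunks = [
    "attention computes a weighted sum of values",
    "RNNs process tokens sequentially and suffer from vanishing gradients",
]
embeddings = encoder.encode(micro_chunks, normalize_embeddings=True)  # no LLM calls
index = {"embeddings": np.asarray(embeddings), "chunks": micro_chunks}  # in-memory store
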
Query phase:
- Embed query
- Find top-k similar chunks via vector search
- Apply compression policy (decide RAW vs COMPRESSED)
- Compress low-priority chunks to keywords
- Build mixed context for LLM
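And the query steps, continuing the indexing sketch above. The 30% / minimum-2 split mirrors the CompressionPolicy defaults, and the first-five-words step is a crude stand-in for ChunkCompressor's keyword extraction:

query_emb = encoder.encode(["How does the attention mechanism work?"], normalize_embeddings=True)[0]
scores = index["embeddings"] @ query_emb                    # cosine similarity (vectors are normalized)
top = np.argsort(-scores)[:10]                              # top-k chunk indices
n_raw = max(2, int(0.3 * len(top)))                         # keep top 30% RAW, at least 2
raw_ids, compressed_ids = top[:n_raw], top[n_raw:]
parts = [f"[RAW]{index['chunks'][i]}[/RAW]" for i in raw_ids]
parts += [f"[COMPRESSED]{' '.join(index['chunks'][i].split()[:5])}[/COMPRESSED]" for i in compressed_ids]
context = "\n".join(parts)
print(context)
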
Key ideas:
- Micro-chunks: Precise retrieval at sub-document level
- No LLM during indexing: 100x+ speed improvement
- Query-time compression: Adaptive based on relevance
- Mixed context: Best of both worlds (detail + coverage)
The original REFRAG paper performs compression in Vector Space (injecting raw embeddings into the LLM).
Since commercial APIs (GPT-4, Claude) do not allow vector injection, this library adapts the architecture to Text Space:
- Vector Space (Paper): [Vector_A] [Vector_B] → LLM
- Text Space (This Repo): [Keywords_A] [Keywords_B] → LLM
This allows you to get the token-saving benefits of REFRAG on standard APIs without needing your own GPU cluster.
Current limitations:
- Uses heuristic-based compression policy (not RL-based like the paper)
- English-only keyword extraction (stopwords hardcoded)
- No vector database integration yet (in-memory only)
- Text-space compression (not vector-space like original paper)
Planned:
- RL-based compression policy training
- FAISS/Qdrant integration for large-scale deployments
- Multi-language support (non-English stopwords)
- Streaming/incremental indexing
- Vector-space compression (requires custom LLM)
- Built-in reranking support
- LlamaIndex/LangChain integration
Practical caveats:
- Very small chunks (< 16 tokens) may lose context
- Compression quality varies by domain (technical docs work best)
- Capitalized-word heuristic may miss important lowercase keywords
This implementation is based on the following paper:
@misc{lin2025refragrethinkingragbased,
title={REFRAG: Rethinking RAG based Decoding},
author={Xiaoqiang Lin and Aritra Ghosh and Bryan Kian Hsiang Low and Anshumali Shrivastava and Vijai Mohan},
year={2025},
eprint={2509.01092},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.01092},
}

If you use this implementation in your research, please cite both the original paper and this repository.
Based on REFRAG research by Meta AI. This is an independent implementation for the open-source community.
Disclaimer: This is not an official Meta product. For the official implementation, please refer to Meta's repositories.