A high-performance code repository indexing and retrieval system using metadata-only RAG (Retrieval-Augmented Generation) for efficient semantic search.
Scythe Context Engine indexes code repositories by extracting functions, classes, and other code structures, then creates searchable embeddings based on metadata rather than full code. This approach significantly reduces embedding costs and improves retrieval speed while maintaining high-quality results.
- Metadata-Only RAG: Embeddings are created from function names, docstrings, and AI-generated summaries instead of full code
- Efficient Storage: Full code is stored separately on disk and loaded only when needed
- Multi-Language Support: Python, JavaScript, TypeScript, Java, C/C++, Go, and Rust
- Smart Reranking: LLM-based reranking of search results for improved relevance
- Semantic Caching: Caches refined context to speed up repeated queries
- Parallel Processing: Multi-threaded indexing and embedding for fast processing
- Batch Processing: Optional Groq Batch API support for cost-effective indexing (up to 50% cost reduction)
- File Collection: Scans repository for supported code files
- AST Parsing: Uses tree-sitter to extract functions and classes
- Metadata Extraction: Extracts function names, docstrings, and line numbers
- Summarization: Generates AI summaries of each function
- Chunk Storage: Saves full code to the `full_chunks/` directory
- Embedding: Creates embeddings from metadata (name + docstring + summary)
- Index Creation: Builds FAISS vector index for fast similarity search
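Concretely, the whole indexing flow can be pictured in a few dozen lines. The sketch below is an illustration only: it handles just Python files, uses the standard-library `ast` module instead of tree-sitter, and substitutes a hash-based placeholder for the real embedding model, but the shape of the pipeline (extract metadata, store full code on disk, embed metadata only, build a FAISS index) matches the steps above.

```python
"""Illustrative sketch of the indexing pipeline (not the project's actual code).

Assumptions: Python files only, the built-in `ast` module in place of
tree-sitter, and a placeholder embedding function in place of the real
embedding model. Requires `numpy` and `faiss-cpu`.
"""
import ast
import hashlib
import json
import pickle
from pathlib import Path

import faiss
import numpy as np


def extract_chunks(py_file: Path) -> list[dict]:
    """Return metadata for each function/class in one Python file."""
    try:
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError):
        return []
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk_id = hashlib.sha1(
                f"{py_file}:{node.lineno}:{node.end_lineno}".encode()
            ).hexdigest()[:16]
            chunks.append({
                "chunk_id": chunk_id,
                "function_name": node.name,
                "file_path": str(py_file),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node) or "",
                "summary": "",  # the real pipeline fills this with an LLM summary
            })
    return chunks


def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Placeholder embedding: pseudo-random vectors seeded by a hash of the text."""
    seeds = [int(hashlib.sha1(t.encode()).hexdigest()[:8], 16) for t in texts]
    return np.vstack(
        [np.random.default_rng(s).standard_normal(dim) for s in seeds]
    ).astype("float32")


def index_repo(repo: Path, out: Path) -> None:
    out.mkdir(parents=True, exist_ok=True)
    (out / "full_chunks").mkdir(exist_ok=True)
    chunks = [c for f in repo.rglob("*.py") for c in extract_chunks(f)]
    if not chunks:
        return
    for c in chunks:
        # Full code is written to disk, not embedded and not kept in the index.
        lines = Path(c["file_path"]).read_text(encoding="utf-8").splitlines()
        code = "\n".join(lines[c["start_line"] - 1 : c["end_line"]])
        (out / "full_chunks" / f"{c['chunk_id']}.txt").write_text(code, encoding="utf-8")
    # Only metadata (name + docstring + summary) feeds the embeddings.
    texts = [f"{c['function_name']}\n{c['docstring']}\n{c['summary']}" for c in chunks]
    vectors = embed(texts)
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    faiss.write_index(index, str(out / "index.faiss"))
    (out / "chunks.pkl").write_bytes(pickle.dumps(chunks))
    (out / "meta.json").write_text(json.dumps({"num_chunks": len(chunks)}))


if __name__ == "__main__":
    index_repo(Path("."), Path("repo_index_demo"))
```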
- Query Embedding: Converts search query to vector
- Initial Retrieval: Finds top-k similar chunks using FAISS
- Reranking: LLM scores chunks based on metadata relevance
- Code Loading: Loads full code from disk for top-ranked chunks
- Context Refinement: LLM extracts essential context for the query
- Caching: Stores refined context for future identical queries
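Retrieval runs the same ideas in reverse. The sketch below works against the demo index produced by the indexing sketch above; the LLM-based reranking and refinement steps are reduced to comments because they depend on the configured model, and the placeholder `embed()` is repeated so the snippet stands alone.

```python
"""Illustrative sketch of the query pipeline (not the project's actual code)."""
import hashlib
import pickle
from pathlib import Path

import faiss
import numpy as np


def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Same placeholder embedding as in the indexing sketch."""
    seeds = [int(hashlib.sha1(t.encode()).hexdigest()[:8], 16) for t in texts]
    return np.vstack(
        [np.random.default_rng(s).standard_normal(dim) for s in seeds]
    ).astype("float32")


def query(index_dir: Path, text: str, top_k: int = 20, output_k: int = 5) -> str:
    index = faiss.read_index(str(index_dir / "index.faiss"))
    chunks = pickle.loads((index_dir / "chunks.pkl").read_bytes())

    # 1-2. Embed the query and retrieve the top-k candidates with FAISS.
    _, ids = index.search(embed([text]), top_k)
    candidates = [chunks[i] for i in ids[0] if i != -1]

    # 3. Rerank on metadata only. The real system asks an LLM to score each
    #    candidate; this sketch simply keeps the FAISS order.
    ranked = candidates[:output_k]

    # 4. Load full code from disk only for the chunks that survived reranking.
    codes = [
        (index_dir / "full_chunks" / f"{c['chunk_id']}.txt").read_text(encoding="utf-8")
        for c in ranked
    ]

    # 5-6. Context refinement (LLM extraction) and caching would happen here.
    return "\n\n".join(codes)


if __name__ == "__main__":
    print(query(Path("repo_index_demo"), "where is the FAISS index built?"))
```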
Installation:

```bash
uv pip install -e .
```

Standard (Real-time) Indexing:
```bash
uv run python index_repo.py /path/to/repo --output repo_index
```

Batch Indexing (Cost-Effective for Large Repos):
```bash
uv run python index_repo.py /path/to/repo --output repo_index --batch
```

The `--batch` flag uses Groq's Batch API for summarization, which:
- Reduces costs by up to 50%
- Takes longer (minutes to hours depending on batch completion window)
- Is ideal for initial indexing of large repositories
See Groq Batch Usage Guide for details.
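For orientation, the snippet below sketches how per-function summarization requests can be packaged into a JSONL batch file. The field layout follows the OpenAI-compatible batch request format that Groq's Batch API accepts, but the model name, prompt, and chunk record shape are placeholders rather than what this project actually emits.

```python
"""Sketch: packaging summarization requests as a JSONL batch file (illustrative)."""
import json
from pathlib import Path

# Hypothetical chunk records produced by the indexing step.
chunks = [
    {"chunk_id": "ab12cd34", "function_name": "load_config",
     "code": "def load_config(path):\n    ..."},
]

with Path("summarize_batch.jsonl").open("w", encoding="utf-8") as f:
    for chunk in chunks:
        request = {
            "custom_id": chunk["chunk_id"],       # used to match results back to chunks
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "llama-3.1-8b-instant",  # placeholder model name
                "messages": [{
                    "role": "user",
                    "content": f"Summarize this function in one sentence:\n{chunk['code']}",
                }],
            },
        }
        f.write(json.dumps(request) + "\n")

# The JSONL file is then uploaded, the batch is submitted, and results are
# collected once the batch completes (minutes to hours, per the completion window).
```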
Output Files:
- `repo_index/index.faiss` - FAISS vector index
- `repo_index/chunks.pkl` - Chunk metadata
- `repo_index/meta.json` - Index metadata
- `repo_index/full_chunks/` - Directory containing full code for each chunk
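These artifacts can also be opened directly if you want to poke at an index outside the provided scripts. A minimal sketch, assuming `chunks.pkl` holds a pickled list of metadata records:

```python
"""Sketch: inspecting the contents of a repo_index/ directory."""
import json
import pickle
from pathlib import Path

import faiss

index_dir = Path("repo_index")

index = faiss.read_index(str(index_dir / "index.faiss"))        # FAISS vector index
chunks = pickle.loads((index_dir / "chunks.pkl").read_bytes())  # chunk metadata
meta = json.loads((index_dir / "meta.json").read_text())        # index metadata

print(f"{index.ntotal} vectors for {len(chunks)} chunks")
print(json.dumps(meta, indent=2))
```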
Querying:

```bash
uv run python query_context.py "your search query" --index repo_index
```

Options:
- `--top-k N` - Number of chunks to retrieve initially (default: 20)
- `--output-k N` - Number of chunks in final output (default: 5)
- `--no-cache` - Disable semantic caching
Edit config/config.py to configure:
- Provider: Choose between `openrouter` or `ollama`
- Models: Set embedding and summarization models
- API Keys: Configure OpenRouter API key
- Ignored Paths: Customize which directories/files to skip during indexing
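As a point of reference, a provider/model configuration along these lines is what the options above describe; the variable names and model identifiers below are illustrative, not the actual contents of config/config.py.

```python
# Illustrative layout only; check config/config.py for the real option names.
PROVIDER = "openrouter"  # or "ollama"

EMBEDDING_MODEL = "text-embedding-3-small"  # placeholder model id
SUMMARIZATION_MODEL = "openai/gpt-4o-mini"  # placeholder model id

OPENROUTER_API_KEY = "sk-or-..."  # keep real keys out of version control

# Directories and files skipped during indexing.
IGNORED_PATHS = [
    ".git", "node_modules", ".venv", "__pycache__", "dist", "build",
]
```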
This version introduces a breaking change in how chunks are stored and retrieved.
- Old Behavior: Full code was embedded and stored in the vector index
- New Behavior: Only metadata (function name, docstring, summary) is embedded; full code is stored separately
If you have existing indexes, you must re-index your repositories:
```bash
# Delete old index
rm -rf repo_index/

# Re-index with new system
uv run python index_repo.py /path/to/repo --output repo_index
```

Benefits of the metadata-only approach:

- Cost Reduction: Embedding metadata is 10-100x cheaper than embedding full code
- Better Retrieval: Metadata provides clearer semantic signals for matching
- Flexibility: Full code can be loaded selectively, reducing memory usage
- Scalability: Enables indexing of much larger codebases
Each code chunk has the following metadata:
- `chunk_id`: Unique identifier (hash of file path + line numbers)
- `function_name`: Name of the function/class
- `file_path`: Relative path to source file
- `start_line`: Starting line number
- `end_line`: Ending line number
- `docstring`: Extracted docstring (if available)
- `summary`: AI-generated summary of the function
- `full_code_path`: Path to the stored full code file
- `node_type`: AST node type (e.g., function_definition, class_definition)
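Put together, a single chunk record looks roughly like this (all values are invented for illustration):

```python
chunk = {
    "chunk_id": "3f9a1c2e7b4d8a65",  # hash of file path + line numbers
    "function_name": "parse_config",
    "file_path": "src/config/loader.py",
    "start_line": 42,
    "end_line": 78,
    "docstring": "Load and validate the configuration file.",
    "summary": "Reads a config file, applies defaults, and raises on unknown keys.",
    "full_code_path": "repo_index/full_chunks/3f9a1c2e7b4d8a65.txt",
    "node_type": "function_definition",
}

# Only name + docstring + summary contribute to the embedding text;
# full_code_path is read from disk only for chunks that survive reranking.
embedding_text = f"{chunk['function_name']}\n{chunk['docstring']}\n{chunk['summary']}"
```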
- ~100-500 files/minute (depends on file size and model speed)
- Parallel processing with 8 worker threads for file processing
- Parallel embedding with 32 worker threads
- Initial retrieval: <100ms (FAISS search)
- Reranking: 1-3s (LLM scoring)
- Context refinement: 2-5s (LLM extraction)
- Cache hit: <10ms
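To check these numbers against your own setup, you can time the query CLI end to end; note that this includes interpreter startup, so it slightly overstates warm-cache latency.

```python
"""Sketch: rough end-to-end timing of the query CLI."""
import subprocess
import time

cmd = [
    "uv", "run", "python", "query_context.py",
    "how is the index built?", "--index", "repo_index",
]

start = time.perf_counter()
subprocess.run(cmd, check=True, capture_output=True)
print(f"first query: {time.perf_counter() - start:.2f}s")

# A second identical query should hit the semantic cache and return much faster.
start = time.perf_counter()
subprocess.run(cmd, check=True, capture_output=True)
print(f"repeat query: {time.perf_counter() - start:.2f}s")
```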
The summarization prompt can be customized in indexer/summarizer.py:
```python
def summarize_function(code: str, function_name: str, file_path: str) -> str:
    prompt = f"""Your custom prompt here..."""
    # ... rest of function
```

Chunk storage logic is in indexer/chunk_storage.py. You can modify:
- `generate_chunk_id()` - Change how chunk IDs are generated
- `save_full_chunk()` - Change storage format or location
- `load_full_chunk()` - Change loading logic
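For reference, minimal versions of those three hooks could look like the sketch below; it assumes chunk IDs are a SHA-1 of path plus line range and that full code is stored as plain-text files, which may not match the actual implementation.

```python
"""Sketch: possible implementations of the chunk-storage hooks (illustrative)."""
import hashlib
from pathlib import Path


def generate_chunk_id(file_path: str, start_line: int, end_line: int) -> str:
    """Stable ID derived from file path + line numbers, per the metadata schema."""
    return hashlib.sha1(f"{file_path}:{start_line}:{end_line}".encode()).hexdigest()[:16]


def save_full_chunk(chunk_id: str, code: str,
                    base_dir: Path = Path("repo_index/full_chunks")) -> Path:
    """Write the full code for one chunk to its own file and return the path."""
    base_dir.mkdir(parents=True, exist_ok=True)
    path = base_dir / f"{chunk_id}.txt"
    path.write_text(code, encoding="utf-8")
    return path


def load_full_chunk(chunk_id: str,
                    base_dir: Path = Path("repo_index/full_chunks")) -> str:
    """Read the full code back only when a chunk survives reranking."""
    return (base_dir / f"{chunk_id}.txt").read_text(encoding="utf-8")
```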
Reranking logic is in query_context/reranking.py:
- `_build_rerank_prompt()` - Customize the reranking prompt
- `_score_chunks_with_model()` - Change scoring logic
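A stripped-down version of that rerank step is sketched below for orientation; the prompt wording, the "index: score" reply format, and the `call_llm` parameter are placeholders, not the project's actual interface.

```python
"""Sketch: metadata-only reranking (illustrative; real prompts and parsing differ)."""
from typing import Callable


def _build_rerank_prompt(query: str, chunks: list[dict]) -> str:
    lines = [
        f"Query: {query}",
        "Rate each chunk 0-10 for relevance. Reply with one 'index: score' per line.",
    ]
    for i, c in enumerate(chunks):
        # Only metadata is shown to the model; full code stays on disk.
        lines.append(f"{i}. {c['function_name']}: {c.get('summary') or c.get('docstring', '')}")
    return "\n".join(lines)


def _score_chunks_with_model(query: str, chunks: list[dict],
                             call_llm: Callable[[str], str]) -> list[dict]:
    """Ask the model for scores, parse 'index: score' lines, sort high to low."""
    reply = call_llm(_build_rerank_prompt(query, chunks))
    scores: dict[int, float] = {}
    for line in reply.splitlines():
        idx, _, score = line.partition(":")
        if idx.strip().isdigit() and score.strip().replace(".", "", 1).isdigit():
            scores[int(idx.strip())] = float(score)
    order = sorted(range(len(chunks)), key=lambda i: scores.get(i, 0.0), reverse=True)
    return [chunks[i] for i in order]
```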
Solution: Ensure the index was created with the same output prefix you're querying
Solution:
- Reduce the number of worker threads in `file_processor.py`
- Use a faster summarization model
- Skip summarization for small functions
Solution:
- Increase `--top-k` to retrieve more candidates
- Adjust the similarity threshold in `query_context/query.py` (line 376)
- Use a better embedding model
This is a personal project, but suggestions and bug reports are welcome.
MIT License - See LICENSE file for details