
Scythe Context Engine

A high-performance code repository indexing and retrieval system using metadata-only RAG (Retrieval-Augmented Generation) for efficient semantic search.

Overview

Scythe Context Engine indexes code repositories by extracting functions, classes, and other code structures, then creates searchable embeddings based on metadata rather than full code. This approach significantly reduces embedding costs and improves retrieval speed while maintaining high-quality results.

Key Features

  • Metadata-Only RAG: Embeddings are created from function names, docstrings, and AI-generated summaries instead of full code
  • Efficient Storage: Full code is stored separately on disk and loaded only when needed
  • Multi-Language Support: Python, JavaScript, TypeScript, Java, C/C++, Go, and Rust
  • Smart Reranking: LLM-based reranking of search results for improved relevance
  • Semantic Caching: Caches refined context to speed up repeated queries
  • Parallel Processing: Multi-threaded indexing and embedding for fast processing
  • Batch Processing: Optional Groq Batch API support for cost-effective indexing (up to 50% cost reduction)

Architecture

Indexing Pipeline

  1. File Collection: Scans repository for supported code files
  2. AST Parsing: Uses tree-sitter to extract functions and classes
  3. Metadata Extraction: Extracts function names, docstrings, and line numbers
  4. Summarization: Generates AI summaries of each function
  5. Chunk Storage: Saves full code to full_chunks/ directory
  6. Embedding: Creates embeddings from metadata (name + docstring + summary)
  7. Index Creation: Builds FAISS vector index for fast similarity search
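
As an illustration of steps 2-3, the sketch below pulls the same kind of metadata (name, docstring, line numbers) out of a single Python file. The engine itself parses with tree-sitter so that all supported languages are handled; Python's built-in ast module is used here only to keep the example self-contained.

# Minimal sketch of metadata extraction (steps 2-3) for Python files.
# The real pipeline parses with tree-sitter; ast is used here only to
# keep the example self-contained and runnable.
import ast

def extract_function_metadata(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            records.append({
                "function_name": node.name,
                "docstring": ast.get_docstring(node) or "",
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return records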

Query Pipeline

  1. Query Embedding: Converts search query to vector
  2. Initial Retrieval: Finds top-k similar chunks using FAISS
  3. Reranking: LLM scores chunks based on metadata relevance
  4. Code Loading: Loads full code from disk for top-ranked chunks
  5. Context Refinement: LLM extracts essential context for the query
  6. Caching: Stores refined context for future identical queries
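
A minimal sketch of step 2 (initial retrieval): once the query has been embedded by whichever model is configured, the FAISS index returns the nearest chunks. The retrieve_top_k helper below is illustrative, not the project's actual API.

# Sketch of the initial-retrieval step. The query vector is assumed to
# come from the configured embedding model; this helper is illustrative,
# not the project's actual API.
import faiss
import numpy as np

def retrieve_top_k(query_vector, index_path: str, top_k: int = 20):
    index = faiss.read_index(index_path)               # e.g. repo_index/index.faiss
    query = np.asarray([query_vector], dtype="float32")
    distances, positions = index.search(query, top_k)
    return distances[0], positions[0]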

Installation

uv pip install -e .

Usage

Indexing a Repository

Standard (Real-time) Indexing:

uv run python index_repo.py /path/to/repo --output repo_index

Batch Indexing (Cost-Effective for Large Repos):

uv run python index_repo.py /path/to/repo --output repo_index --batch

The --batch flag uses Groq's Batch API for summarization, which:

  • Reduces costs by up to 50%
  • Takes longer (minutes to hours, depending on the batch completion window)
  • Is ideal for initial indexing of large repositories

See Groq Batch Usage Guide for details.

Output Files:

  • repo_index/index.faiss - FAISS vector index
  • repo_index/chunks.pkl - Chunk metadata
  • repo_index/meta.json - Index metadata
  • repo_index/full_chunks/ - Directory containing full code for each chunk
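
If you need to inspect an index by hand, the artifacts can be opened with standard tools (this assumes chunks.pkl is an ordinary pickle and meta.json plain JSON, as the extensions suggest):

# Inspecting the index artifacts by hand. Assumes chunks.pkl is a
# standard pickle and meta.json is plain JSON, as the extensions suggest.
import json
import pickle

import faiss

with open("repo_index/meta.json", encoding="utf-8") as f:
    meta = json.load(f)
with open("repo_index/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)
index = faiss.read_index("repo_index/index.faiss")

print(f"{index.ntotal} vectors, {len(chunks)} chunk metadata entries")
print(meta)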

Querying the Index

uv run python query_context.py "your search query" --index repo_index

Options:

  • --top-k N - Number of chunks to retrieve initially (default: 20)
  • --output-k N - Number of chunks in final output (default: 5)
  • --no-cache - Disable semantic caching

Configuration

Edit config/config.py to configure:

  • Provider: Choose between openrouter or ollama
  • Models: Set embedding and summarization models
  • API Keys: Configure OpenRouter API key
  • Ignored Paths: Customize which directories/files to skip during indexing
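
A rough sketch of what those settings might look like; the actual variable names, models, and defaults in config/config.py may differ:

# Illustrative config/config.py values only; the real variable names,
# models, and defaults in the repository may differ.
PROVIDER = "openrouter"            # or "ollama"

EMBEDDING_MODEL = "text-embedding-3-small"
SUMMARIZATION_MODEL = "llama-3.1-8b-instant"

OPENROUTER_API_KEY = "sk-or-..."   # keep real keys out of version control

IGNORED_PATHS = [".git", "node_modules", "__pycache__"]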

Breaking Changes (v2.0)

This version introduces a breaking change in how chunks are stored and retrieved.

What Changed

  • Old Behavior: Full code was embedded and stored in the vector index
  • New Behavior: Only metadata (function name, docstring, summary) is embedded; full code is stored separately

Migration

If you have existing indexes, you must re-index your repositories:

# Delete old index
rm -rf repo_index/

# Re-index with new system
uv run python index_repo.py /path/to/repo --output repo_index

Why This Change

  1. Cost Reduction: Embedding metadata is 10-100x cheaper than embedding full code
  2. Better Retrieval: Metadata provides clearer semantic signals for matching
  3. Flexibility: Full code can be loaded selectively, reducing memory usage
  4. Scalability: Enables indexing of much larger codebases

Data Models

FunctionMetadata

Each code chunk has the following metadata:

  • chunk_id: Unique identifier (hash of file path + line numbers)
  • function_name: Name of the function/class
  • file_path: Relative path to source file
  • start_line: Starting line number
  • end_line: Ending line number
  • docstring: Extracted docstring (if available)
  • summary: AI-generated summary of the function
  • full_code_path: Path to the stored full code file
  • node_type: AST node type (e.g., function_definition, class_definition)
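
In Python terms, the model is roughly the following dataclass (field names come from the list above; the actual definition in the codebase may differ slightly):

# Rough shape of FunctionMetadata. Field names follow the list above;
# the actual class in the codebase may differ slightly.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionMetadata:
    chunk_id: str            # hash of file path + line numbers
    function_name: str
    file_path: str           # relative to the repository root
    start_line: int
    end_line: int
    docstring: Optional[str]
    summary: str             # AI-generated summary
    full_code_path: str      # file under full_chunks/
    node_type: str           # e.g. "function_definition", "class_definition"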

Performance

Indexing Speed

  • ~100-500 files/minute (depends on file size and model speed)
  • Parallel file processing with 8 worker threads
  • Parallel embedding with 32 worker threads

Query Speed

  • Initial retrieval: <100ms (FAISS search)
  • Reranking: 1-3s (LLM scoring)
  • Context refinement: 2-5s (LLM extraction)
  • Cache hit: <10ms

Advanced Usage

Custom Summarization

The summarization prompt can be customized in indexer/summarizer.py:

def summarize_function(code: str, function_name: str, file_path: str) -> str:
    prompt = f"""Your custom prompt here..."""
    # ... rest of function
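
For example, a prompt tuned toward terse one-line summaries might look like this (the model call that follows is left unchanged):

# Example of a customized prompt; only the prompt changes, the model
# call that follows it stays the same.
def summarize_function(code: str, function_name: str, file_path: str) -> str:
    prompt = (
        f"In one sentence, describe what `{function_name}` in `{file_path}` "
        f"does and what it returns.\n\n{code}"
    )
    # ... rest of function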

Custom Chunk Storage

Chunk storage logic is in indexer/chunk_storage.py. You can modify:

  • generate_chunk_id() - Change how chunk IDs are generated
  • save_full_chunk() - Change storage format or location
  • load_full_chunk() - Change loading logic
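
For reference, a chunk ID consistent with the description under Data Models (a hash of the file path plus line numbers) could be produced as follows; the hash and format actually used by generate_chunk_id() may differ:

# One way to derive a chunk ID from file path + line numbers, matching
# the description under Data Models. The real generate_chunk_id() may
# use a different hash or format.
import hashlib

def generate_chunk_id(file_path: str, start_line: int, end_line: int) -> str:
    key = f"{file_path}:{start_line}-{end_line}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]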

Custom Reranking

Reranking logic is in query_context/reranking.py:

  • _build_rerank_prompt() - Customize the reranking prompt
  • _score_chunks_with_model() - Change scoring logic
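
A minimal sketch of the kind of metadata-only prompt _build_rerank_prompt() assembles; the real prompt and scoring format may differ:

# Sketch of a rerank prompt built from chunk metadata only (no full code).
# The actual prompt and scoring format in query_context/reranking.py
# may differ.
def _build_rerank_prompt(query: str, chunks: list[dict]) -> str:
    lines = [f"Query: {query}", "", "Rate each chunk's relevance from 0 to 10:"]
    for i, chunk in enumerate(chunks):
        lines.append(
            f"[{i}] {chunk['function_name']} ({chunk['file_path']}): {chunk['summary']}"
        )
    return "\n".join(lines)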

Troubleshooting

Issue: "Chunk not found" errors during query

Solution: Ensure the --index path you pass when querying matches the --output path used when the index was created

Issue: Slow indexing

Solution:

  • Reduce number of worker threads in file_processor.py
  • Use a faster summarization model
  • Skip summarization for small functions

Issue: Poor search results

Solution:

  • Increase --top-k to retrieve more candidates
  • Adjust the similarity threshold in query_context/query.py (line 376)
  • Use a better embedding model

Contributing

This is a personal project, but suggestions and bug reports are welcome.

License

MIT License - See LICENSE file for details
