feat: ColGREP semantic search integration + ripgrep improvements #109

@federiconeri

Summary

Add ColGREP as an optional semantic code search tool alongside an improved ripgrep-based search engine. The plan first makes the baseline ripgrep search more robust, then layers ColGREP on top as an optional upgrade for richer codebase understanding during the scan and spec phases.

Problem / Context

Current search has several limitations:

  • Two separate ripgrep wrappers (interview-tools.ts and tools.ts) with duplicated logic
  • Restrictive result limits (3/file, 20 total, 200-char truncation, 0 context lines)
  • Missing exclusions for common noise dirs (.next, coverage, __pycache__, .turbo, etc.)
  • hasRipgrep() runs "which rg" synchronously on every invocation, with no caching
  • Pattern-only search misses conceptual queries ("error handling middleware", "auth flow")

ColGREP is a semantic code search CLI that uses multi-vector embeddings (ColBERT) with Tree-sitter AST parsing. It wins 70% of head-to-head comparisons vs grep and reduces token usage by ~15%. It runs fully locally with a bundled 17M-param model — no API keys needed.

Proposed Solution

Architecture: Layered Search Module

src/ai/tools/search.ts      ← shared engine (executeSearch, validateSearchPath, hasRipgrep cache)
src/ai/tools/colgrep.ts     ← ColGREP detection, index management, semantic search wrapper
interview-tools.ts           ← wraps engine as search_codebase + semantic_search tools
tools.ts                     ← wraps engine as searchCode + semanticSearch tools
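
The "hasRipgrep cache" listed for search.ts could be as small as a memoized probe. This is a minimal sketch, not the project's implementation: cachedProbe is a hypothetical helper name, and the probe body is a stand-in for the real check (for example, spawning rg --version and testing the exit status).

```typescript
// Hypothetical helper: runs the probe at most once per process and
// returns the remembered answer on every later call.
function cachedProbe(probe: () => boolean): () => boolean {
  let cached: boolean | undefined;
  return () => {
    if (cached === undefined) {
      cached = probe();
    }
    return cached;
  };
}

// Stand-in probe so the sketch runs without ripgrep installed.
// A real probe would spawn the binary and inspect the result.
let probeRuns = 0;
const hasRipgrep = cachedProbe(() => {
  probeRuns += 1;
  return true; // pretend ripgrep was found
});

hasRipgrep();
hasRipgrep();
console.log(probeRuns); // prints 1: the probe ran only once
```

The same wrapper would serve hasColgrep(), since both checks only need to answer "is this binary on the PATH" once per process.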

Two Distinct AI Tools

  • search_codebase / searchCode — ripgrep-based, for exact patterns, regex, identifiers
  • semantic_search / semanticSearch — ColGREP-based, for natural language conceptual queries
  • AI chooses which to use based on query type
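
Since semantic_search is only meant to exist when ColGREP is available, tool registration can be a plain conditional. The sketch below assumes a simplified ToolDefinition shape and a hypothetical buildSearchTools function; the real tool objects in interview-tools.ts and tools.ts will carry schemas and handlers that are not shown here.

```typescript
// Simplified, assumed shape; real tool definitions will have more fields.
interface ToolDefinition {
  name: string;
  description: string;
}

// Hypothetical builder: ripgrep-backed search is always offered,
// semantic search only when ColGREP was detected.
function buildSearchTools(colgrepAvailable: boolean): ToolDefinition[] {
  const tools: ToolDefinition[] = [
    {
      name: "search_codebase",
      description: "Exact patterns, regex, and identifiers (ripgrep-based).",
    },
  ];
  if (colgrepAvailable) {
    tools.push({
      name: "semantic_search",
      description: "Natural language conceptual queries (ColGREP-based).",
    });
  }
  return tools;
}

const names = buildSearchTools(false).map((tool) => tool.name);
console.log(names.join(", ")); // prints "search_codebase"
```

Keeping the descriptions distinct matters here, because the model picks between the two tools based on what each one says it is for.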

ColGREP Integration

  • Detection: cached hasColgrep() binary check, stored in .ralph/ralph.config.cjs
  • Index build: during wiggum init, in parallel with AI analysis (60s timeout)
  • Index sync: quick incremental sync at the start of wiggum new (15s timeout)
  • Degradation: if not installed or index fails, silently fall back to ripgrep only
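
The timeout and degradation rules above can be sketched as a race between the index task and a timer. Everything named here (withTimeout, prepareIndex, IndexStatus) is hypothetical, and buildIndex stands in for whatever actually invokes ColGREP. Note that this pattern stops waiting for the task but does not cancel it.

```typescript
type IndexStatus = "ready" | "unavailable";

// Resolves with the task's value, or with `fallback` if the task
// rejects or does not settle within `ms` milliseconds.
async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([task, timeout]);
  } catch {
    return fallback; // a failed index is treated like a missing one
  } finally {
    if (timer !== undefined) {
      clearTimeout(timer); // do not keep the process alive for the timer
    }
  }
}

// Hypothetical wrapper: "unavailable" means callers use ripgrep only.
async function prepareIndex(
  buildIndex: () => Promise<void>,
  timeoutMs: number
): Promise<IndexStatus> {
  const task = buildIndex().then((): IndexStatus => "ready");
  return withTimeout(task, timeoutMs, "unavailable");
}
```

Under this sketch, wiggum init would call prepareIndex with a 60000 ms budget and wiggum new with 15000 ms, and neither caller needs to distinguish a timeout from a failure.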

Ripgrep Improvements

  • Shared executeSearch() engine replacing two duplicated implementations
  • Comprehensive exclusion list (dirs + globs)
  • Cached hasRipgrep() — checked once per process
  • Better defaults: 5/file, 25-50 total, 500-char content, 2 context lines
  • Robust result parsing with proper regex instead of first-colon split
  • .gitignore respected (ripgrep default behavior preserved)
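
A rough sketch of how the new defaults, exclusions, and regex-based parsing might fit together. The function and type names are illustrative, not the planned executeSearch() API, and the exclusion list contains only the directories named in this issue. The ripgrep flags used (--line-number, --no-heading, --color, --max-count, --context, --glob) are standard.

```typescript
interface SearchMatch {
  file: string;
  line: number;
  content: string;
}

interface SearchOptions {
  maxPerFile: number;   // e.g. 5
  contextLines: number; // e.g. 2
}

// Only the directories named in this issue; the real list would be longer.
const EXCLUDED_DIRS = [".next", "coverage", "__pycache__", ".turbo"];

// Builds the argument vector for `rg`; .gitignore handling is left at
// ripgrep's default because no --no-ignore flag is passed.
function buildRipgrepArgs(pattern: string, searchPath: string, opts: SearchOptions): string[] {
  const args = [
    "--line-number",
    "--no-heading",
    "--color", "never",
    "--max-count", String(opts.maxPerFile),
    "--context", String(opts.contextLines),
  ];
  for (const dir of EXCLUDED_DIRS) {
    args.push("--glob", `!**/${dir}/**`);
  }
  args.push("--", pattern, searchPath);
  return args;
}

// Match lines look like `path:line:content`. The lazy first group keeps
// extending until it reaches `:digits:`, so drive letters and colons in
// the content do not break parsing the way a first-colon split does.
const MATCH_LINE = /^(.+?):(\d+):(.*)$/;

function parseMatchLine(raw: string, maxContentLength = 500): SearchMatch | null {
  const m = MATCH_LINE.exec(raw);
  if (m === null) {
    return null; // context lines (`path-line-content`) and `--` separators
  }
  return {
    file: m[1],
    line: Number(m[2]),
    content: m[3].slice(0, maxContentLength),
  };
}

const parsed = parseMatchLine("src/a.ts:12:const x: number = 1");
console.log(parsed); // { file: 'src/a.ts', line: 12, content: 'const x: number = 1' }
```

One caveat with line-oriented parsing: a context line whose text happens to contain a `:N:` sequence would be misread as a match. If that turns out to matter, ripgrep's --json output avoids the ambiguity at the cost of a heavier parser.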

Files to Modify

New Files

| File | Purpose |
| --- | --- |
| src/ai/tools/search.ts | Core search engine |
| src/ai/tools/colgrep.ts | ColGREP detection, index, search |
| src/ai/tools/__tests__/search.test.ts | Engine tests |
| src/ai/tools/__tests__/colgrep.test.ts | ColGREP tests |
| src/ai/tools/__tests__/search-integration.test.ts | Tool wrapper tests |

Modified Files

| File | Changes |
| --- | --- |
| src/ai/tools/index.ts | Export new modules |
| src/ai/tools.ts | Replace inline ripgrep with executeSearch(), add semanticSearch |
| src/ai/conversation/interview-tools.ts | Replace inline ripgrep/grep with executeSearch(), add semantic_search |
| src/ai/enhancer.ts | Detect colgrep, parallel syncIndex(), pass availability |
| src/ai/agents/codebase-analyzer.ts | Pass colgrepAvailable to tools |
| src/tui/orchestration/interview-orchestrator.ts | Quick colgrep sync, pass availability |
| src/ai/conversation/spec-generator.ts | Same as orchestrator for CLI path |
| src/utils/tui.ts | Add semantic_search to TOOL_ICONS |
| src/tui/hooks/useSpecGenerator.ts | Add format case for semantic_search |
| src/generator/config.ts | Add tools.colgrep to config |
| src/ai/prompts.ts | Mention semanticSearch when available |

Acceptance Criteria

  • Shared executeSearch() engine with comprehensive exclusions and cached binary detection
  • ColGREP detection, index build/sync, and semantic search wrapper
  • semantic_search tool registered conditionally when ColGREP is available
  • Index built in parallel during wiggum init, synced during wiggum new
  • Graceful degradation: ColGREP is optional, ripgrep is the default engine, and grep is the fallback when ripgrep is missing
  • Improved ripgrep defaults (500-char content, 2 context lines, better limits)
  • TUI displays semantic search tool calls with icon
  • Unit tests for engine, ColGREP module, and tool registration
  • ColGREP availability persisted in .ralph/ralph.config.cjs

Design Doc

docs/plans/2026-02-24-colgrep-search-improvements-design.md

Labels

  • ai/llm: AI workflows, agents, prompts
  • feature: New capability
  • scanner: Project detection, tech stack scanning