Description
Problem
Users often search with abbreviations while code uses full names:
- Search: auth → Code: authenticate, authentication
- Search: db → Code: database, DatabaseConnection
- Search: config → Code: configuration, settings
Current hybrid search (queries.py:hybrid_search) passes queries directly to BM25 and the embedding model without expansion, causing missed results.
Proposed Solution
Add a query expansion layer that handles developer abbreviations and synonyms dynamically, without hard-coded mappings.
Approaches to Consider
Option A: Learn from Codebase (Recommended)
Extract abbreviation patterns from symbol names in the indexed codebase:
- AuthService → learns auth ↔ authentication
- DBConnection → learns db ↔ database
- ConfigManager → learns config ↔ configuration
Build a dynamic abbreviation map during indexing by detecting common prefixes and compound words.
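A minimal sketch of how such a map could be learned, assuming the indexer can supply a flat list of symbol names. The helper names (`split_symbol`, `is_abbreviation`, `build_abbreviation_map`) and the matching rule (same first letter, ordered subsequence, at least three characters shorter) are illustrative assumptions, not existing code in this repository.

```python
import re
from collections import defaultdict


def split_symbol(name: str) -> list[str]:
    """Split camelCase, PascalCase, and snake_case names into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def is_abbreviation(short: str, long: str) -> bool:
    """Heuristic (assumption): same first letter, and `short` appears in order
    inside `long`, which must be at least three characters longer."""
    if not (2 <= len(short) <= len(long) - 3) or short[0] != long[0]:
        return False
    remaining = iter(long)
    # `ch in remaining` consumes the iterator, so this is an ordered subsequence test.
    return all(ch in remaining for ch in short)


def build_abbreviation_map(symbols: list[str]) -> dict[str, set[str]]:
    """Build a bidirectional abbreviation map from symbol names.

    Compares every token pair, so it is quadratic in the number of distinct
    tokens; a real indexer would likely bucket tokens by first letter.
    """
    tokens = {t for name in symbols for t in split_symbol(name)}
    mapping: dict[str, set[str]] = defaultdict(set)
    for short in tokens:
        for long in tokens:
            if is_abbreviation(short, long):
                mapping[short].add(long)
                mapping[long].add(short)
    return dict(mapping)


symbols = [
    "AuthService", "authenticate_user",
    "DBConnection", "DatabaseConnection",
    "ConfigManager", "load_configuration",
]
abbreviations = build_abbreviation_map(symbols)
```

With these sample symbols, `abbreviations["db"]` contains `database` and `abbreviations["auth"]` contains `authenticate`. The subsequence rule is deliberately loose and will produce false positives on larger vocabularies, so a frequency or co-occurrence filter would probably be needed in practice.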
Option B: Embedding-Based Expansion
Use the existing embedding model to find semantically similar terms:
- Embed the query
- Find similar words/phrases from the indexed symbol names
- Expand query with high-similarity matches
Leverages the existing jina-code-embeddings-0.5b model already in use.
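The three steps above can be sketched as follows. This is an illustration only: `embed` is a hypothetical callable standing in for whatever wrapper the project uses around the embedding model, and the `threshold` and `top_k` defaults are arbitrary assumptions. The demo uses hand-written two-dimensional vectors, not real model output.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_with_embeddings(
    query: str,
    vocabulary: Sequence[str],
    embed: Callable[[str], Vector],  # hypothetical wrapper around the model
    threshold: float = 0.8,          # assumed cut-off, needs tuning
    top_k: int = 3,
) -> list[str]:
    """Return the query followed by up to `top_k` similar vocabulary terms."""
    query_vec = embed(query)
    # In practice the vocabulary vectors would be precomputed at index time;
    # embedding them per query here keeps the sketch short.
    scored = [(cosine(query_vec, embed(term)), term)
              for term in vocabulary if term != query]
    scored.sort(reverse=True)
    return [query] + [term for score, term in scored[:top_k] if score >= threshold]


# Placeholder vectors for demonstration only.
fake_vectors = {
    "auth": [1.0, 0.0],
    "authenticate": [0.9, 0.1],
    "database": [0.0, 1.0],
}
expanded = expand_with_embeddings(
    "auth", ["authenticate", "database"], fake_vectors.__getitem__
)
print(expanded)  # -> ['auth', 'authenticate']
```

Precomputing and caching the vocabulary vectors is what would keep this approach within a per-query latency budget, since only the query itself then needs a model call.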
Option C: NLP Stemming/Lemmatization
Use a lightweight NLP library (e.g., nltk, spaCy, or porter2stemmer) to handle:
- Plural/singular: users ↔ user
- Verb forms: authenticate ↔ authenticating ↔ authenticated
- Common English morphological variants
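To show the idea without pulling in a dependency, the sketch below uses a deliberately naive suffix stripper. It is a stand-in, not the Porter algorithm and not what nltk or spaCy do; `naive_stem`, `build_stem_index`, and `expand_with_stems` are hypothetical names. A real implementation would swap `naive_stem` for a library stemmer.

```python
from collections import defaultdict

# Checked in order; longer suffixes first so "ing" wins over "s" or "e".
SUFFIXES = ("ing", "ed", "es", "s", "e")


def naive_stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(vocabulary: list[str]) -> dict[str, set[str]]:
    """Group indexed terms by their stem."""
    index: dict[str, set[str]] = defaultdict(set)
    for term in vocabulary:
        index[naive_stem(term)].add(term.lower())
    return dict(index)


def expand_with_stems(query: str, index: dict[str, set[str]]) -> list[str]:
    """Return each query word plus every indexed term sharing its stem."""
    expanded: set[str] = set()
    for word in query.lower().split():
        expanded.add(word)
        expanded |= index.get(naive_stem(word), set())
    return sorted(expanded)


stem_index = build_stem_index(
    ["authenticate", "authenticating", "authenticated", "user", "users", "database"]
)
variants = expand_with_stems("authenticated", stem_index)
```

Here `variants` holds all three verb forms of authenticate. As the trade-off table notes, this kind of expansion would not connect db to database, so it complements rather than replaces the other options.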
Implementation Location
queries.py:99-194 (hybrid_search function) - add expansion step before BM25/vector search.
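One possible shape for the expansion step is sketched below. The actual signature of hybrid_search is not shown in this issue, so nothing here calls it; `expand_query`, the `Expander` type, and the `max_terms` cap are hypothetical. The idea is that any of the options above can be plugged in as a per-word expander.

```python
from typing import Callable, Iterable

# An expander maps a single query word to candidate expansion terms.
Expander = Callable[[str], Iterable[str]]


def expand_query(query: str, expanders: list[Expander], max_terms: int = 8) -> list[str]:
    """Return the original query words followed by de-duplicated expansions.

    Original words always come first; `max_terms` bounds the total so the
    expanded query cannot grow without limit.
    """
    terms = query.split()
    seen = set(terms)
    for word in query.split():
        for expander in expanders:
            for candidate in expander(word):
                if candidate not in seen and len(terms) < max_terms:
                    seen.add(candidate)
                    terms.append(candidate)
    return terms


# Example expander backed by a learned map (contents are illustrative).
learned = {"auth": ["authenticate", "authentication"]}
terms = expand_query("auth login", [lambda w: learned.get(w, [])])
bm25_query = " ".join(terms)
```

One design choice worth discussing: the expanded string could be sent to BM25 only, with the original query still used for the vector search, since the embedding model may already capture some of the semantic similarity on its own.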
Trade-offs
| Approach | Pro | Con |
|---|---|---|
| Learn from codebase | Domain-specific, no maintenance | Requires re-scan on significant changes |
| Embedding-based | Reuses existing model | May add latency, less precise |
| Stemming only | Fast, deterministic | Misses domain-specific abbreviations |
Acceptance Criteria
- Query expansion improves recall for abbreviated searches
- No hard-coded synonym dictionary required
- No significant latency regression (<50ms overhead)
- Unit tests for expansion logic
- Integration test showing improved recall
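For the recall criteria, a small helper like the following could make "improved recall" measurable. It is a sketch: `recall_at_k` is a hypothetical name, and the file names in the example are made up for illustration rather than taken from real search results.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


# Illustrative data only: results for the query "auth" before and after expansion.
relevant = {"auth_service.py", "authenticate.py", "authentication_utils.py"}
baseline_results = ["auth_service.py", "readme.md", "settings.py"]
expanded_results = ["auth_service.py", "authenticate.py", "authentication_utils.py"]

baseline_recall = recall_at_k(baseline_results, relevant, k=3)
expanded_recall = recall_at_k(expanded_results, relevant, k=3)
```

An integration test could then assert that `expanded_recall` is at least `baseline_recall` over a fixed set of abbreviated queries, alongside a timing check against the 50 ms overhead budget.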