Skip to content

Add dynamic query expansion for developer abbreviations and synonyms #3

@kapillamba4

Description

@kapillamba4

Problem

Users often search with abbreviations while code uses full names:

  • Search: auth → Code: authenticate, authentication
  • Search: db → Code: database, DatabaseConnection
  • Search: config → Code: configuration, settings

Current hybrid search (queries.py:hybrid_search) passes queries directly to BM25 and the embedding model without expansion, causing missed results.

Proposed Solution

Add a query expansion layer that handles developer abbreviations and synonyms dynamically, without hard-coded mappings.

Approaches to Consider

Option A: Learn from Codebase (Recommended)

Extract abbreviation patterns from symbol names in the indexed codebase:

  • AuthService → learns authauthentication
  • DBConnection → learns dbdatabase
  • ConfigManager → learns configconfiguration

Build a dynamic abbreviation map during indexing by detecting common prefixes and compound words.

Option B: Embedding-Based Expansion

Use the existing embedding model to find semantically similar terms:

  • Embed the query
  • Find similar words/phrases from the indexed symbol names
  • Expand query with high-similarity matches

Leverages the existing jina-code-embeddings-0.5b model already in use.

Option C: NLP Stemming/Lemmatization

Use a lightweight NLP library (e.g., nltk, spaCy, or porter2stemmer) to handle:

  • Plural/singular: usersuser
  • Verb forms: authenticateauthenticatingauthenticated
  • Common English morphological variants

Implementation Location

queries.py:99-194 (hybrid_search function) - add expansion step before BM25/vector search.

Trade-offs

Approach Pro Con
Learn from codebase Domain-specific, no maintenance Requires re-scan on significant changes
Embedding-based Reuses existing model May add latency, less precise
Stemming only Fast, deterministic Misses domain-specific abbreviations

Acceptance Criteria

  • Query expansion improves recall for abbreviated searches
  • No hard-coded synonym dictionary required
  • No significant latency regression (<50ms overhead)
  • Unit tests for expansion logic
  • Integration test showing improved recall

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions