[Epic]: Semantic Schema Search - Embeddings-based Tag Discovery and Library Schema Support #88

@neuromechanist

Overview

This epic addresses a fundamental architectural improvement: integrating embeddings-based semantic search to improve HED tag discovery, enable multi-schema support, and provide better control over tag extension behavior. This is a parent issue that unifies several related feature requests.

Related Issues

  • #62 and #63: non-extension mode (constrain the agent to existing tags)
  • #87: multi-schema / library schema search

Problem Statement

Current Limitations

  1. Schema Loading: Currently only the base HED schema (8.3.0) is loaded into context
  2. Library Schemas: No support for library schemas (SCORE, LANG, Mouse, etc.), which contain domain-specific vocabulary
  3. Tag Discovery: The LLM receives a flat vocabulary list without semantic relationships
  4. Extension Control: No mechanism to constrain the agent to existing tags only
  5. Context Efficiency: Loading all library schemas would bloat the system prompt and break caching

Why This Matters

Users like @bendichter report that the annotation agent often creates custom tag extensions when valid existing tags would suffice. Example:

Input: "reward_time: Reward delivery. The animal received a drop of juice..."
Expected: Sensory-event, Experiment-stimulus, Gustatory-presentation, Feedback, Reward
Got: Animal/Receiver, Item/Juice (custom extensions)

The root cause is that the LLM doesn't have semantic understanding of which existing tags are closest to the concepts being annotated.

Proposed Solution: Embeddings-Based Semantic Search

Leverage the proven approach from hed-lsp (available locally at ~/Documents/git/HED/hed-lsp), which implements:

1. Dual-Embedding Architecture

  • Tag embeddings: Pre-computed vectors for all HED tags
  • Keyword embeddings: Curated domain-specific terms (neuroscience vocabulary) that map to HED tags
  • Combined scoring gives higher confidence when evidence comes from both sources (see the sketch below)
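
A minimal sketch of the combined-scoring idea, assuming cosine-similarity scores in [0, 1] and returning plain (tag, score) pairs; the boost factor is an illustrative choice, not a value taken from hed-lsp.

def combine_and_rank(keyword_votes: dict[str, float],
                     direct_matches: dict[str, float],
                     boost: float = 1.25,
                     top_k: int = 10) -> list[tuple[str, float]]:
    # Start from the direct tag-embedding similarities.
    scores = dict(direct_matches)
    for tag, vote in keyword_votes.items():
        if tag in scores:
            # Evidence from both sources: boost the stronger of the two scores.
            scores[tag] = max(scores[tag], vote) * boost
        else:
            # Keyword-only evidence is kept at its vote score.
            scores[tag] = vote
    # Rank by combined score and return the top_k tags.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]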

2. Keyword Index

A deterministic lookup table mapping common terms to HED tags:

KEYWORD_INDEX = {
    "monkey": ["Animal", "Animal-agent"],
    "juice": ["Reward", "Drink"],
    "mouse": ["Animal", "Animal-agent", "Computer-mouse"],
    "seizure": ["sc:Seizure"],  # SCORE library
    # ... hundreds of neuroscience terms
}

3. Model

Uses Qwen3-Embedding-0.6B (ONNX, quantized) for runtime semantic search. Pre-computed embeddings are loaded at startup; the model is only needed for query-time embedding.
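
As a rough sketch of that startup/query-time split, the snippet below uses sentence-transformers as a Python-native stand-in for the ONNX runtime (see Embedding Model Options under Technical Details); the Hugging Face model ID and file names are assumptions for illustration, not decisions made in this issue.

import numpy as np
from sentence_transformers import SentenceTransformer

# Startup: load pre-computed tag vectors; no model inference happens here.
tag_vectors = np.load("embeddings/hed_8.3.0_tags.npy")   # shape (n_tags, dim), rows L2-normalized
tag_names = open("embeddings/hed_8.3.0_tags.txt", encoding="utf-8").read().splitlines()

# Query time: embed the user's term once and rank tags by cosine similarity.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
query_vec = model.encode("reward delivery", normalize_embeddings=True)
scores = tag_vectors @ query_vec                          # cosine similarity (rows are normalized)
top_tags = [tag_names[i] for i in np.argsort(scores)[::-1][:10]]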

Benefits

Resolves #62 and #63 (Non-Extension Mode)

With semantic search, we can:

  1. Find the closest existing tags for any input term
  2. Provide a --no-extend flag that disables tag extensions entirely
  3. Show the agent which existing tags are semantically closest, making extensions unnecessary (sketched below)
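
A minimal sketch of how points 2 and 3 could combine, assuming a set of valid short-form tags built from the loaded schema(s) and a SemanticMatch result with a tag field; function and variable names are illustrative, not the project's actual API.

def enforce_no_extend(suggested: list[str], schema_tags: set[str]) -> list[str]:
    """Keep only tags that already exist in the loaded schema(s).

    When --no-extend is set, anything outside schema_tags would be an
    extension, so it is swapped for its closest existing tag instead.
    """
    kept = []
    for tag in suggested:
        if tag in schema_tags:
            kept.append(tag)
        else:
            # find_similar is sketched under "Search Algorithm" below.
            closest = find_similar(tag, top_k=1)
            if closest:
                kept.append(closest[0].tag)   # assumes SemanticMatch exposes a .tag field
    return kept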

Resolves #87 (Multi-Schema Search)

With embeddings:

  1. Keep base schema in LLM context for caching
  2. Search library schemas on-demand when relevant terms detected
  3. Pull in relevant library tags without loading entire schemas (see the sketch below)
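
A sketch of the on-demand flow, assuming per-library embedding indexes keyed by prefix and a hypothetical search interface; the threshold and names are illustrative.

from typing import Protocol

class LibraryIndex(Protocol):
    # Hypothetical interface: search returns (short_form_tag, similarity) pairs.
    def search(self, term: str, top_k: int) -> list[tuple[str, float]]: ...

def library_hints(term: str,
                  indexes: dict[str, LibraryIndex],
                  threshold: float = 0.55) -> list[str]:
    """Return prefixed library tags (e.g. "sc:Seizure") worth injecting.

    Base-schema tags stay in the cached system prompt; a library tag is only
    pulled in when the term matches that library's embeddings strongly enough.
    """
    hints = []
    for prefix, index in indexes.items():   # e.g. {"sc:": score_index, "lang:": lang_index}
        for tag, score in index.search(term, top_k=3):
            if score >= threshold:
                hints.append(f"{prefix}{tag}")
    return hints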

Additional Benefits

  1. Faster annotation: Semantic hints reduce LLM reasoning iterations
  2. More consistent: Same input produces same tag suggestions
  3. Domain-aware: Keyword index contains neuroscience-specific mappings
  4. Extensible: Easy to add new library schemas by generating their embeddings

Implementation Plan

Phase 1: Core Infrastructure

  • Port embeddings.ts concepts to Python module (src/utils/semantic_search.py)
  • Create keyword index with neuroscience terminology (start with hed-lsp's KEYWORD_INDEX)
  • Pre-compute embeddings for HED 8.3.0 base schema
  • Add a storage format for embeddings (JSON or NumPy arrays; one possible layout is sketched below)
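
One possible storage layout, assuming NumPy arrays for the vectors plus a JSON sidecar for tag metadata in the same row order; file names and fields are placeholders, not a decided format.

import json
import numpy as np

def save_embeddings(path_prefix: str, tags: list[dict], vectors: np.ndarray) -> None:
    # Vectors go in a compact binary file; metadata (short form, long form,
    # library prefix) in a human-readable JSON sidecar, row-aligned with the vectors.
    np.save(f"{path_prefix}.npy", vectors.astype(np.float32))
    with open(f"{path_prefix}.json", "w", encoding="utf-8") as f:
        json.dump(tags, f, indent=2)

def load_embeddings(path_prefix: str) -> tuple[list[dict], np.ndarray]:
    vectors = np.load(f"{path_prefix}.npy")
    with open(f"{path_prefix}.json", encoding="utf-8") as f:
        tags = json.load(f)
    return tags, vectors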

Phase 2: Integration with Annotation Agent

  • Add semantic search to vocabulary lookup in annotation_agent.py
  • Include "closest tags" hints in the system prompt when user input contains known keywords (see the sketch after this list)
  • Add --no-extend CLI flag to disable extensions entirely
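
A sketch of what such a hint could look like, assuming the find_similar function from the Search Algorithm section and a SemanticMatch with a tag field; the wording is illustrative.

def closest_tag_hint(term: str, top_k: int = 5) -> str:
    matches = find_similar(term, top_k=top_k)
    if not matches:
        return ""
    tags = ", ".join(m.tag for m in matches)
    # Appended to the system prompt only when the user's input contains a
    # known keyword, so the cached base prompt stays unchanged.
    return f'Closest existing HED tags for "{term}": {tags}. Prefer these over creating extensions.'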

Phase 3: Library Schema Support

  • Generate embeddings for library schemas (SCORE, LANG, Mouse)
  • Add schema selection options to API and CLI
  • Dynamic library tag injection based on semantic match

Phase 4: Optimization

  • Quantized model support for faster inference
  • Caching strategy for frequently used queries (sketched below)
  • Performance benchmarks
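
For the query-side cache, a simple LRU over the embedding call may be enough, since annotation runs tend to repeat the same column names and terms; a minimal sketch, assuming the embed helper from the Search Algorithm section:

from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_embed(term: str) -> tuple[float, ...]:
    # lru_cache needs hashable values, so the vector is returned as a tuple;
    # callers can wrap it back into a NumPy array when scoring.
    return tuple(embed(term))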

Technical Details

Embedding Model Options

  1. Qwen3-Embedding-0.6B (as in hed-lsp) - 1024 dimensions, ONNX quantized ~150MB
  2. sentence-transformers alternatives - more Python-native options
  3. OpenAI/Anthropic embeddings - API-based, no local model needed

Data Structures

from dataclasses import dataclass

@dataclass
class TagEmbedding:
    tag: str           # Short form (e.g., "Building")
    long_form: str     # Full path
    prefix: str        # Library prefix (e.g., "sc:" for SCORE)
    vector: list[float]

@dataclass
class KeywordEmbedding:
    keyword: str       # Natural language term
    targets: list[str] # HED tags this keyword points to
    vector: list[float]
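
For illustration, a record for a SCORE library tag might look like this (vector values are placeholders and the long form is left schematic rather than quoted from the SCORE schema):

seizure = TagEmbedding(
    tag="Seizure",
    long_form=".../Seizure",         # full path filled in from the SCORE schema at embedding time
    prefix="sc:",
    vector=[0.012, -0.087, 0.143],   # truncated; real vectors have ~1024 dimensions
)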

Search Algorithm

def find_similar(query: str, top_k: int = 10) -> list[SemanticMatch]:
    # SemanticMatch is assumed to carry at least a tag and a score.
    # 1. Check keyword index first (deterministic)
    keyword_hit = KEYWORD_INDEX.get(query.lower())
    if keyword_hit:
        return [SemanticMatch(tag=tag, score=1.0) for tag in keyword_hit]

    # 2. Embed the query
    query_embedding = embed(query)

    # 3. Search keyword embeddings - collect votes
    keyword_votes = search_keywords(query_embedding)

    # 4. Search tag embeddings directly
    direct_matches = search_tags(query_embedding)

    # 5. Combine evidence - boost tags supported by both sources
    return combine_and_rank(keyword_votes, direct_matches, top_k=top_k)
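
Illustrative calls (actual results depend on the keyword index and embeddings):

find_similar("juice")                # deterministic keyword hit -> Reward, Drink
find_similar("drop of apple juice")  # not a keyword key -> embedding search path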

Acceptance Criteria

  • hedit annotate --no-extend "reward delivery" produces only existing tags
  • hedit annotate --schema "8.3.0,sc:2.0.0" searches both base and SCORE schemas
  • Semantic search suggests relevant tags for domain terms (e.g., "seizure" → sc:Seizure)
  • Annotation consistency improves (same inputs → same tag suggestions)
  • No significant latency increase for annotation

Metadata

Labels: enhancement (New feature or request), priority: high (High priority - important for upcoming release), type: feature (New feature or enhancement)
