Overview
This epic addresses a fundamental architectural improvement: integrating embeddings-based semantic search to improve HED tag discovery, enable multi-schema support, and provide better control over tag extension behavior. This is a parent issue that unifies several related feature requests.
Related Issues
- Support non-extension mode #62 - Support non-extension mode (find closest existing tags)
- [Feature]: Add options for customized prompts, for not allowing tag extension and summarization #63 - Options for not allowing tag extension
- [Feature]: ability to search multiple schemas #87 - Ability to search multiple schemas (core + library schemas)
Problem Statement
Current Limitations
- Schema Loading: Currently only the base HED schema (8.3.0) is loaded into context
- Library Schemas: No support for library schemas (SCORE, LANG, Mouse, etc.) which contain domain-specific vocabulary
- Tag Discovery: The LLM receives a flat vocabulary list without semantic relationships
- Extension Control: No mechanism to constrain the agent to existing tags only
- Context Efficiency: Loading all library schemas would bloat the system prompt and break caching
Why This Matters
Users like @bendichter report that the annotation agent often creates custom tag extensions when valid existing tags would suffice. Example:
Input: "reward_time: Reward delivery. The animal received a drop of juice..."
Expected: Sensory-event, Experiment-stimulus, Gustatory-presentation, Feedback, Reward
Got: Animal/Receiver, Item/Juice (custom extensions)
The root cause is that the LLM doesn't have semantic understanding of which existing tags are closest to the concepts being annotated.
Proposed Solution: Embeddings-Based Semantic Search
Leverage the proven approach from hed-lsp (available locally at ~/Documents/git/HED/hed-lsp), which implements:
1. Dual-Embedding Architecture
- Tag embeddings: Pre-computed vectors for all HED tags
- Keyword embeddings: Curated domain-specific terms (neuroscience vocabulary) that map to HED tags
- Combined scoring gives higher confidence when evidence from both sources
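The combined-scoring idea can be sketched as follows. This is a minimal illustration, not the hed-lsp implementation; the function name `combine_scores` and the 0.5 boost weight are assumptions for the example:

```python
# Hypothetical sketch: boost tags that appear in both the direct tag-embedding
# search and the keyword-embedding "votes". The 0.5 weight is illustrative.
def combine_scores(direct: dict[str, float],
                   keyword_votes: dict[str, float]) -> dict[str, float]:
    combined = dict(direct)
    for tag, vote_score in keyword_votes.items():
        if tag in combined:
            # Evidence from both sources -> higher confidence
            combined[tag] += 0.5 * vote_score
        else:
            combined[tag] = 0.5 * vote_score
    # Highest-confidence tags first
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))
```

A tag found by both the tag search and a keyword vote ends up ranked above a tag found by only one source.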
2. Keyword Index
A deterministic lookup table mapping common terms to HED tags:
```python
KEYWORD_INDEX = {
    "monkey": ["Animal", "Animal-agent"],
    "juice": ["Reward", "Drink"],
    "mouse": ["Animal", "Animal-agent", "Computer-mouse"],
    "seizure": ["sc:Seizure"],  # SCORE library
    # ... hundreds of neuroscience terms
}
```
3. Model
Uses Qwen3-Embedding-0.6B (ONNX quantized) for runtime semantic search. Pre-computed embeddings are loaded at startup; the model is only needed to embed queries at runtime.
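With pre-computed vectors, query-time search reduces to one embedding call plus a cosine-similarity scan over a matrix. A minimal numpy sketch (the query vector would come from the ONNX model, which is not shown here):

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, tag_matrix: np.ndarray,
                 tags: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Rank tags by cosine similarity between the query and pre-computed rows."""
    q = query_vec / np.linalg.norm(query_vec)
    m = tag_matrix / np.linalg.norm(tag_matrix, axis=1, keepdims=True)
    scores = m @ q                          # one dot product per tag
    order = np.argsort(scores)[::-1][:top_k]
    return [(tags[i], float(scores[i])) for i in order]
```

Because the tag matrix is normalized once and reused, each query costs a single matrix-vector product, which stays fast even for thousands of tags.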
Benefits
Resolves #62 and #63 (Non-Extension Mode)
With semantic search, we can:
- Find the closest existing tags for any input term
- Provide a `--no-extend` flag that disables tag extensions entirely
- Show the agent which existing tags are semantically closest, making extensions unnecessary
Resolves #87 (Multi-Schema Search)
With embeddings:
- Keep base schema in LLM context for caching
- Search library schemas on-demand when relevant terms detected
- Pull in relevant library tags without loading entire schemas
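The on-demand behavior above can be sketched as a thin routing layer. This is an illustrative assumption about the design, not existing code; `search_base` and the per-library searchers stand in for whatever index objects Phase 1 produces:

```python
# Hypothetical sketch: base schema is always searched; library schemas only
# contribute tags that clear a similarity threshold, so whole library
# schemas never have to enter the LLM context.
def search_with_libraries(query: str,
                          search_base,
                          library_searchers: dict,
                          threshold: float = 0.6) -> list[tuple[str, float]]:
    """search_base / library_searchers values: callables query -> [(tag, score)]."""
    matches = list(search_base(query))
    for prefix, search in library_searchers.items():
        for tag, score in search(query):
            if score >= threshold:
                matches.append((f"{prefix}:{tag}", score))
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

The threshold keeps weak library matches out of the prompt, preserving prompt caching for the base schema while still surfacing tags like `sc:Seizure` when they are clearly relevant.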
Additional Benefits
- Faster annotation: Semantic hints reduce LLM reasoning iterations
- More consistent: Same input produces same tag suggestions
- Domain-aware: Keyword index contains neuroscience-specific mappings
- Extensible: Easy to add new library schemas by generating their embeddings
Implementation Plan
Phase 1: Core Infrastructure
- Port `embeddings.ts` concepts to a Python module (`src/utils/semantic_search.py`)
- Create keyword index with neuroscience terminology (start with hed-lsp's KEYWORD_INDEX)
- Pre-compute embeddings for HED 8.3.0 base schema
- Add storage format for embeddings (JSON or numpy arrays)
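One possible storage format, combining the two options above: a compressed numpy archive for the float matrix plus a JSON sidecar for the tag names. The function names and file layout here are a sketch, not a decided format:

```python
import json
import numpy as np

def save_embeddings(tags: list[str], vectors: np.ndarray, path: str) -> None:
    """Offline step: persist pre-computed tag embeddings to disk."""
    # npz keeps the float matrix compact; JSON keeps the tag list readable.
    np.savez_compressed(path + ".npz", vectors=vectors)
    with open(path + ".json", "w") as f:
        json.dump({"tags": tags, "dim": int(vectors.shape[1])}, f)

def load_embeddings(path: str) -> tuple[list[str], np.ndarray]:
    """Startup step: load tags and their embedding matrix."""
    with open(path + ".json") as f:
        meta = json.load(f)
    vectors = np.load(path + ".npz")["vectors"]
    return meta["tags"], vectors
```

Row *i* of the matrix corresponds to `tags[i]`, so the search code never needs per-tag objects on the hot path.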
Phase 2: Integration with Annotation Agent
- Add semantic search to vocabulary lookup in `annotation_agent.py`
- Include "closest tags" hints in the system prompt when user input contains known keywords
- Add a `--no-extend` CLI flag to disable extensions entirely
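Wiring the flag through to the prompt could look like the following. This is an illustrative sketch; `build_system_prompt` and the exact prompt wording are assumptions, not the agent's actual API:

```python
import argparse

def build_system_prompt(no_extend: bool) -> str:
    # Hypothetical prompt assembly: the flag tightens the instruction set.
    base = "Annotate the input using HED tags."
    if no_extend:
        base += (" Use ONLY tags that exist in the loaded schemas;"
                 " never create tag extensions.")
    return base

parser = argparse.ArgumentParser(prog="hedit")
parser.add_argument("--no-extend", action="store_true",
                    help="restrict output to existing schema tags")
# Demo invocation; a real CLI would parse sys.argv
args = parser.parse_args(["--no-extend"])
prompt = build_system_prompt(args.no_extend)
```

Combined with the "closest tags" hints, the constraint is enforceable: the agent is both told not to extend and shown existing alternatives.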
Phase 3: Library Schema Support
- Generate embeddings for library schemas (SCORE, LANG, Mouse)
- Add schema selection options to API and CLI
- Dynamic library tag injection based on semantic match
Phase 4: Optimization
- Quantized model support for faster inference
- Caching strategy for frequently-used queries
- Performance benchmarks
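The caching item above can be as simple as memoizing the query-embedding step, since the same event labels recur across datasets. A sketch assuming an arbitrary `embed_fn` (the real one would wrap the ONNX model):

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize: int = 1024):
    """Wrap a query->vector function so repeated queries skip the model."""
    @lru_cache(maxsize=maxsize)
    def cached(term: str) -> tuple[float, ...]:
        # Tuples are hashable, so results can live in the lru_cache
        return tuple(embed_fn(term))

    def embed(term: str) -> tuple[float, ...]:
        # Normalize before caching so "Juice " and "juice" share one entry
        return cached(term.lower().strip())

    return embed
```

Normalizing before the cache lookup matters: without it, trivially different spellings of the same term would each pay the model cost.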
Technical Details
Embedding Model Options
- Qwen3-Embedding-0.6B (as in hed-lsp) - 1024 dimensions, ONNX quantized ~150MB
- sentence-transformers alternatives - more Python-native options
- OpenAI/Anthropic embeddings - API-based, no local model needed
Data Structures
```python
@dataclass
class TagEmbedding:
    tag: str              # Short form (e.g., "Building")
    long_form: str        # Full path
    prefix: str           # Library prefix (e.g., "sc:" for SCORE)
    vector: list[float]

@dataclass
class KeywordEmbedding:
    keyword: str          # Natural language term
    targets: list[str]    # HED tags this keyword points to
    vector: list[float]
```
Search Algorithm
```python
def find_similar(query: str, top_k: int = 10) -> list[SemanticMatch]:
    # 1. Check keyword index first (deterministic)
    if query.lower() in KEYWORD_INDEX:
        return keyword_matches
    # 2. Embed query and search
    query_embedding = embed(query)
    # 3. Search keyword embeddings - collect votes
    keyword_votes = search_keywords(query_embedding)
    # 4. Search tag embeddings directly
    direct_matches = search_tags(query_embedding)
    # 5. Combine evidence - boost tags with both sources
    return combine_and_rank(keyword_votes, direct_matches)
```
Acceptance Criteria
- `hedit annotate --no-extend "reward delivery"` produces only existing tags
- `hedit annotate --schema "8.3.0,sc:2.0.0"` searches both base and SCORE schemas
- Semantic search suggests relevant tags for domain terms (e.g., "seizure" → sc:Seizure)
- Annotation consistency improves (same inputs → same tag suggestions)
- No significant latency increase for annotation
References
- hed-lsp embeddings.ts - TypeScript implementation
- hed-lsp KEYWORD_INDEX - Curated keyword mappings
- HED Schema vocabularies: https://github.com/hed-standard/hed-schemas