[Epic]: Semantic Schema Search - Embeddings-based Tag Discovery and Library Schema Support #88

@neuromechanist

Overview

This epic addresses a fundamental architectural improvement: integrating embeddings-based semantic search to improve HED tag discovery, enable multi-schema support, and provide better control over tag extension behavior. This is a parent issue that unifies several related feature requests.

Related Issues

  • #62 and #63: non-extension mode (constrain the agent to existing tags)
  • #87: multi-schema / library schema search

Problem Statement

Current Limitations

  1. Schema Loading: Currently only the base HED schema (8.3.0) is loaded into context
  2. Library Schemas: No support for library schemas (SCORE, LANG, Mouse, etc.), which contain domain-specific vocabulary
  3. Tag Discovery: The LLM receives a flat vocabulary list without semantic relationships
  4. Extension Control: No mechanism to constrain the agent to existing tags only
  5. Context Efficiency: Loading all library schemas would bloat the system prompt and break caching

Why This Matters

Users like @bendichter report that the annotation agent often creates custom tag extensions when valid existing tags would suffice. Example:

Input: "reward_time: Reward delivery. The animal received a drop of juice..."
Expected: Sensory-event, Experiment-stimulus, Gustatory-presentation, Feedback, Reward
Got: Animal/Receiver, Item/Juice (custom extensions)

The root cause is that the LLM doesn't have semantic understanding of which existing tags are closest to the concepts being annotated.

Proposed Solution: Embeddings-Based Semantic Search

Leverage the proven approach from hed-lsp (available locally at ~/Documents/git/HED/hed-lsp), which implements:

1. Dual-Embedding Architecture

  • Tag embeddings: Pre-computed vectors for all HED tags
  • Keyword embeddings: Curated domain-specific terms (neuroscience vocabulary) that map to HED tags
  • Combined scoring gives higher confidence when evidence comes from both sources (see the sketch below)
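
A minimal sketch of the combined-scoring idea, assuming cosine-similarity scores in [0, 1] and returning plain (tag, score) pairs; the boost factor is an illustrative choice, not a value taken from hed-lsp.

def combine_and_rank(keyword_votes: dict[str, float],
                     direct_matches: dict[str, float],
                     boost: float = 1.25,
                     top_k: int = 10) -> list[tuple[str, float]]:
    # Start from the direct tag-embedding similarities.
    scores = dict(direct_matches)
    for tag, vote in keyword_votes.items():
        if tag in scores:
            # Evidence from both sources: boost the stronger of the two scores.
            scores[tag] = max(scores[tag], vote) * boost
        else:
            # Keyword-only evidence is kept at its vote score.
            scores[tag] = vote
    # Rank by combined score and return the top_k tags.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]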

2. Keyword Index

A deterministic lookup table mapping common terms to HED tags:

KEYWORD_INDEX = {
    "monkey": ["Animal", "Animal-agent"],
    "juice": ["Reward", "Drink"],
    "mouse": ["Animal", "Animal-agent", "Computer-mouse"],
    "seizure": ["sc:Seizure"],  # SCORE library
    # ... hundreds of neuroscience terms
}

3. Model

Uses Qwen3-Embedding-0.6B (ONNX, quantized) for runtime semantic search. Pre-computed embeddings are loaded at startup; the model is only needed for query-time embedding.
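
As a rough sketch of that startup/query-time split, the snippet below uses sentence-transformers as a Python-native stand-in for the ONNX runtime (see Embedding Model Options under Technical Details); the Hugging Face model ID and file names are assumptions for illustration, not decisions made in this issue.

import numpy as np
from sentence_transformers import SentenceTransformer

# Startup: load pre-computed tag vectors; no model inference happens here.
tag_vectors = np.load("embeddings/hed_8.3.0_tags.npy")   # shape (n_tags, dim), rows L2-normalized
tag_names = open("embeddings/hed_8.3.0_tags.txt", encoding="utf-8").read().splitlines()

# Query time: embed the user's term once and rank tags by cosine similarity.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
query_vec = model.encode("reward delivery", normalize_embeddings=True)
scores = tag_vectors @ query_vec                          # cosine similarity (rows are normalized)
top_tags = [tag_names[i] for i in np.argsort(scores)[::-1][:10]]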

Benefits

Resolves #62 and #63 (Non-Extension Mode)

With semantic search, we can:

  1. Find the closest existing tags for any input term
  2. Provide a --no-extend flag that disables tag extensions entirely
  3. Show the agent which existing tags are semantically closest, making extensions unnecessary (sketched below)
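
A minimal sketch of how points 2 and 3 could combine, assuming a set of valid short-form tags built from the loaded schema(s) and a SemanticMatch result with a tag field; function and variable names are illustrative, not the project's actual API.

def enforce_no_extend(suggested: list[str], schema_tags: set[str]) -> list[str]:
    """Keep only tags that already exist in the loaded schema(s).

    When --no-extend is set, anything outside schema_tags would be an
    extension, so it is swapped for its closest existing tag instead.
    """
    kept = []
    for tag in suggested:
        if tag in schema_tags:
            kept.append(tag)
        else:
            # find_similar is sketched under "Search Algorithm" below.
            closest = find_similar(tag, top_k=1)
            if closest:
                kept.append(closest[0].tag)   # assumes SemanticMatch exposes a .tag field
    return kept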

Resolves #87 (Multi-Schema Search)

With embeddings:

  1. Keep base schema in LLM context for caching
  2. Search library schemas on-demand when relevant terms detected
  3. Pull in relevant library tags without loading entire schemas (see the sketch below)
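
A sketch of the on-demand flow, assuming per-library embedding indexes keyed by prefix and a hypothetical search interface; the threshold and names are illustrative.

from typing import Protocol

class LibraryIndex(Protocol):
    # Hypothetical interface: search returns (short_form_tag, similarity) pairs.
    def search(self, term: str, top_k: int) -> list[tuple[str, float]]: ...

def library_hints(term: str,
                  indexes: dict[str, LibraryIndex],
                  threshold: float = 0.55) -> list[str]:
    """Return prefixed library tags (e.g. "sc:Seizure") worth injecting.

    Base-schema tags stay in the cached system prompt; a library tag is only
    pulled in when the term matches that library's embeddings strongly enough.
    """
    hints = []
    for prefix, index in indexes.items():   # e.g. {"sc:": score_index, "lang:": lang_index}
        for tag, score in index.search(term, top_k=3):
            if score >= threshold:
                hints.append(f"{prefix}{tag}")
    return hints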

Additional Benefits

  1. Faster annotation: Semantic hints reduce LLM reasoning iterations
  2. More consistent: Same input produces same tag suggestions
  3. Domain-aware: Keyword index contains neuroscience-specific mappings
  4. Extensible: Easy to add new library schemas by generating their embeddings

Implementation Plan

Phase 1: Core Infrastructure

  • Port embeddings.ts concepts to Python module (src/utils/semantic_search.py)
  • Create keyword index with neuroscience terminology (start with hed-lsp's KEYWORD_INDEX)
  • Pre-compute embeddings for HED 8.3.0 base schema
  • Add a storage format for embeddings (JSON or NumPy arrays; one possible layout is sketched below)
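
One possible storage layout, assuming NumPy arrays for the vectors plus a JSON sidecar for tag metadata in the same row order; file names and fields are placeholders, not a decided format.

import json
import numpy as np

def save_embeddings(path_prefix: str, tags: list[dict], vectors: np.ndarray) -> None:
    # Vectors go in a compact binary file; metadata (short form, long form,
    # library prefix) in a human-readable JSON sidecar, row-aligned with the vectors.
    np.save(f"{path_prefix}.npy", vectors.astype(np.float32))
    with open(f"{path_prefix}.json", "w", encoding="utf-8") as f:
        json.dump(tags, f, indent=2)

def load_embeddings(path_prefix: str) -> tuple[list[dict], np.ndarray]:
    vectors = np.load(f"{path_prefix}.npy")
    with open(f"{path_prefix}.json", encoding="utf-8") as f:
        tags = json.load(f)
    return tags, vectors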

Phase 2: Integration with Annotation Agent

  • Add semantic search to vocabulary lookup in annotation_agent.py
  • Include "closest tags" hints in the system prompt when user input contains known keywords (see the sketch after this list)
  • Add --no-extend CLI flag to disable extensions entirely
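
A sketch of what such a hint could look like, assuming the find_similar function from the Search Algorithm section and a SemanticMatch with a tag field; the wording is illustrative.

def closest_tag_hint(term: str, top_k: int = 5) -> str:
    matches = find_similar(term, top_k=top_k)
    if not matches:
        return ""
    tags = ", ".join(m.tag for m in matches)
    # Appended to the system prompt only when the user's input contains a
    # known keyword, so the cached base prompt stays unchanged.
    return f'Closest existing HED tags for "{term}": {tags}. Prefer these over creating extensions.'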

Phase 3: Library Schema Support

  • Generate embeddings for library schemas (SCORE, LANG, Mouse)
  • Add schema selection options to API and CLI
  • Dynamic library tag injection based on semantic match

Phase 4: Optimization

  • Quantized model support for faster inference
  • Caching strategy for frequently used queries (sketched below)
  • Performance benchmarks
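
For the query-side cache, a simple LRU over the embedding call may be enough, since annotation runs tend to repeat the same column names and terms; a minimal sketch, assuming the embed helper from the Search Algorithm section:

from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_embed(term: str) -> tuple[float, ...]:
    # lru_cache needs hashable values, so the vector is returned as a tuple;
    # callers can wrap it back into a NumPy array when scoring.
    return tuple(embed(term))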

Technical Details

Embedding Model Options

  1. Qwen3-Embedding-0.6B (as in hed-lsp) - 1024 dimensions, ONNX quantized ~150MB
  2. sentence-transformers alternatives - more Python-native options
  3. OpenAI/Anthropic embeddings - API-based, no local model needed

Data Structures

from dataclasses import dataclass

@dataclass
class TagEmbedding:
    tag: str           # Short form (e.g., "Building")
    long_form: str     # Full path
    prefix: str        # Library prefix (e.g., "sc:" for SCORE)
    vector: list[float]

@dataclass
class KeywordEmbedding:
    keyword: str       # Natural language term
    targets: list[str] # HED tags this keyword points to
    vector: list[float]
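
For illustration, a record for a SCORE library tag might look like this (vector values are placeholders and the long form is left schematic rather than quoted from the SCORE schema):

seizure = TagEmbedding(
    tag="Seizure",
    long_form=".../Seizure",         # full path filled in from the SCORE schema at embedding time
    prefix="sc:",
    vector=[0.012, -0.087, 0.143],   # truncated; real vectors have ~1024 dimensions
)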

Search Algorithm

def find_similar(query: str, top_k: int = 10) -> list[SemanticMatch]:
    # SemanticMatch is assumed to carry at least a tag and a score.
    # 1. Check keyword index first (deterministic)
    keyword_hit = KEYWORD_INDEX.get(query.lower())
    if keyword_hit:
        return [SemanticMatch(tag=tag, score=1.0) for tag in keyword_hit]

    # 2. Embed the query
    query_embedding = embed(query)

    # 3. Search keyword embeddings - collect votes
    keyword_votes = search_keywords(query_embedding)

    # 4. Search tag embeddings directly
    direct_matches = search_tags(query_embedding)

    # 5. Combine evidence - boost tags supported by both sources
    return combine_and_rank(keyword_votes, direct_matches, top_k=top_k)
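
Illustrative calls (actual results depend on the keyword index and embeddings):

find_similar("juice")                # deterministic keyword hit -> Reward, Drink
find_similar("drop of apple juice")  # not a keyword key -> embedding search path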

Acceptance Criteria

  • hedit annotate --no-extend "reward delivery" produces only existing tags
  • hedit annotate --schema "8.3.0,sc:2.0.0" searches both base and SCORE schemas
  • Semantic search suggests relevant tags for domain terms (e.g., "seizure" → sc:Seizure)
  • Annotation consistency improves (same inputs → same tag suggestions)
  • No significant latency increase for annotation

Metadata

Labels: enhancement (New feature or request), priority: high (High priority - important for upcoming release), type: feature (New feature or enhancement)
