Closed
Description
Jina Embedder Critical Issues
Found during code review. These issues affect performance, reliability, and configuration management.
🔴 CRITICAL Issues (Performance/Functionality)
Redundant Encoding Bug - Lines 620-771
- Problem: _chunk_with_context_windows encodes the document N times for N chunks
- Impact: O(N²) complexity instead of O(N), massive performance degradation
- Fix: Encode once, then chunk the embeddings
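A minimal sketch of the "encode once, then chunk the embeddings" fix, assuming the model can return one embedding per token. The names `late_chunk`, `chunk_spans`, and `encode_tokens` are hypothetical and not taken from the embedder code, and mean pooling over each window is an assumption about how chunk vectors are formed.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def chunk_spans(num_tokens: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break  # the last window reached the end of the document
        start += step
    return spans


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(
    tokens: Sequence[int],
    encode_tokens: Callable[[Sequence[int]], List[Vector]],
    chunk_size: int,
    overlap: int,
) -> List[Vector]:
    """Encode the whole document once, then pool token embeddings per chunk."""
    token_embeddings = encode_tokens(tokens)  # the only model call
    spans = chunk_spans(len(tokens), chunk_size, overlap)
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]
```

With this shape the number of model calls is 1 regardless of how many chunks the document produces, which is the property the regression test should assert.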
Config Extraction Broken - Lines 79-85
- Problem: When an EmbeddingConfig is passed, only device, batch_size, and use_fp16 are extracted
- Missing: chunk_size_tokens, chunk_overlap_tokens, model_name (though model_name standardization is intentional)
- Fix: Extract all config fields, especially chunking parameters
Hardcoded Chunk Parameters - Lines 83-84
- Problem: Forces chunk_size_tokens=1000, chunk_overlap_tokens=200
- Impact: Can't tune chunking for different document types
- Fix: Use config values if provided
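One possible shape for the config extraction and hardcoded-parameter fixes: copy every field from the config and fall back to the current 1000/200 values only when a field is unset. The `EmbeddingConfig` dataclass below is a hypothetical stand-in (the real class may have different fields and defaults), and `extract_settings`, `DEFAULTS`, and `PINNED_FIELDS` are illustrative names.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class EmbeddingConfig:
    """Hypothetical stand-in for the real EmbeddingConfig."""
    model_name: str = "some-other-model"
    device: Optional[str] = None
    batch_size: int = 8
    use_fp16: bool = True
    chunk_size_tokens: Optional[int] = None
    chunk_overlap_tokens: Optional[int] = None


# Fallbacks matching the values that are currently hardcoded.
DEFAULTS: Dict[str, Any] = {"chunk_size_tokens": 1000, "chunk_overlap_tokens": 200}

# model_name is intentionally not taken from the config (model is standardized).
PINNED_FIELDS = {"model_name"}


def extract_settings(config: Optional[EmbeddingConfig]) -> Dict[str, Any]:
    """Merge all config fields over the defaults, skipping unset (None) values."""
    settings = dict(DEFAULTS)
    if config is None:
        return settings
    for field in fields(config):
        if field.name in PINNED_FIELDS:
            continue
        value = getattr(config, field.name)
        if value is not None:
            settings[field.name] = value
    return settings
```

Iterating over `fields(config)` instead of naming each attribute means a field added to the config later is picked up without touching the embedder.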
🟡 HIGH Priority Issues
No CUDA Check - Line 98
- Problem: Defaults to 'cuda' without checking torch.cuda.is_available()
- Impact: Crashes on CPU-only systems
- Fix: Check CUDA availability, fall back to CPU
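A sketch of the CPU fallback, assuming the device is passed around as a plain string. `resolve_device` is a hypothetical helper; the `cuda_available` parameter exists only so the logic can be tested without a GPU or without torch installed.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_device(requested: Optional[str] = None,
                   cuda_available: Optional[bool] = None) -> str:
    """Pick a usable device string, falling back to CPU when CUDA is missing."""
    if cuda_available is None:
        try:
            import torch  # imported lazily so the helper works without torch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False

    if requested is None:
        # No explicit request: prefer the GPU only when one is usable.
        return "cuda" if cuda_available else "cpu"

    if requested.startswith("cuda") and not cuda_available:
        logger.warning("Device %r requested but CUDA is unavailable; using CPU",
                       requested)
        return "cpu"
    return requested
```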
Missing Multimodal Guard - Line 359
- Problem: Calls model.encode_image without checking if the method exists
- Impact: AttributeError on text-only models
- Fix: Add a hasattr check or disable multimodal for text-only variants
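A possible guard for the image path. `encode_image_if_supported` is a hypothetical wrapper, and returning `None` for text-only models is an assumption; the caller could just as well raise a clearer error or disable multimodal mode at construction time.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def encode_image_if_supported(model: Any, image: Any) -> Optional[Any]:
    """Call model.encode_image only when the model actually provides it."""
    encode_image = getattr(model, "encode_image", None)
    if not callable(encode_image):
        # Text-only variant: skip instead of raising AttributeError.
        logger.warning("%s has no encode_image; skipping image input",
                       type(model).__name__)
        return None
    return encode_image(image)
```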
🟢 MEDIUM Priority Issues
No Truncation Warning - Line 800
- Problem: Silently truncates documents > 32k tokens
- Fix: Log warning when truncation occurs
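A sketch of the truncation warning, assuming truncation operates on a list of token IDs. `truncate_with_warning` and `doc_id` are hypothetical names, and `MAX_TOKENS = 32_000` is a placeholder for "32k"; the real limit should come from the model configuration.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger(__name__)

MAX_TOKENS = 32_000  # assumption: stands in for the model's "32k" limit


def truncate_with_warning(token_ids: Sequence[int],
                          max_tokens: int = MAX_TOKENS,
                          doc_id: str = "<unknown>") -> List[int]:
    """Truncate to max_tokens and log how many tokens were dropped."""
    total = len(token_ids)
    if total > max_tokens:
        logger.warning(
            "Truncating document %s from %d to %d tokens (%d dropped)",
            doc_id, total, max_tokens, total - max_tokens,
        )
        return list(token_ids[:max_tokens])
    return list(token_ids)
```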
Missing Type Annotations
- Problem: No type hints on most methods
- Impact: Fails mypy strict checks
- Fix: Add comprehensive type annotations
Architecture Context
- Standardizing on Jina v4 is intentional (one model, one RAG architecture)
- 32k context window enables full paper processing
- Late chunking preserves context per HADES framework
Config Integration Needed
- Move chunk parameters to YAML config in /core/config/workflows/
- Create embedder-specific config files
- Ensure CLI args properly reach workers
Related Issues
- Critical: Fix production issues in workflow_arxiv_sorted_simple.py #44 - Workflow critical fixes (storage drain, GPU assignment, etc.)
- Feature/dual embedders late chunking 42 #43 - Chunking edge case fixes
Test Requirements
- Test redundant encoding fix with various document sizes
- Verify config parameter passing
- Test CPU fallback behavior
- Benchmark performance improvements