Skip to content

Fix Jina Embedder Issues and Config Integration #45

@r3d91ll

Description

@r3d91ll

Jina Embedder Critical Issues

Found during code review. These issues affect performance, reliability, and configuration management.

🔴 CRITICAL Issues (Performance/Functionality)

  1. Redundant Encoding Bug - Line 620-771

    • Problem: _chunk_with_context_windows encodes document N times for N chunks
    • Impact: O(N²) complexity instead of O(N), massive performance degradation
    • Fix: Encode once, then chunk the embeddings
  2. Config Extraction Broken - Lines 79-85

    • Problem: When EmbeddingConfig passed, only extracts device/batch_size/use_fp16
    • Missing: chunk_size_tokens, chunk_overlap_tokens, model_name (though model_name standardization is intentional)
    • Fix: Extract all config fields, especially chunking parameters
  3. Hardcoded Chunk Parameters - Lines 83-84

    • Problem: Forces chunk_size_tokens=1000, chunk_overlap_tokens=200
    • Impact: Can't tune chunking for different document types
    • Fix: Use config values if provided

🟡 HIGH Priority Issues

  1. No CUDA Check - Line 98

    • Problem: Defaults to 'cuda' without checking torch.cuda.is_available()
    • Impact: Crashes on CPU-only systems
    • Fix: Check CUDA availability, fallback to CPU
  2. Missing Multimodal Guard - Line 359

    • Problem: Calls model.encode_image without checking if method exists
    • Impact: AttributeError on text-only models
    • Fix: Add hasattr check or disable multimodal for text-only variants

🟢 MEDIUM Priority Issues

  1. No Truncation Warning - Line 800

    • Problem: Silently truncates documents > 32k tokens
    • Fix: Log warning when truncation occurs
  2. Missing Type Annotations

    • Problem: No type hints on most methods
    • Impact: Fails mypy strict checks
    • Fix: Add comprehensive type annotations

Architecture Context

  • Standardizing on Jina v4 is intentional (one model, one RAG architecture)
  • 32k context window enables full paper processing
  • Late chunking preserves context per HADES framework

Config Integration Needed

  • Move chunk parameters to YAML config in /core/config/workflows/
  • Create embedder-specific config files
  • Ensure CLI args properly reach workers

Related Issues

Test Requirements

  • Test redundant encoding fix with various document sizes
  • Verify config parameter passing
  • Test CPU fallback behavior
  • Benchmark performance improvements

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions