[STORY] Fix Temporal Indexing Filename Length Issue #669

@jsbattig

Story: Fix Temporal Indexing Filename Length Issue

As a developer indexing repositories with long file paths
I want to successfully index temporal data without filesystem errors
So that I can search git history for all files regardless of path length


Problem Statement

Temporal indexing fails on repositories with long file paths because most filesystems limit a single filename to 255 characters (255 bytes on ext4, for example). The current implementation creates filenames like:

vector_txt-db:diff:646986fd:TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2.json

When the point_id components combine (project_id + type + commit_hash + file_path + chunk_index), the resulting filename can exceed 255 characters, and creating the file raises an OSError.

Conversation Reference: User discovered this issue when temporal indexing failed on a repository with deeply nested test file paths: "encountered errors when running cidx index --index-commits on a repository with long file paths".


Implementation Status

  • Hash-based filename generation (16-char SHA256 prefix for v2 format)
  • Metadata storage system (point_id-to-hash mapping with full metadata)
  • SQLite index for efficient metadata queries
  • Format detection logic (v2 vs v1 detection via temporal_metadata.db presence)
  • Graceful error handling with clear messages
  • Re-index instructions in error messages for v1 format detection
  • CLI warnings for v1 format detection
  • Web UI status display for temporal indexing health
  • Auto-cleanup on re-index (remove stale metadata)
  • Unit tests for all new components
  • Integration tests for end-to-end temporal indexing

Completion: 0/11 tasks complete (0%)

Conversation Reference: Implementation tasks derived from conversation discussion about "hash-based filename approach with metadata database" and "graceful v1 detection with user-friendly error messages".


Algorithm

Filename Generation (v2 Format):
  generate_vector_filename(point_id):
    # v2 format: Always use hash-based naming for temporal collections
    hash_prefix = sha256(point_id.encode()).hexdigest()[:16]
    RETURN f"vector_{hash_prefix}.json"
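
A minimal Python sketch of the filename generation above. The function name follows the pseudocode; the HASH_PREFIX_LENGTH constant and the sample point_id are illustrative assumptions, not taken from the codebase.

```python
import hashlib

HASH_PREFIX_LENGTH = 16  # hex characters, i.e. 64 bits of the SHA-256 digest


def generate_vector_filename(point_id: str) -> str:
    """Return the v2 filename for a point_id: vector_<16 hex chars>.json."""
    digest = hashlib.sha256(point_id.encode("utf-8")).hexdigest()
    return f"vector_{digest[:HASH_PREFIX_LENGTH]}.json"


# Hypothetical point_id with a very deep path, well beyond 255 characters.
long_point_id = "txt-db:diff:646986fd:" + "deeply/nested/" * 40 + "File.cs:2"
name = generate_vector_filename(long_point_id)
print(len(long_point_id), len(name))
```

The filename length no longer depends on the path length. A 16-character prefix keeps only 64 bits of the digest, so collisions are unlikely but not impossible; how to treat one is not specified in this story.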

Metadata Storage:
  metadata_index = Dict[hash_prefix -> {
    "point_id": original_point_id,
    "commit_hash": commit.hash,
    "file_path": diff_info.file_path,
    "chunk_index": chunk_index,
    "created_at": timestamp
  }]

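  save_metadata(metadata_index):
    # Store as SQLite for efficient queries (same table as under "SQLite Schema")
    CREATE TABLE IF NOT EXISTS temporal_metadata (
      hash_prefix TEXT PRIMARY KEY,
      point_id TEXT NOT NULL UNIQUE,
      commit_hash TEXT,
      file_path TEXT,
      chunk_index INTEGER,
      created_at TEXT,
      format_version INTEGER DEFAULT 2
    )
    INSERT OR REPLACE INTO temporal_metadata
      (hash_prefix, point_id, commit_hash, file_path, chunk_index, created_at)
      VALUES (...)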
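
One way save_metadata could look with Python's built-in sqlite3 module. This is a sketch under assumptions: the tuple layout of `entries`, the UTC ISO timestamp, and the in-memory database are illustrative choices, and the table is the one listed under "SQLite Schema".

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS temporal_metadata (
    hash_prefix TEXT PRIMARY KEY,
    point_id TEXT NOT NULL UNIQUE,
    commit_hash TEXT,
    file_path TEXT,
    chunk_index INTEGER,
    created_at TEXT,
    format_version INTEGER DEFAULT 2
)
"""


def save_metadata(conn, entries):
    """Upsert (hash_prefix, point_id, commit_hash, file_path, chunk_index) tuples."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT OR REPLACE INTO temporal_metadata "
            "(hash_prefix, point_id, commit_hash, file_path, chunk_index, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [(*entry, now) for entry in entries],
        )


conn = sqlite3.connect(":memory:")  # a real store would open temporal_metadata.db
entry = ("a1b2c3d4e5f67890", "txt-db:diff:646986fd:src/Foo.cs:2", "646986fd", "src/Foo.cs", 2)
save_metadata(conn, [entry])
save_metadata(conn, [entry])  # re-saving the same entry replaces the row
row = conn.execute("SELECT point_id, chunk_index, format_version FROM temporal_metadata").fetchone()
count = conn.execute("SELECT COUNT(*) FROM temporal_metadata").fetchone()[0]
print(row, count)
```

Naming the columns in the INSERT lets format_version fall back to its default.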

Format Detection:
  detect_format(collection_path):
    # v2 format detection: Check for temporal_metadata.db presence
    IF (collection_path / "temporal_metadata.db").exists():
      RETURN "v2"
    # Otherwise, v1 format
    RETURN "v1"
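
A possible pathlib version of the detection step. The pseudocode returns "v1" whenever the database is missing; this sketch additionally returns "none" for a collection with no vector files, which is an assumption based on the "v1, v2, or none" states mentioned in Scenario 6. The recursive `vector_*.json` search is also an assumption about the on-disk layout.

```python
import tempfile
from pathlib import Path

METADATA_DB_NAME = "temporal_metadata.db"


def detect_format(collection_path: Path) -> str:
    """Return "v2", "v1", or "none" for a temporal collection directory."""
    if (collection_path / METADATA_DB_NAME).exists():
        return "v2"
    # Assumption: vector files without the metadata db indicate a legacy v1 index.
    if collection_path.is_dir() and any(collection_path.rglob("vector_*.json")):
        return "v1"
    return "none"


with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp)
    empty_state = detect_format(collection)
    (collection / "vector_legacy.json").write_text("{}")
    v1_state = detect_format(collection)
    (collection / METADATA_DB_NAME).touch()
    v2_state = detect_format(collection)

print(empty_state, v1_state, v2_state)  # none v1 v2
```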

V1 Format Error Handling:
  handle_v1_format(collection_path):
    # Detect v1 format and error gracefully
    IF detect_format(collection_path) == "v1":
      LOG ERROR: "Legacy temporal index format (v1) detected"
      DISPLAY: "Re-index with: cidx index --index-commits --reconcile"
      STOP processing without corruption
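
The "stop without corruption" step could be expressed as an exception raised before any read or write. The exception class and function name below are hypothetical; the message text and the re-index command come from this story.

```python
import tempfile
from pathlib import Path

REINDEX_HINT = "Re-index with: cidx index --index-commits --reconcile"


class LegacyTemporalFormatError(RuntimeError):
    """Raised when a collection has no temporal_metadata.db (v1 format)."""


def require_v2_format(collection_path: Path) -> None:
    """Raise before touching the collection if it is not in v2 format."""
    if not (collection_path / "temporal_metadata.db").exists():
        raise LegacyTemporalFormatError(
            f"Legacy temporal index format (v1) detected. {REINDEX_HINT}"
        )


with tempfile.TemporaryDirectory() as tmp:
    try:
        require_v2_format(Path(tmp))  # empty directory: no metadata db
        message = ""
    except LegacyTemporalFormatError as exc:
        message = str(exc)

print(message)
```

Because the check only reads the directory, the existing v1 files are left exactly as they were.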

Auto-Cleanup:
  cleanup_stale_metadata(collection_path, valid_point_ids):
    # Remove metadata entries whose point_id no longer has a corresponding vector file
    FOR hash_prefix, entry IN metadata_index.items():
      IF entry["point_id"] NOT IN valid_point_ids:
        DELETE FROM temporal_metadata WHERE hash_prefix = :hash_prefix
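
A sketch of the cleanup with sqlite3. It compares each row's point_id against the valid set and binds the hash prefix as a query parameter, so only the stale rows are deleted. The two-column table and the sample rows are simplifications for illustration.

```python
import sqlite3


def cleanup_stale_metadata(conn, valid_point_ids):
    """Delete rows whose point_id is not in valid_point_ids; return how many were removed."""
    rows = conn.execute("SELECT hash_prefix, point_id FROM temporal_metadata").fetchall()
    stale = [(hash_prefix,) for hash_prefix, point_id in rows if point_id not in valid_point_ids]
    with conn:  # single transaction for all deletes
        conn.executemany("DELETE FROM temporal_metadata WHERE hash_prefix = ?", stale)
    return len(stale)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temporal_metadata (hash_prefix TEXT PRIMARY KEY, point_id TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO temporal_metadata VALUES (?, ?)",
    [("aaaaaaaaaaaaaaaa", "proj:diff:c1:keep.py:0"), ("bbbbbbbbbbbbbbbb", "proj:diff:c1:gone.py:0")],
)
removed = cleanup_stale_metadata(conn, {"proj:diff:c1:keep.py:0"})
remaining = [r[0] for r in conn.execute("SELECT point_id FROM temporal_metadata")]
print(removed, remaining)
```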

Conversation Reference: Simplified format detection algorithm matches conversation consensus: "Just check for temporal_metadata.db presence - if exists, it's v2; otherwise v1" and "No migration code - let user re-index with --reconcile flag".


Acceptance Criteria

Scenario 1: Index repository with long file paths (v2 format)
  Given a repository with files having paths longer than 200 characters
  When I run temporal indexing with "cidx index --index-commits"
  Then all files should be indexed without OSError exceptions
  And filenames in the temporal collection should use v2 hash-based format
  And all filenames should be under 255 characters
  And temporal_metadata.db should contain mapping for all indexed files

Scenario 2: Hash-based filename generation
  Given a point_id of any length
  When the vector is saved to the filesystem
  Then a 16-character SHA256 hash prefix should be used as the filename
  And the full point_id should be stored in temporal_metadata.db
  And the same point_id should always produce the same hash prefix

Scenario 3: Query v2 format temporal collection
  Given a temporal collection with v2 format vectors
  When I query with "cidx query --time-range-all"
  Then the metadata index should resolve hash prefixes to point_ids
  And query results should include correct file paths and commit info
  And performance should be comparable to non-temporal queries
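
For Scenario 3, resolving a v2 filename back to its point_id might look like the following. The helper name and the reduced two-column table are assumptions made to keep the sketch short.

```python
import sqlite3
from pathlib import Path
from typing import Optional

PREFIX, SUFFIX = "vector_", ".json"


def resolve_point_id(conn: sqlite3.Connection, vector_file: Path) -> Optional[str]:
    """Map vector_<hash_prefix>.json to its original point_id, or None if unknown."""
    name = vector_file.name
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        return None
    hash_prefix = name[len(PREFIX):-len(SUFFIX)]
    row = conn.execute(
        "SELECT point_id FROM temporal_metadata WHERE hash_prefix = ?", (hash_prefix,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temporal_metadata (hash_prefix TEXT PRIMARY KEY, point_id TEXT NOT NULL)")
conn.execute(
    "INSERT INTO temporal_metadata VALUES (?, ?)",
    ("a1b2c3d4e5f67890", "txt-db:diff:646986fd:src/Foo.cs:2"),
)
found = resolve_point_id(conn, Path("vector_a1b2c3d4e5f67890.json"))
missing = resolve_point_id(conn, Path("vector_0000000000000000.json"))
print(found, missing)
```

The lookup uses the primary key, so it stays a single indexed read per file.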

Scenario 4: Detect v1 format and error gracefully
  Given a temporal collection in v1 format (legacy)
  When the system attempts to load the collection
  Then it should detect v1 format (no temporal_metadata.db present)
  And display error: "Legacy temporal index format (v1) detected"
  And provide instructions: "Re-index with: cidx index --index-commits --reconcile"
  And stop processing without corrupting existing data

Scenario 5: Re-indexing with reconcile cleans up properly
  Given a temporal collection in v1 format
  When I run "cidx index --index-commits --reconcile"
  Then the old v1 vector files should be removed
  And new v2 format files should be created
  And temporal_metadata.db should be created with all mappings
  And subsequent queries should work correctly

Scenario 6: Web UI shows temporal indexing status
  Given temporal indexing state (v1, v2, or none)
  When I view the dashboard
  Then I should see temporal index status indicator
  And v1 format should show warning with re-index instructions
  And v2 format should show healthy status with file count

Scenario 7: Graceful filesystem error handling
  Given a filesystem error during vector file creation
  When the error occurs
  Then a clear error message should be logged
  And indexing should fail gracefully without corrupting data
  And partial progress should be preserved where possible
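
One common way to meet Scenario 7 is to write each vector file to a temporary name and rename it into place, so a failure never leaves a half-written file. This is a sketch of that general technique, not the project's implementation; the function name and return convention are assumptions.

```python
import json
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def write_vector_file(path: Path, payload: dict) -> bool:
    """Write payload as JSON via a temp file; return False and log on filesystem errors."""
    tmp_name = None
    try:
        fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # rename within the same directory
        return True
    except OSError as exc:
        logger.error("Failed to write vector file %s: %s", path, exc)
        if tmp_name and os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave partial temp files behind
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "vector_a1b2c3d4e5f67890.json"
    ok = write_vector_file(target, {"vector": [0.1, 0.2]})
    written = json.loads(target.read_text(encoding="utf-8"))
    # The parent directory does not exist, so this write fails with an OSError.
    failed = write_vector_file(Path(tmp) / "missing_dir" / "vector_x.json", {"vector": []})

print(ok, failed)  # True False
```

Files written before the failure remain valid, which is one reading of "partial progress should be preserved".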

Conversation Reference: Scenarios 1-3, 5, 7 derived from conversation discussion about "indexing long paths", "hash-based naming", and "querying v2 format". Scenario 4 added per conversation requirement: "detect v1 and error gracefully with re-index instructions". Scenario 6 added per conversation requirement: "Web UI should show temporal index health status with v1 warnings".


Testing Requirements

Unit Tests

  • test_generate_vector_filename_v2_format: Verify v2 hash-based format always used
  • test_hash_determinism: Same point_id always produces same hash
  • test_metadata_save_load: Round-trip metadata through SQLite
  • test_format_detection_v1: Correctly detect v1 format (no temporal_metadata.db)
  • test_format_detection_v2: Correctly detect v2 format (temporal_metadata.db exists)
  • test_point_id_extraction_from_filename_v2: Extract point_id from v2 hash-based format via metadata

Integration Tests

  • test_temporal_indexing_long_paths: End-to-end indexing with 200+ char paths
  • test_temporal_query_v2_format: Query returns correct results with v2 format
  • test_v1_detection_errors_gracefully: v1 format detection stops with clear error
  • test_reindex_cleans_stale_metadata: Re-indexing removes orphaned metadata

Manual Testing

  • Create test repository with deeply nested directory structure
  • Run cidx index --index-commits and verify success
  • Run cidx query --time-range-all "search term" and verify results
  • Check .code-indexer/index/code-indexer-temporal/ for filename lengths
  • Verify temporal_metadata.db exists and contains correct mappings
  • Test v1 format detection displays correct error message with re-index instructions

Conversation Reference: Testing requirements adjusted per conversation: "Remove backwards compatibility tests", "Remove mixed format tests", "Remove v1 extraction from filename tests - only test v2 metadata-based extraction".


Technical Details

Filename Format Comparison

v1 Format (legacy, will error):

vector_{project_id}:{type}:{commit_hash}:{file_path}:{chunk_index}.json
Example: vector_txt-db:diff:646986fd:TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2.json
Length: 100+ characters, can exceed 255
Status: Detected and rejected with a clear error; requires re-indexing

v2 Format (new, preferred):

vector_{sha256(point_id)[:16]}.json
Example: vector_a1b2c3d4e5f67890.json
Length: Fixed 28 characters
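
The length difference can be checked with a few lines of Python. The first point_id is the example from this story; the second is a made-up deep path, and FILENAME_LIMIT is the 255-character limit quoted above.

```python
import hashlib

FILENAME_LIMIT = 255  # limit quoted in this story


def v1_filename(point_id: str) -> str:
    return f"vector_{point_id}.json"


def v2_filename(point_id: str) -> str:
    prefix = hashlib.sha256(point_id.encode("utf-8")).hexdigest()[:16]
    return f"vector_{prefix}.json"


example_id = "txt-db:diff:646986fd:TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2"
# Hypothetical deeply nested path used only to exceed the limit.
deep_id = "txt-db:diff:646986fd:" + "/".join(["VeryLongDirectoryName"] * 12) + "/File.cs:0"

for point_id in (example_id, deep_id):
    v1_len = len(v1_filename(point_id))
    v2_len = len(v2_filename(point_id))
    print(v1_len, v1_len > FILENAME_LIMIT, v2_len)
```

The fixed length follows from 7 characters for "vector_", 16 for the hash prefix, and 5 for ".json".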

SQLite Schema

CREATE TABLE IF NOT EXISTS temporal_metadata (
    hash_prefix TEXT PRIMARY KEY,
    point_id TEXT NOT NULL UNIQUE,
    commit_hash TEXT,
    file_path TEXT,
    chunk_index INTEGER,
    created_at TEXT,
    format_version INTEGER DEFAULT 2
);

CREATE INDEX IF NOT EXISTS idx_point_id ON temporal_metadata(point_id);
CREATE INDEX IF NOT EXISTS idx_commit_hash ON temporal_metadata(commit_hash);
CREATE INDEX IF NOT EXISTS idx_file_path ON temporal_metadata(file_path);
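
To check that the schema behaves as intended, it can be loaded into an in-memory database and the query plan inspected. This is a sketch; the sample row is invented, and the indexes are created with IF NOT EXISTS here so the script can be run repeatedly against the same database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS temporal_metadata (
    hash_prefix TEXT PRIMARY KEY,
    point_id TEXT NOT NULL UNIQUE,
    commit_hash TEXT,
    file_path TEXT,
    chunk_index INTEGER,
    created_at TEXT,
    format_version INTEGER DEFAULT 2
);
CREATE INDEX IF NOT EXISTS idx_point_id ON temporal_metadata(point_id);
CREATE INDEX IF NOT EXISTS idx_commit_hash ON temporal_metadata(commit_hash);
CREATE INDEX IF NOT EXISTS idx_file_path ON temporal_metadata(file_path);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript(SCHEMA)  # running the schema twice must not fail
conn.execute(
    "INSERT INTO temporal_metadata (hash_prefix, point_id, commit_hash, file_path, chunk_index) "
    "VALUES (?, ?, ?, ?, ?)",
    ("a1b2c3d4e5f67890", "txt-db:diff:646986fd:src/Foo.cs:2", "646986fd", "src/Foo.cs", 2),
)

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT point_id FROM temporal_metadata WHERE commit_hash = ?",
    ("646986fd",),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
version = conn.execute("SELECT format_version FROM temporal_metadata").fetchone()[0]
print(plan_text, version)
```

SQLite already builds an implicit index for a UNIQUE column, so idx_point_id may be redundant; it is kept here to match the schema as written.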

Files to Modify

  1. src/code_indexer/storage/filesystem_vector_store.py

    • Add _generate_vector_filename() method (always v2 format for temporal)
    • Modify upsert_points() to use new filename generation
    • Add metadata storage for v2 format
  2. src/code_indexer/services/temporal/temporal_indexer.py

    • Leave point_id generation unchanged
    • Storage layer handles filename length transparently
  3. New file: src/code_indexer/storage/temporal_metadata_store.py

    • SQLite-based metadata storage
    • Format detection logic (temporal_metadata.db presence check)
    • V1 format error handling

Conversation Reference: File modifications align with conversation design: "Storage layer handles v2 format transparently", "Separate metadata store module for temporal_metadata.db", "No migration code - format detection only errors on v1".


Definition of Done

  • All acceptance criteria satisfied
  • >90% unit test coverage achieved for new code
  • Integration tests passing
  • E2E tests with zero mocking passing
  • Code review approved (tdd-engineer + code-reviewer workflow)
  • Manual end-to-end testing completed by Claude Code
  • No lint/type errors (./lint.sh passes)
  • fast-automation.sh passes
  • Documentation updated (CLAUDE.md if needed)
  • Working software deployable to users

Conversation References

  • Problem Discovery: User reported temporal indexing failure with OSError on long file paths: "encountered errors when running cidx index --index-commits"
  • Solution Design: Hash-based filenames with metadata storage discussed: "16-char SHA256 prefix with SQLite metadata database"
  • No Migration: Explicitly excluded migration code: "No migration - just detect v1 and error with re-index instructions"
  • No Backwards Compatibility: Explicitly excluded v1 support: "Don't support v1 queries - require re-indexing to v2"
  • Format Detection: Simplified to metadata db presence: "Check for temporal_metadata.db - if exists, v2; otherwise v1"
  • Web UI Requirements: Status display discussed: "Web UI should show temporal index health with v1 warnings"
  • CLI Warnings: User-friendly errors required: "Clear error messages with re-index command suggestions"
  • Testing Strategy: Unit tests for hash generation, integration tests for indexing, manual CLI verification, no v1 compatibility testing
