Description
Story: Fix Temporal Indexing Filename Length Issue
As a developer indexing repositories with long file paths
I want to successfully index temporal data without filesystem errors
So that I can search git history for all files regardless of path length
Problem Statement
Temporal indexing fails on repositories with long file paths because common filesystems limit a single filename to 255 characters. The current implementation creates filenames like:
vector_txt-db:diff:646986fd:TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2.json
When point_id components combine (project_id + commit_hash + file_path + chunk_index), filenames can exceed 255 characters, causing OSError exceptions.
Conversation Reference: User discovered this issue when temporal indexing failed on a repository with deeply nested test file paths: "encountered errors when running cidx index --index-commits on a repository with long file paths".
Implementation Status
- Hash-based filename generation (16-char SHA256 prefix for v2 format)
- Metadata storage system (point_id-to-hash mapping with full metadata)
- SQLite index for efficient metadata queries
- Format detection logic (v2 vs v1 detection via temporal_metadata.db presence)
- Graceful error handling with clear messages
- Re-index instructions in error messages for v1 format detection
- CLI warnings for v1 format detection
- Web UI status display for temporal indexing health
- Auto-cleanup on re-index (remove stale metadata)
- Unit tests for all new components
- Integration tests for end-to-end temporal indexing
Completion: 0/11 tasks complete (0%)
Conversation Reference: Implementation tasks derived from conversation discussion about "hash-based filename approach with metadata database" and "graceful v1 detection with user-friendly error messages".
Algorithm
Filename Generation (v2 Format):
generate_vector_filename(point_id):
    # v2 format: Always use hash-based naming for temporal collections
    hash_prefix = sha256(point_id.encode()).hexdigest()[:16]
    RETURN f"vector_{hash_prefix}.json"
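The pseudocode above can be sketched in runnable Python as follows. This is a minimal illustration, not the project's actual implementation: the constant name `HASH_PREFIX_LENGTH` is hypothetical, and the sample `point_id` is assumed to be the v1 example filename without its `vector_` prefix and `.json` suffix.

```python
import hashlib

HASH_PREFIX_LENGTH = 16  # hex characters kept from the SHA-256 digest (hypothetical name)


def generate_vector_filename(point_id: str) -> str:
    """Return a fixed-length, hash-based filename for a temporal vector."""
    digest = hashlib.sha256(point_id.encode("utf-8")).hexdigest()
    return f"vector_{digest[:HASH_PREFIX_LENGTH]}.json"


# Assumed point_id, reconstructed from the v1 example in the problem statement.
point_id = (
    "txt-db:diff:646986fd:"
    "TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2"
)
filename = generate_vector_filename(point_id)

# The length is always 28: len("vector_") + 16 + len(".json").
print(filename, len(filename))
```

Because the name depends only on the digest, its length stays constant no matter how long the file path inside the point_id becomes.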
Metadata Storage:
metadata_index = Dict[hash_prefix -> {
    "point_id": original_point_id,
    "commit_hash": commit.hash,
    "file_path": diff_info.file_path,
    "chunk_index": chunk_index,
    "created_at": timestamp
}]
save_metadata(metadata_index):
    # Store as SQLite for efficient queries
    CREATE TABLE IF NOT EXISTS temporal_metadata (
        hash_prefix TEXT PRIMARY KEY,
        point_id TEXT NOT NULL,
        commit_hash TEXT,
        file_path TEXT,
        chunk_index INTEGER,
        created_at TEXT
    )
    INSERT OR REPLACE INTO temporal_metadata VALUES (...)
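A possible Python rendering of the metadata storage step, using the standard-library `sqlite3` module, is shown below. It is a sketch under assumptions: the helper names `save_metadata` and `resolve_point_id` and their signatures are illustrative, the sample values are made up, and an in-memory database stands in for `temporal_metadata.db`.

```python
import sqlite3
from datetime import datetime, timezone

# Same columns as the pseudocode above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS temporal_metadata (
    hash_prefix TEXT PRIMARY KEY,
    point_id    TEXT NOT NULL,
    commit_hash TEXT,
    file_path   TEXT,
    chunk_index INTEGER,
    created_at  TEXT
)
"""


def save_metadata(conn, hash_prefix, point_id, commit_hash, file_path, chunk_index):
    """Insert or overwrite the metadata row for one vector file."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO temporal_metadata VALUES (?, ?, ?, ?, ?, ?)",
        (hash_prefix, point_id, commit_hash, file_path, chunk_index,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def resolve_point_id(conn, hash_prefix):
    """Map a hash prefix back to the original point_id, or None if unknown."""
    row = conn.execute(
        "SELECT point_id FROM temporal_metadata WHERE hash_prefix = ?",
        (hash_prefix,),
    ).fetchone()
    return row[0] if row else None


# Usage with illustrative values; ":memory:" replaces the on-disk database file.
conn = sqlite3.connect(":memory:")
save_metadata(conn, "a1b2c3d4e5f67890", "txt-db:diff:646986fd:src/Example.cs:2",
              "646986fd", "src/Example.cs", 2)
print(resolve_point_id(conn, "a1b2c3d4e5f67890"))
```

Parameter binding with `?` placeholders keeps file paths containing quotes or colons from breaking the statement.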
Format Detection:
detect_format(collection_path):
    # v2 format detection: Check for temporal_metadata.db presence
    IF (collection_path / "temporal_metadata.db").exists():
        RETURN "v2"
    # Otherwise, v1 format
    RETURN "v1"
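The presence check translates almost directly to `pathlib`. The sketch below assumes the database sits at the top level of the collection directory; the constant name `METADATA_DB_NAME` is hypothetical.

```python
import tempfile
from pathlib import Path

METADATA_DB_NAME = "temporal_metadata.db"


def detect_format(collection_path: Path) -> str:
    """Return "v2" when the metadata database exists, otherwise "v1"."""
    if (Path(collection_path) / METADATA_DB_NAME).exists():
        return "v2"
    return "v1"


# Usage: a temporary directory plays the role of a temporal collection.
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp)
    before = detect_format(collection)          # no database yet
    (collection / METADATA_DB_NAME).touch()
    after = detect_format(collection)           # database now present
print(before, after)
```

One consequence of this rule is that a directory with no index at all also reports "v1", so a caller that needs to tell "legacy" apart from "never indexed" would have to look for vector files as well.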
V1 Format Error Handling:
handle_v1_format(collection_path):
    # Detect v1 format and error gracefully
    IF detect_format(collection_path) == "v1":
        LOG ERROR: "Legacy temporal index format (v1) detected"
        DISPLAY: "Re-index with: cidx index --index-commits --reconcile"
        STOP processing without corruption
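One way to realize "stop processing without corruption" is to raise a dedicated exception before any file is read or written. The following sketch assumes that approach; the exception class `LegacyTemporalFormatError`, the function `ensure_v2_format`, and the logger name are hypothetical, while the two message strings are taken from the story.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("temporal_index")  # hypothetical logger name

V1_MESSAGE = "Legacy temporal index format (v1) detected"
REINDEX_HINT = "Re-index with: cidx index --index-commits --reconcile"


class LegacyTemporalFormatError(RuntimeError):
    """Hypothetical exception raised when a v1 collection is opened."""


def ensure_v2_format(collection_path: Path) -> None:
    """Raise before touching any data if temporal_metadata.db is missing."""
    if not (Path(collection_path) / "temporal_metadata.db").exists():
        logger.error(V1_MESSAGE)
        raise LegacyTemporalFormatError(f"{V1_MESSAGE}\n{REINDEX_HINT}")


# Usage: an empty directory stands in for a legacy collection.
with tempfile.TemporaryDirectory() as tmp:
    try:
        ensure_v2_format(Path(tmp))
        message = ""
    except LegacyTemporalFormatError as exc:
        message = str(exc)
print(message)
```

Raising early leaves the existing v1 files untouched, and the CLI or Web UI layer can catch the exception and show the re-index command.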
Auto-Cleanup:
cleanup_stale_metadata(collection_path, valid_point_ids):
    # Remove metadata entries without corresponding vector files
    valid_hashes = {sha256(p.encode()).hexdigest()[:16] FOR p IN valid_point_ids}
    FOR hash_prefix IN metadata_index.keys():
        IF hash_prefix NOT IN valid_hashes:
            DELETE FROM temporal_metadata WHERE hash_prefix = :hash_prefix
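A runnable sketch of the cleanup step follows. It assumes that `valid_point_ids` holds full point_ids, so each one is hashed before being compared with the stored `hash_prefix` values; the helper `hash_prefix_for`, the reduced two-column table, and the sample point_ids are illustrative only.

```python
import hashlib
import sqlite3


def hash_prefix_for(point_id: str) -> str:
    """16-character SHA-256 prefix, as used for v2 filenames."""
    return hashlib.sha256(point_id.encode("utf-8")).hexdigest()[:16]


def cleanup_stale_metadata(conn: sqlite3.Connection, valid_point_ids) -> int:
    """Delete rows whose hash_prefix matches no currently valid point_id."""
    valid_hashes = {hash_prefix_for(pid) for pid in valid_point_ids}
    stored = [row[0] for row in conn.execute("SELECT hash_prefix FROM temporal_metadata")]
    stale = [(prefix,) for prefix in stored if prefix not in valid_hashes]
    # Bind the prefix as a parameter so only the matching row is deleted.
    conn.executemany("DELETE FROM temporal_metadata WHERE hash_prefix = ?", stale)
    conn.commit()
    return len(stale)


# Usage with a reduced table (only the columns this sketch needs).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temporal_metadata (hash_prefix TEXT PRIMARY KEY, point_id TEXT NOT NULL)"
)
for pid in ("repo:diff:aaaa1111:src/kept.py:0", "repo:diff:bbbb2222:src/removed.py:0"):
    conn.execute("INSERT INTO temporal_metadata VALUES (?, ?)", (hash_prefix_for(pid), pid))

removed = cleanup_stale_metadata(conn, ["repo:diff:aaaa1111:src/kept.py:0"])
remaining = [row[0] for row in conn.execute("SELECT point_id FROM temporal_metadata")]
print(removed, remaining)
```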
Conversation Reference: Simplified format detection algorithm matches conversation consensus: "Just check for temporal_metadata.db presence - if exists, it's v2; otherwise v1" and "No migration code - let user re-index with --reconcile flag".
Acceptance Criteria
Scenario 1: Index repository with long file paths (v2 format)
Given a repository with files having paths longer than 200 characters
When I run temporal indexing with "cidx index --index-commits"
Then all files should be indexed without OSError exceptions
And filenames in the temporal collection should use v2 hash-based format
And all filenames should be under 255 characters
And temporal_metadata.db should contain mapping for all indexed files
Scenario 2: Hash-based filename generation
Given a point_id of any length
When the vector is saved to the filesystem
Then a 16-character SHA256 hash prefix should be used as the filename
And the full point_id should be stored in temporal_metadata.db
And the same point_id should always produce the same hash prefix
Scenario 3: Query v2 format temporal collection
Given a temporal collection with v2 format vectors
When I query with "cidx query --time-range-all"
Then the metadata index should resolve hash prefixes to point_ids
And query results should include correct file paths and commit info
And performance should be comparable to non-temporal queries
Scenario 4: Detect v1 format and error gracefully
Given a temporal collection in v1 format (legacy)
When the system attempts to load the collection
Then it should detect v1 format (no temporal_metadata.db present)
And display error: "Legacy temporal index format (v1) detected"
And provide instructions: "Re-index with: cidx index --index-commits --reconcile"
And stop processing without corrupting existing data
Scenario 5: Re-indexing with reconcile cleans up properly
Given a temporal collection in v1 format
When I run "cidx index --index-commits --reconcile"
Then the old v1 vector files should be removed
And new v2 format files should be created
And temporal_metadata.db should be created with all mappings
And subsequent queries should work correctly
Scenario 6: Web UI shows temporal indexing status
Given temporal indexing state (v1, v2, or none)
When I view the dashboard
Then I should see temporal index status indicator
And v1 format should show warning with re-index instructions
And v2 format should show healthy status with file count
Scenario 7: Graceful filesystem error handling
Given a filesystem error during vector file creation
When the error occurs
Then a clear error message should be logged
And indexing should fail gracefully without corrupting data
And partial progress should be preserved where possible
Conversation Reference: Scenarios 1-3, 5, 7 derived from conversation discussion about "indexing long paths", "hash-based naming", and "querying v2 format". Scenario 4 added per conversation requirement: "detect v1 and error gracefully with re-index instructions". Scenario 6 added per conversation requirement: "Web UI should show temporal index health status with v1 warnings".
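Scenario 7 could be met by writing each vector file atomically and turning filesystem errors into a logged failure. The sketch below is one possible approach rather than the project's method: the function `write_vector_file`, its boolean return convention, and the logger name are assumptions.

```python
import json
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger("temporal_index")  # hypothetical logger name


def write_vector_file(target: Path, payload: dict) -> bool:
    """Write JSON via a temp file and rename; return False on filesystem errors."""
    tmp_path = None
    try:
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
        tmp_path = Path(tmp_name)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, target)  # readers never see a half-written file
        return True
    except OSError as exc:
        logger.error("Failed to write vector file %s: %s", target, exc)
        if tmp_path is not None:
            tmp_path.unlink(missing_ok=True)  # discard the partial temp file
        return False


# Usage: one write succeeds, one fails because the parent directory is missing.
with tempfile.TemporaryDirectory() as tmp:
    good_target = Path(tmp) / "vector_a1b2c3d4e5f67890.json"
    ok = write_vector_file(good_target, {"id": "p1", "vector": [0.1, 0.2]})
    written = json.loads(good_target.read_text(encoding="utf-8"))
    failed = write_vector_file(Path(tmp) / "missing_dir" / "vector.json", {"id": "p2"})
print(ok, failed)
```

Files written before the failure stay intact, which is one way to preserve partial progress.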
Testing Requirements
Unit Tests
- test_generate_vector_filename_v2_format: Verify v2 hash-based format always used
- test_hash_determinism: Same point_id always produces same hash
- test_metadata_save_load: Round-trip metadata through SQLite
- test_format_detection_v1: Correctly detect v1 format (no temporal_metadata.db)
- test_format_detection_v2: Correctly detect v2 format (temporal_metadata.db exists)
- test_point_id_extraction_from_filename_v2: Extract point_id from v2 hash-based format via metadata
Integration Tests
- test_temporal_indexing_long_paths: End-to-end indexing with 200+ char paths
- test_temporal_query_v2_format: Query returns correct results with v2 format
- test_v1_detection_errors_gracefully: v1 format detection stops with clear error
- test_reindex_cleans_stale_metadata: Re-indexing removes orphaned metadata
Manual Testing
- Create test repository with deeply nested directory structure
- Run cidx index --index-commits and verify success
- Run cidx query --time-range-all "search term" and verify results
- Check .code-indexer/index/code-indexer-temporal/ for filename lengths
- Verify temporal_metadata.db exists and contains correct mappings
- Test v1 format detection displays correct error message with re-index instructions
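The filename-length check in the list above can be scripted instead of inspected by eye. The helper below is a hypothetical convenience, not part of cidx; it flags any file whose name is not under the limit, measured in UTF-8 bytes, and is demonstrated on a temporary directory rather than a real index.

```python
import tempfile
from pathlib import Path

MAX_FILENAME_LENGTH = 255


def find_overlong_filenames(root: Path, limit: int = MAX_FILENAME_LENGTH):
    """Return (name, byte_length) for files under root whose name is not under limit."""
    return [
        (path.name, len(path.name.encode("utf-8")))
        for path in Path(root).rglob("*")
        if path.is_file() and len(path.name.encode("utf-8")) >= limit
    ]


# Usage: point the function at .code-indexer/index/code-indexer-temporal/
# in a real checkout; a temporary directory is used here instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vector_a1b2c3d4e5f67890.json").touch()
    overlong = find_overlong_filenames(Path(tmp))            # nothing near the limit
    flagged = find_overlong_filenames(Path(tmp), limit=10)   # artificially low limit
print(overlong, flagged)
```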
Conversation Reference: Testing requirements adjusted per conversation: "Remove backwards compatibility tests", "Remove mixed format tests", "Remove v1 extraction from filename tests - only test v2 metadata-based extraction".
Technical Details
Filename Format Comparison
v1 Format (legacy, will error):
vector_{project_id}:{type}:{commit_hash}:{file_path}:{chunk_index}.json
Example: vector_txt-db:diff:646986fd:TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs:2.json
Length: 100+ characters, can exceed 255
Status: Detected and rejected with a clear error; requires re-indexing
v2 Format (new, preferred):
vector_{sha256(point_id)[:16]}.json
Example: vector_a1b2c3d4e5f67890.json
Length: Fixed 28 characters
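The length figures above can be checked with a few lines of arithmetic. In this sketch the helper `v1_filename` simply fills in the v1 template, and `deep_path` is a made-up path chosen to be long enough to cross the limit.

```python
MAX_FILENAME_LENGTH = 255

# v2 names have a constant length: prefix + 16 hex characters + extension.
V2_LENGTH = len("vector_") + 16 + len(".json")


def v1_filename(project_id, kind, commit_hash, file_path, chunk_index):
    """Build a legacy filename from the v1 template."""
    return f"vector_{project_id}:{kind}:{commit_hash}:{file_path}:{chunk_index}.json"


example_path = "TxtDb.Database.Tests/ConcurrencyTests/TableCachingIssueExposureTests.cs"
deep_path = "/".join(["VeryLongDirectoryName"] * 12) + "/Tests.cs"  # hypothetical

example_length = len(v1_filename("txt-db", "diff", "646986fd", example_path, 2))
deep_length = len(v1_filename("txt-db", "diff", "646986fd", deep_path, 2))

print(V2_LENGTH, example_length, deep_length)
print(deep_length > MAX_FILENAME_LENGTH)
```

The v1 length grows with the file path, while the v2 length does not depend on it at all.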
SQLite Schema
CREATE TABLE IF NOT EXISTS temporal_metadata (
    hash_prefix TEXT PRIMARY KEY,
    point_id TEXT NOT NULL UNIQUE,
    commit_hash TEXT,
    file_path TEXT,
    chunk_index INTEGER,
    created_at TEXT,
    format_version INTEGER DEFAULT 2
);
CREATE INDEX IF NOT EXISTS idx_point_id ON temporal_metadata(point_id);
CREATE INDEX IF NOT EXISTS idx_commit_hash ON temporal_metadata(commit_hash);
CREATE INDEX IF NOT EXISTS idx_file_path ON temporal_metadata(file_path);
Files to Modify
- src/code_indexer/storage/filesystem_vector_store.py
  - Add _generate_vector_filename() method (always v2 format for temporal)
  - Modify upsert_points() to use new filename generation
  - Add metadata storage for v2 format
- src/code_indexer/services/temporal/temporal_indexer.py
  - Leave point_id generation unchanged
  - Storage layer handles filename length transparently
- New file: src/code_indexer/storage/temporal_metadata_store.py
  - SQLite-based metadata storage
  - Format detection logic (temporal_metadata.db presence check)
  - V1 format error handling
Conversation Reference: File modifications align with conversation design: "Storage layer handles v2 format transparently", "Separate metadata store module for temporal_metadata.db", "No migration code - format detection only errors on v1".
Definition of Done
- All acceptance criteria satisfied
- >90% unit test coverage achieved for new code
- Integration tests passing
- E2E tests with zero mocking passing
- Code review approved (tdd-engineer + code-reviewer workflow)
- Manual end-to-end testing completed by Claude Code
- No lint/type errors (./lint.sh passes)
- fast-automation.sh passes
- Documentation updated (CLAUDE.md if needed)
- Working software deployable to users
Conversation References
- Problem Discovery: User reported temporal indexing failure with OSError on long file paths: "encountered errors when running cidx index --index-commits"
- Solution Design: Hash-based filenames with metadata storage discussed: "16-char SHA256 prefix with SQLite metadata database"
- No Migration: Explicitly excluded migration code: "No migration - just detect v1 and error with re-index instructions"
- No Backwards Compatibility: Explicitly excluded v1 support: "Don't support v1 queries - require re-indexing to v2"
- Format Detection: Simplified to metadata db presence: "Check for temporal_metadata.db - if exists, v2; otherwise v1"
- Web UI Requirements: Status display discussed: "Web UI should show temporal index health with v1 warnings"
- CLI Warnings: User-friendly errors required: "Clear error messages with re-index command suggestions"
- Testing Strategy: Unit tests for hash generation, integration tests for indexing, manual CLI verification, no v1 compatibility testing