Closed
Master Core Infrastructure Restructure
Executive Summary
This epic restructures the core/ directory to eliminate redundancy, establish a clear separation of concerns, and create a sustainable architecture for future development. The work is planned as a 21-day sprint and must preserve all existing functionality.
Objectives
- Eliminate Redundancy: Remove the core/framework/ layer (everything in core IS the framework)
- Clear Separation: Establish distinct modules with single responsibilities
- Consistent Naming: All files prefixed with parent directory name
- Centralized Configuration: Single source of truth for all settings
- Validation Through Rebuild: Use database rebuild as integration test
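The naming objective above could be checked mechanically. The following is a minimal sketch, assuming the convention means a file such as core/embedders/embedders_jina.py; the exact prefix form (singular or plural, underscore separator) is an assumption, so the check accepts both `workflows_` and `workflow_` style prefixes. The function names are hypothetical.

```python
from pathlib import Path, PurePosixPath


def violates_prefix_rule(path: str) -> bool:
    """Return True if a file name is not prefixed with its parent directory name.

    Assumed convention (not confirmed by this issue): the prefix is the
    directory name, or its singular form, followed by an underscore.
    __init__.py files are exempt.
    """
    p = PurePosixPath(path)
    if p.name == "__init__.py":
        return False
    parent = p.parent.name
    prefixes = (parent + "_", parent.removesuffix("s") + "_")
    return not p.name.startswith(prefixes)


def find_violations(root: str) -> list[str]:
    """List Python files under root that break the assumed naming rule."""
    base = Path(root)
    return sorted(
        f.relative_to(base).as_posix()
        for f in base.rglob("*.py")
        if violates_prefix_rule(f.as_posix())
    )
```

Running `find_violations("core")` before and after each phase would give a shrinking list of files still to rename.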
Current State
Baseline Metrics (2025-01-13)
- Papers Processed: 2,823,744
- Embeddings Generated: 2,823,744 (100% coverage)
- Processing Speed: 48.3 papers/second
- Total Time: 399.1 minutes for 1.16M papers
- Workers: 2 GPU workers, batch size 80
- Model: jinaai/jina-embeddings-v4 (2048 dimensions)
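As a quick sanity check on the figures above, the reported rate and run time can be multiplied out. The sketch below only does that arithmetic: 48.3 papers/second over 399.1 minutes is roughly 1.16 million papers, which matches the "1.16M" figure, so the timed run appears to cover a subset rather than all 2,823,744 papers. The function names are hypothetical, and the full-corpus time is an extrapolation, not a measurement.

```python
def papers_processed(rate_per_sec: float, minutes: float) -> float:
    """Papers handled at a constant rate over a given number of minutes."""
    return rate_per_sec * minutes * 60


def minutes_needed(papers: int, rate_per_sec: float) -> float:
    """Minutes required to process a paper count at a constant rate."""
    return papers / rate_per_sec / 60


# Baseline run: roughly 1.16 million papers
run_papers = papers_processed(48.3, 399.1)

# Extrapolation only: time for the full corpus at the same rate
full_minutes = minutes_needed(2_823_744, 48.3)

print(f"{run_papers:,.0f} papers in the timed run")
print(f"{full_minutes:,.1f} minutes for the full corpus at the same rate")
```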
Issues Identified
- core/framework/ is redundant - everything in core IS the framework
- core/graph/ contains 12+ duplicate edge builders
- Configuration scattered across multiple systems
- Storage functionality split between multiple locations
- Monitoring inconsistently implemented
New Architecture
core/
├── embedders/          # All embedding models (Jina, VLLM, etc.)
├── extractors/         # Document extraction (Docling, PDFPlumber, etc.)
├── processors/         # Processing pipelines
│   ├── text/           # Text processing (chunking, cleaning)
│   ├── structured/     # Tables, equations, images
│   └── code/           # Tree-sitter, symbol extraction
├── database/           # Database interfaces
│   ├── arango/         # ArangoDB client and operations
│   └── postgres/       # PostgreSQL operations
├── workflows/          # Orchestration and pipelines
│   ├── storage/        # Storage backends (S3, local, etc.)
│   └── state/          # State management
├── monitoring/         # Metrics and observability
├── config/             # Centralized configuration
├── utils/              # True utilities only
└── mcp_server/         # MCP interface (unchanged)
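To illustrate what a "single source of truth" under config/ could look like, here is a minimal sketch that gathers the baseline embedding settings in one frozen dataclass. The class name `EmbeddingConfig`, its field names, and `load_config` are hypothetical; only the values (model, dimensions, worker count, batch size) come from the baseline metrics in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical central settings object; defaults mirror the baseline run."""

    model_name: str = "jinaai/jina-embeddings-v4"
    embedding_dim: int = 2048
    gpu_workers: int = 2
    batch_size: int = 80

    def __post_init__(self) -> None:
        # Reject obviously invalid settings at construction time.
        if self.gpu_workers < 1 or self.batch_size < 1:
            raise ValueError("gpu_workers and batch_size must be positive")


def load_config(overrides: Optional[dict] = None) -> EmbeddingConfig:
    """Build the config, applying any overrides in exactly one place."""
    return EmbeddingConfig(**(overrides or {}))
```

Because the dataclass is frozen, modules that receive the config cannot mutate it, which keeps every consumer reading the same values.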
Sub-Issues (To Be Created)
Phase 1: Foundation (Days 1-3)
- Master Core Infrastructure Restructure #35 - Move embedders from framework/ to embedders/
- feat: Phase 1 - Core Restructure: Embedders and Extractors Migration #36 - Move extractors from framework/ to extractors/
- Add docstrings to feature/core-restructure-phase1 #37 - Reorganize processors/ with subdirectories
Phase 2: Database & Workflows (Days 4-7)
- feat: Phase 2 - Database and Workflows Consolidation #38 - Consolidate database interfaces
- feat: Phase 3 - Configuration and Monitoring Consolidation (Issue #35) #39 - Create workflows/ with storage integration
- feat: Phase 4 - Integration and Testing (Issue #35) #40 - Implement centralized configuration
Phase 3: Monitoring & Utils (Days 8-10)
- Feature/core restructure phase5 e2e #41 - Consolidate monitoring from all modules
- feat: Implement dual embedders with mandatory late chunking #42 - Clean utils/ to true utilities only
Phase 4: Integration & Testing (Days 11-15)
- Feature/dual embedders late chunking 42 #43 - Update all imports across codebase
- Critical: Fix production issues in workflow_arxiv_sorted_simple.py #44 - Run integration tests
- Fix Jina Embedder Issues and Config Integration #45 - Rebuild database for validation
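The "update all imports" step in this phase lends itself to a scripted rewrite. The sketch below is one possible approach, assuming the old module paths were `core.framework.embedders` and `core.framework.extractors`; those paths, the mapping, and the function name are assumptions inferred from the directory moves described in Phase 1, not taken from the codebase.

```python
import re

# Assumed old -> new module prefixes (hypothetical, inferred from the plan).
MODULE_MOVES = {
    "core.framework.embedders": "core.embedders",
    "core.framework.extractors": "core.extractors",
}

# Match any old prefix as a whole dotted name segment.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(old) for old in MODULE_MOVES) + r")\b"
)


def rewrite_imports(source: str) -> str:
    """Rewrite old module paths on import lines only; leave other lines as-is."""
    out = []
    for line in source.splitlines(keepends=True):
        if line.lstrip().startswith(("import ", "from ")):
            line = _PATTERN.sub(lambda m: MODULE_MOVES[m.group(1)], line)
        out.append(line)
    return "".join(out)
```

Restricting the substitution to import lines avoids touching comments and strings, at the cost of missing dynamic imports (for example via `importlib`), which would still need a manual search.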
Phase 5: Documentation & Cleanup (Days 16-18)
- Remove redundant SentenceTransformersEmbedder and standardize on JinaV4 #46 - Update all documentation
- fix: Issues #44, #45, #46 - Critical embedder and workflow fixes #47 - Archive old code to Acheron/
- Add docstrings to feature/issue-46-remove-sentence-transformers #48 - Update CLAUDE.md
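Archiving old code to Acheron/ with timestamps could be done with a small helper like the one below. This is a sketch under assumptions: the timestamp format, the naming scheme (`name_timestamp`), and both function names are hypothetical, since the issue only says that archived code should carry timestamps.

```python
from datetime import datetime, timezone
from pathlib import Path
import shutil


def archive_path(src: Path, archive_root: Path, now: datetime) -> Path:
    """Compute the timestamped destination for a file or directory."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # Keep any file extension at the end so archived files stay recognizable.
    return archive_root / f"{src.stem}_{stamp}{src.suffix}"


def archive(src: Path, archive_root: Path = Path("Acheron")) -> Path:
    """Move src into the archive directory under a UTC-timestamped name."""
    dest = archive_path(src, archive_root, datetime.now(timezone.utc))
    archive_root.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest
```

Separating the path computation from the move keeps the naming rule testable without touching the filesystem.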
Phase 6: Final Validation (Days 19-21)
- Feature: Implement gRPC Database Service Layer #49 - Compare metrics with baseline
- Feature: Go gRPC Memory Service for Local AI Models #50 - Performance benchmarking
- feat: Implement HTTP/2 optimized ArangoDB client with neural process isolation #51 - Sign-off and merge
Success Criteria
- Functional Validation
  - Database rebuild produces identical results (2,823,744 papers)
  - Maintain or exceed 48.3 papers/second throughput
  - All tests pass
- Architectural Goals
  - No redundant layers (framework/ eliminated)
  - Clear module boundaries
  - Consistent naming throughout
  - Single configuration source
- Code Quality
  - All imports updated and working
  - Documentation complete
  - Old code archived with timestamps
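The functional validation criteria above can be expressed as an automated check run after the database rebuild. The following is a minimal sketch: the `RebuildResult` structure and `check_against_baseline` function are hypothetical, and how the rebuild actually reports its numbers is not specified in this issue. The thresholds are the baseline figures quoted here.

```python
from dataclasses import dataclass

# Baseline figures from this issue (2025-01-13 snapshot).
BASELINE_PAPERS = 2_823_744
BASELINE_PAPERS_PER_SEC = 48.3


@dataclass
class RebuildResult:
    """Hypothetical summary of one database rebuild."""

    papers: int
    embeddings: int
    papers_per_second: float


def check_against_baseline(result: RebuildResult) -> list[str]:
    """Return a list of failed criteria; an empty list means the rebuild passes."""
    failures = []
    if result.papers != BASELINE_PAPERS:
        failures.append(
            f"paper count {result.papers} != baseline {BASELINE_PAPERS}"
        )
    if result.embeddings != result.papers:
        failures.append("embedding coverage is below 100%")
    if result.papers_per_second < BASELINE_PAPERS_PER_SEC:
        failures.append(
            f"throughput {result.papers_per_second} < baseline "
            f"{BASELINE_PAPERS_PER_SEC} papers/second"
        )
    return failures
```

Returning all failures at once, rather than stopping at the first, gives a complete picture of what regressed in a single rebuild.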
Out of Scope
- Graph Module Refactor: Deferred until after validation (too complex for this sprint)
- New Features: Focus only on restructuring existing functionality
- Algorithm Changes: Maintain exact same processing logic
Reference Documents
- Master PRD: /docs/prd/core_restructure/master-core-restructure-prd.md
- Baseline Snapshot: /docs/baseline/baseline-2025-01-13.md
- Database Snapshot: /tools/arxiv/utils/baseline-snapshot.json
Related Issues
- Supersedes feat: Implement Centralized Configuration System #34 (Configuration System PRD)
- References baseline commit: cc85ff2
Timeline
- Start Date: January 13, 2025
- End Date: February 3, 2025
- Duration: 21 days
Definition of Done
- All sub-issues completed
- Database rebuild matches baseline (2,823,744 papers)
- Performance meets or exceeds baseline (48.3 papers/sec)
- All tests passing
- Documentation updated
- Old code archived to Acheron/
- CLAUDE.md updated
- PR approved and merged to main
Note: Graph module refactor (#52) will be created as a separate epic AFTER this restructure is validated.