🏗️ Master Core Infrastructure Restructure #35

@r3d91ll

🏗️ Master Core Infrastructure Restructure

Executive Summary

Complete restructuring of the core/ directory to eliminate redundancy, establish clear separation of concerns, and create a sustainable architecture for future development. The work is planned as a 21-day sprint and must preserve all existing functionality.

🎯 Objectives

  1. Eliminate Redundancy: Remove core/framework/ layer (everything in core IS the framework)
  2. Clear Separation: Establish distinct modules with single responsibilities
  3. Consistent Naming: All files prefixed with parent directory name
  4. Centralized Configuration: Single source of truth for all settings
  5. Validation Through Rebuild: Use database rebuild as integration test
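The naming objective (item 3) can be checked mechanically. The following is a minimal sketch, assuming the convention means `<parent_dir>_<name>.py` and that `__init__.py` files and files directly under core/ are exempt; the issue does not state either detail, and `find_misnamed_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_misnamed_files(core_root: Path) -> list[Path]:
    """Return .py files whose name does not start with '<parent_dir>_'."""
    misnamed = []
    for path in sorted(core_root.rglob("*.py")):
        if path.name == "__init__.py":
            continue  # assumption: package markers are exempt
        if path.parent == core_root:
            continue  # assumption: files directly under core/ are exempt
        prefix = path.parent.name + "_"
        if not path.name.startswith(prefix):
            misnamed.append(path)
    return misnamed


if __name__ == "__main__":
    for offender in find_misnamed_files(Path("core")):
        print(f"not prefixed with parent directory name: {offender}")
```

Run from the repository root, this prints one line per file that breaks the convention and prints nothing when every file conforms.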

📋 Current State

Baseline Metrics (2025-01-13)

  • Papers Processed: 2,823,744
  • Embeddings Generated: 2,823,744 (100% coverage)
  • Processing Speed: 48.3 papers/second
  • Total Time: 399.1 minutes for 1.16M papers
  • Workers: 2 GPU workers, batch size 80
  • Model: jinaai/jina-embeddings-v4 (2048 dimensions)
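As a sanity check on these figures, the reported speed can be re-derived from the timed run. This sketch uses only the numbers listed above; note that "1.16M" is rounded, and the issue does not say how the 1.16M-paper timed run relates to the 2,823,744 total, so the projected full-rebuild time is an extrapolation and not a measured value.

```python
PAPERS_IN_TIMED_RUN = 1_160_000  # "1.16M papers", rounded in the baseline
TOTAL_MINUTES = 399.1            # reported total time for that run
REPORTED_RATE = 48.3             # reported papers/second
TOTAL_PAPERS = 2_823_744         # reported papers processed


def papers_per_second(papers: int, minutes: float) -> float:
    """Average throughput over a run of the given length."""
    return papers / (minutes * 60)


def hours_at_rate(papers: int, rate: float) -> float:
    """Wall-clock hours needed to process `papers` at a constant rate."""
    return papers / rate / 3600


derived_rate = papers_per_second(PAPERS_IN_TIMED_RUN, TOTAL_MINUTES)
projected_hours = hours_at_rate(TOTAL_PAPERS, REPORTED_RATE)

print(f"derived rate: {derived_rate:.1f} papers/second")
print(f"projected full rebuild at reported rate: {projected_hours:.1f} hours")
```

The derived rate lands within rounding distance of the reported 48.3 papers/second, so the speed and total-time figures are mutually consistent.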

Issues Identified

  • core/framework/ is redundant - everything in core IS the framework
  • core/graph/ contains 12+ duplicate edge builders
  • Configuration scattered across multiple systems
  • Storage functionality split between multiple locations
  • Monitoring inconsistently implemented
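To illustrate what a single configuration source could look like in place of the scattered systems, here is a minimal sketch. The default values come from the baseline metrics above; the class name, the `get_config` accessor, and the `CORE_*` environment variable names are all hypothetical and not taken from the codebase.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class EmbeddingConfig:
    """Immutable settings; defaults mirror the baseline run."""
    model_name: str = "jinaai/jina-embeddings-v4"
    embedding_dim: int = 2048
    gpu_workers: int = 2
    batch_size: int = 80


@lru_cache(maxsize=1)
def get_config() -> EmbeddingConfig:
    """Build the config once; every caller receives the same object."""
    defaults = EmbeddingConfig()
    return EmbeddingConfig(
        # Hypothetical environment overrides, read in exactly one place.
        model_name=os.environ.get("CORE_MODEL_NAME", defaults.model_name),
        embedding_dim=int(os.environ.get("CORE_EMBEDDING_DIM", defaults.embedding_dim)),
        gpu_workers=int(os.environ.get("CORE_GPU_WORKERS", defaults.gpu_workers)),
        batch_size=int(os.environ.get("CORE_BATCH_SIZE", defaults.batch_size)),
    )
```

Because the dataclass is frozen and the accessor is cached, modules cannot drift apart by mutating their own copy of a setting.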

🏛️ New Architecture

core/
├── embedders/          # All embedding models (Jina, VLLM, etc.)
├── extractors/         # Document extraction (Docling, PDFPlumber, etc.)
├── processors/         # Processing pipelines
│   ├── text/           # Text processing (chunking, cleaning)
│   ├── structured/     # Tables, equations, images
│   └── code/           # Tree-sitter, symbol extraction
├── database/           # Database interfaces
│   ├── arango/         # ArangoDB client and operations
│   └── postgres/       # PostgreSQL operations
├── workflows/          # Orchestration and pipelines
│   ├── storage/        # Storage backends (S3, local, etc.)
│   └── state/          # State management
├── monitoring/         # Metrics and observability
├── config/             # Centralized configuration
├── utils/              # True utilities only
└── mcp_server/         # MCP interface (unchanged)
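One mechanical consequence of removing the framework/ layer is that import paths change. The sketch below rewrites `core.framework...` imports to `core...`, assuming each module keeps its name and simply moves up one level. That is an assumption: some modules may land elsewhere (for example under workflows/), and those would need an explicit mapping. The function name and the module names in the usage example are hypothetical.

```python
import re

# Matches "from core.framework..." / "import core.framework..." at the start
# of a line, but not look-alikes such as "core.frameworks".
_FRAMEWORK_IMPORT = re.compile(
    r"^(\s*(?:from|import)\s+)core\.framework(?=[.\s]|$)",
    re.MULTILINE,
)


def drop_framework_layer(source: str) -> str:
    """Rewrite import statements so they skip the removed framework/ layer."""
    return _FRAMEWORK_IMPORT.sub(r"\1core", source)


if __name__ == "__main__":
    before = "from core.framework.embedders import JinaEmbedder\n"
    print(drop_framework_layer(before), end="")
```

Only statements that begin with `from` or `import` are touched, so string literals and comments mentioning the old path are left alone. Comma-separated imports such as `import os, core.framework.x` are not handled by this sketch.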

📊 Sub-Issues (To Be Created)

Phase 1: Foundation (Days 1-3)

Phase 2: Database & Workflows (Days 4-7)

Phase 3: Monitoring & Utils (Days 8-10)

Phase 4: Integration & Testing (Days 11-15)

Phase 5: Documentation & Cleanup (Days 16-18)

Phase 6: Final Validation (Days 19-21)

✅ Success Criteria

  1. Functional Validation

    • Database rebuild produces identical results (2,823,744 papers)
    • Maintain or exceed 48.3 papers/second throughput
    • All tests pass
  2. Architectural Goals

    • No redundant layers (framework/ eliminated)
    • Clear module boundaries
    • Consistent naming throughout
    • Single configuration source
  3. Code Quality

    • All imports updated and working
    • Documentation complete
    • Old code archived with timestamps
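The functional-validation criteria can be expressed as an automated check. This is a sketch under stated assumptions: `RebuildResult` and `validate_rebuild` are hypothetical names, and the check only compares counts and throughput, whereas "identical results" may also require comparing the stored documents themselves.

```python
from dataclasses import dataclass

BASELINE_PAPERS = 2_823_744  # papers in the baseline
BASELINE_RATE = 48.3         # papers/second in the baseline


@dataclass(frozen=True)
class RebuildResult:
    """Hypothetical summary of one database rebuild."""
    papers: int
    embeddings: int
    papers_per_second: float


def validate_rebuild(result: RebuildResult) -> list[str]:
    """Return a list of failed criteria; an empty list means the rebuild passes."""
    failures = []
    if result.papers != BASELINE_PAPERS:
        failures.append(f"paper count {result.papers} != baseline {BASELINE_PAPERS}")
    if result.embeddings != result.papers:
        failures.append("embedding coverage is below 100%")
    if result.papers_per_second < BASELINE_RATE:
        failures.append(f"throughput {result.papers_per_second} < baseline {BASELINE_RATE}")
    return failures
```

Returning every failed criterion, and not stopping at the first, makes a single rebuild run report all regressions at once.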

🚫 Out of Scope

  • Graph Module Refactor: Deferred until after validation (too complex for this sprint)
  • New Features: Focus only on restructuring existing functionality
  • Algorithm Changes: Maintain exact same processing logic

Reference Documents

  • Master PRD: /docs/prd/core_restructure/master-core-restructure-prd.md
  • Baseline Snapshot: /docs/baseline/baseline-2025-01-13.md
  • Database Snapshot: /tools/arxiv/utils/baseline-snapshot.json

🔄 Related Issues

📅 Timeline

  • Start Date: January 13, 2025
  • End Date: February 3, 2025
  • Duration: 21 days

🎯 Definition of Done

  • All sub-issues completed
  • Database rebuild matches baseline (2,823,744 papers)
  • Performance meets or exceeds baseline (48.3 papers/sec)
  • All tests passing
  • Documentation updated
  • Old code archived to Acheron/
  • CLAUDE.md updated
  • PR approved and merged to main
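For the archiving item, a timestamped move into Acheron/ could look like the sketch below. The UTC timestamp format and the `archive_path` helper are assumptions made for illustration; the issue only says old code is archived with timestamps.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def archive_path(source: Path, archive_root: Path,
                 now: Optional[datetime] = None) -> Path:
    """Move `source` into `archive_root`, suffixing its name with a timestamp."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")  # assumed format
    destination = archive_root / f"{source.name}_{stamp}"
    archive_root.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(destination))
    return destination


if __name__ == "__main__":
    old_layer = Path("core/framework")
    if old_layer.exists():
        print(f"archived to {archive_path(old_layer, Path('Acheron'))}")
```

Passing `now` explicitly keeps the function deterministic in tests, and the timestamp suffix means repeated archiving of the same directory name does not overwrite an earlier copy.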

Note: Graph module refactor (#52) will be created as a separate epic AFTER this restructure is validated.
