🏛️ Legacy Code Archaeologist

"Turn spaghetti code into a treasure map."

Python 3.11+ · Tree-sitter · AI Powered

Legacy Code Archaeologist is a CLI tool designed to audit, map, and analyze complex legacy codebases. It combines static analysis (Tree-sitter) with semantic AI analysis (LLMs) to generate an interactive HTML Knowledge Graph.

It answers the question: "Which files are the 'God Objects' that break everything when I touch them?"


✨ Features

  • 🗺️ Interactive Visualization: Generates a Mermaid.js graph showing file dependencies.
  • 🤖 AI Risk Scoring: Uses OpenAI (GPT-4) to read code, summarize business logic, and assign a Complexity Score (1-10).
  • ⚡ Smart Caching: Hashes file contents (MD5) and caches results in SQLite so you never pay to analyze the same file twice.
  • 🌳 Robust Parsing: Uses Tree-sitter instead of Regex, so it understands code structure even if the syntax is messy.
  • 🐳 Containerized: Ready to run via Docker to avoid dependency hell.

🚀 Quick Start (Docker)

This is the easiest way to run the tool without compiling C dependencies manually.

  1. Clone the repository:

    git clone https://github.com/yourusername/legacy-archaeologist.git
    cd legacy-archaeologist
  2. Set your API Key: Create a .env file in the root:

    echo "OPENAI_API_KEY=sk-your-api-key-here" > .env
  3. Run the Audit: Map your target project to the /codebase volume.

    # Update the path below to point to the project you want to analyze
    docker-compose run --rm archeologist audit /codebase --output reports/audit.html

    (Note: Ensure your docker-compose.yml mounts the volume correctly as defined in the blueprints).
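
    For reference, a docker-compose.yml along these lines would satisfy that volume mapping. This is a minimal sketch only; the service name matches the command above, but the build layout and reports path are assumptions, not the project's actual blueprint.

    services:
      archeologist:
        build: .
        env_file: .env
        volumes:
          # Mount the project you want to analyze at /codebase (read-only is enough)
          - /path/to/your/legacy/project:/codebase:ro
          # Assumed location for generated reports inside the container
          - ./reports:/app/reports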


🛠️ Local Installation (Python)

If you prefer running it natively, you will need Python 3.11+ and a C compiler (GCC/Clang) for Tree-sitter.

1. Environment Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configuration

Create a .env file in the root directory:

OPENAI_API_KEY=sk-your-openai-key

If no API key is provided, the tool runs in "Offline Mode" (Structure only, no Summaries).
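
A minimal sketch of that fallback, assuming the key is read from the environment; the function and the call_openai helper are illustrative, not the project's actual API:

import os

def analyze_file(skeleton: str) -> dict:
    """Return the AI analysis if a key is configured, otherwise a structure-only stub."""
    if not os.getenv("OPENAI_API_KEY"):
        # Offline Mode: keep the structural data, skip summaries and scoring
        return {"summary": None, "risk_score": None, "tags": []}
    return call_openai(skeleton)  # hypothetical helper wrapping the OpenAI request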

3. Usage

# Basic Audit
python main.py audit /path/to/target/project

# Specify Output Location
python main.py audit ./my-project --output ./results/map.html

📊 Interpreting the Report

Open the generated HTML file in your browser.

The Graph (Left Panel)

  • Nodes: Represent source files.
  • Arrows: Represent imports/dependencies.
  • Colors:
    • 🔴 Red (Danger): High Risk (Score 8-10). Complex logic, high coupling.
    • 🟠 Orange (Warning): Moderate Risk (Score 5-7).
    • 🟢 Green (Safe): Low Risk (Score 1-4). Simple utilities or interfaces.
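
The thresholds above could be applied when emitting Mermaid nodes roughly as follows. This is a sketch; only the score ranges come from the report, while the helper and class names are hypothetical.

import re

def risk_class(score: int) -> str:
    """Map a 1-10 risk score to the report's color buckets."""
    if score >= 8:
        return "danger"   # red
    if score >= 5:
        return "warning"  # orange
    return "safe"         # green

def mermaid_node(path: str, score: int) -> str:
    """Emit a Mermaid node, sanitizing the ID so path characters don't break the graph."""
    node_id = re.sub(r"[^A-Za-z0-9_]", "_", path)
    return f'{node_id}["{path}"]:::{risk_class(score)}'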

The Cards (Right Panel)

  • Summary: A 1-sentence explanation of what the code actually does (generated by AI).
  • Tags: Keywords like Auth, Database, Legacy, API.
  • Metrics: Function count and Import count.
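
Each card is rendered from one analysis record; an illustrative example is shown below (the field names follow the report, but the exact keys and values are made up):

card = {
    "file": "billing/invoice_manager.py",  # example path, not real output
    "summary": "Builds and persists customer invoices from order data.",
    "tags": ["Database", "Legacy"],
    "risk_score": 8,
    "function_count": 23,
    "import_count": 11,
}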

🏗️ Architecture

The tool follows a pipeline architecture:

  1. FileWalker (core/file_walker.py): Recursively scans the directory, intelligently ignoring .git, node_modules, and venv.
  2. Parser (core/parser_engine.py): Uses Tree-sitter to extract the Concrete Syntax Tree (CST). It identifies classes, functions, and imports.
  3. Cache Check (core/cache_manager.py): Calculates the MD5 hash of the file content. If that hash already exists in archeology_cache.db, the cached analysis is loaded locally instead of calling the API (see the sketch after this list).
  4. AI Analyst (ai/summarizer.py): If not cached, sends the code "Skeleton" to OpenAI. The prompt forces a structured JSON response containing the Risk Score and Summary.
  5. Graph Builder (core/graph_builder.py): Compiles the nodes and edges into Mermaid syntax, handling ID sanitization to prevent graph breakage.
  6. Reporter (main.py): Injects the Mermaid Syntax and HTML Cards into templates/report_template.html.
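
Step 3, the cache check, might look roughly like this. It is a minimal sketch assuming a single-table SQLite schema; the table, column, and function names are illustrative, not the project's actual implementation.

import hashlib
import json
import sqlite3

def file_hash(source: str) -> str:
    """MD5 of the file content; unchanged content is never re-analyzed."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()

def get_cached(db_path: str, digest: str):
    """Return the cached analysis for this hash, or None on a cache miss."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS analysis (hash TEXT PRIMARY KEY, payload TEXT)")
    row = con.execute("SELECT payload FROM analysis WHERE hash = ?", (digest,)).fetchone()
    con.close()
    return json.loads(row[0]) if row else None

def store(db_path: str, digest: str, payload: dict) -> None:
    """Persist a fresh AI result so the next run costs nothing."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS analysis (hash TEXT PRIMARY KEY, payload TEXT)")
    con.execute("INSERT OR REPLACE INTO analysis (hash, payload) VALUES (?, ?)",
                (digest, json.dumps(payload)))
    con.commit()
    con.close()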

❓ Troubleshooting

Q: I get ImportError: cannot import name '...' from 'tree_sitter'

  • A: Reinstall the dependencies. Tree-sitter requires a C compiler. On Windows, install Visual Studio Build Tools. On Mac/Linux, ensure gcc is installed.
    • pip uninstall tree-sitter tree-sitter-languages
    • pip install tree-sitter tree-sitter-languages --no-cache-dir

Q: The AI analysis is taking too long/costing too much.

  • A: The tool processes files sequentially. For large projects (>500 files), use the Docker method and let it run in the background. The Caching system ensures you only pay for the first run. Subsequent runs are free unless you modify the code.

Q: The graph is a giant messy hairball.

  • A: Legacy code often is a hairball! You can filter the view by modifying core/graph_builder.py so it only shows edges for files with Risk > 5 (a sketch of such a filter is shown below).
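
A filter along those lines could look like this (a sketch only; the edge and score shapes are assumed, not the actual graph_builder data structures):

def high_risk_edges(edges, scores, threshold=5):
    """Keep only edges whose source file scored above the risk threshold."""
    return [(src, dst) for src, dst in edges if scores.get(src, 0) > threshold]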

📜 License

MIT License. Feel free to fork, modify, and dig up your own digital ruins.
