"Turn spaghetti code into a treasure map."
Legacy Code Archaeologist is a CLI tool designed to audit, map, and analyze complex legacy codebases. It combines static analysis (Tree-sitter) with semantic AI analysis (LLMs) to generate an interactive HTML Knowledge Graph.
It answers the question: "Which files are the 'God Objects' that break everything when I touch them?"
- 🗺️ Interactive Visualization: Generates a Mermaid.js graph showing file dependencies.
- 🤖 AI Risk Scoring: Uses OpenAI (GPT-4) to read code, summarize business logic, and assign a Complexity Score (1-10).
- ⚡ Smart Caching: Stores an MD5 hash of each file's content in SQLite, so you never pay to analyze an unchanged file twice.
- 🌳 Robust Parsing: Uses Tree-sitter instead of Regex, so it understands code structure even if the syntax is messy.
- 🐳 Containerized: Ready to run via Docker to avoid dependency hell.
The easiest way to run the tool without compiling C-dependencies manually.
- Clone the repository:
  git clone https://github.com/yourusername/legacy-archaeologist.git
  cd legacy-archaeologist
- Set your API Key: Create a .env file in the root:
  echo "OPENAI_API_KEY=sk-your-api-key-here" > .env
- Run the Audit: Map your target project to the /codebase volume.
  # Update the path below to point to the project you want to analyze
  docker-compose run --rm archeologist audit /codebase --output reports/audit.html
  (Note: Ensure your docker-compose.yml mounts the volume correctly as defined in the blueprints).
If you prefer running it natively, you will need Python 3.11+ and a C compiler (GCC/Clang) for Tree-sitter.
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Create a .env file in the root directory:
OPENAI_API_KEY=sk-your-openai-key
If no API key is provided, the tool runs in "Offline Mode" (structure only, no summaries).
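The fallback to Offline Mode could be decided at startup with a check like the one below. This is a minimal sketch, not the tool's actual code: the helper name `resolve_mode` is hypothetical, only the `OPENAI_API_KEY` variable name comes from this README, and it assumes the .env file has already been loaded into the process environment.

```python
import os


def resolve_mode(env=None):
    """Return "online" when an API key is set, otherwise "offline".

    Hypothetical helper: the real tool may make this decision differently.
    """
    env = os.environ if env is None else env
    # Treat a missing or whitespace-only key as "no key provided".
    key = env.get("OPENAI_API_KEY", "").strip()
    return "online" if key else "offline"


if __name__ == "__main__":
    print(f"Running in {resolve_mode()} mode")
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise in isolation.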
# Basic Audit
python main.py audit /path/to/target/project
# Specify Output Location
python main.py audit ./my-project --output ./results/map.html
Open the generated HTML file in your browser.
- Nodes: Represent source files.
- Arrows: Represent imports/dependencies.
- Colors:
- 🔴 Red (Danger): High Risk (Score 8-10). Complex logic, high coupling.
- 🟠 Orange (Warning): Moderate Risk (Score 5-7).
- 🟢 Green (Safe): Low Risk (Score 1-4). Simple utilities or interfaces.
- Summary: A 1-sentence explanation of what the code actually does (generated by AI).
- Tags: Keywords like Auth, Database, Legacy, API.
- Metrics: Function count and Import count.
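The color bands in the legend can be expressed as a small lookup. The sketch below uses the score ranges listed above. The function name `risk_color` and the returned color strings are illustrative assumptions, not names taken from the tool.

```python
def risk_color(score: int) -> str:
    """Map a 1-10 risk score to the legend's color band (hypothetical helper)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score >= 8:
        return "red"      # High Risk (8-10)
    if score >= 5:
        return "orange"   # Moderate Risk (5-7)
    return "green"        # Low Risk (1-4)
```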
The tool follows a pipeline architecture:
- FileWalker (core/file_walker.py): Recursively scans the directory, ignoring .git, node_modules, and venv.
- Parser (core/parser_engine.py): Uses Tree-sitter to extract the Concrete Syntax Tree (CST). It identifies classes, functions, and imports.
- Cache Check (core/cache_manager.py): Calculates the MD5 hash of the file content. If the hash exists in archeology_cache.db, the stored result is loaded locally instead of calling the API.
- AI Analyst (ai/summarizer.py): If not cached, sends the code "Skeleton" to OpenAI. The prompt forces a structured JSON response containing the Risk Score and Summary.
- Graph Builder (core/graph_builder.py): Compiles the nodes and edges into Mermaid syntax, handling ID sanitization to prevent graph breakage.
- Reporter (main.py): Injects the Mermaid Syntax and HTML Cards into templates/report_template.html.
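The FileWalker and Cache Check stages can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in core/file_walker.py or core/cache_manager.py: the function names, the `analysis` table and its columns are all hypothetical, and the `analyze` callback stands in for the AI Analyst stage.

```python
import hashlib
import os
import sqlite3

# Directory names the README says the walker skips.
IGNORED_DIRS = {".git", "node_modules", "venv"}


def walk_source_files(root):
    """Yield file paths under root, skipping ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def content_hash(path):
    """Return the MD5 hex digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def open_cache(db_path):
    """Open the SQLite cache, creating the (assumed) table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS analysis (hash TEXT PRIMARY KEY, result TEXT)"
    )
    return conn


def get_or_analyze(conn, path, analyze):
    """Return (result, was_cached); call analyze(path) only on a cache miss."""
    digest = content_hash(path)
    row = conn.execute(
        "SELECT result FROM analysis WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0], True
    result = analyze(path)
    conn.execute(
        "INSERT INTO analysis (hash, result) VALUES (?, ?)", (digest, result)
    )
    conn.commit()
    return result, False
```

Because the cache key is the content hash rather than the path, renaming a file does not trigger a new analysis, while editing it does.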
Q: I get ImportError: cannot import name '...' from 'tree_sitter'
- A: Reinstall the dependencies. Tree-sitter requires a C compiler. On Windows, install Visual Studio Build Tools. On Mac/Linux, ensure gcc is installed.
  pip uninstall tree-sitter tree-sitter-languages
  pip install tree-sitter tree-sitter-languages --no-cache-dir
Q: The AI analysis is taking too long/costing too much.
- A: The tool processes files sequentially. For large projects (>500 files), use the Docker method and let it run in the background. The caching system ensures you only pay for the first run. Subsequent runs are free for unchanged files; only files whose content has changed are sent for analysis again.
Q: The graph is a giant messy hairball.
- A: Legacy code is often a hairball! However, you can filter the view by modifying core/graph_builder.py to only show edges for files with Risk > 5.
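One way such a filter might look is sketched below. It is a rough illustration, not the contents of core/graph_builder.py: the function names and input shapes are hypothetical, and it assumes an edge is kept when either endpoint scores above the threshold (the README does not say whether one or both files must qualify).

```python
import re


def mermaid_id(path):
    """Turn a file path into an ID made only of letters, digits and underscores."""
    return "n_" + re.sub(r"[^0-9A-Za-z_]", "_", path)


def build_filtered_graph(edges, risk_scores, threshold=5):
    """Return Mermaid source keeping only edges that touch a high-risk file.

    edges: iterable of (source_path, target_path) tuples.
    risk_scores: dict mapping file path to its 1-10 score; unknown files count as 0.
    """
    lines = ["graph TD"]
    for source, target in edges:
        risky = (
            risk_scores.get(source, 0) > threshold
            or risk_scores.get(target, 0) > threshold
        )
        if risky:
            lines.append(
                f'    {mermaid_id(source)}["{source}"] --> '
                f'{mermaid_id(target)}["{target}"]'
            )
    return "\n".join(lines)


if __name__ == "__main__":
    demo_edges = [("core/a.py", "core/b.py"), ("util/x.py", "util/y.py")]
    demo_scores = {"core/a.py": 9, "core/b.py": 2, "util/x.py": 3, "util/y.py": 5}
    print(build_filtered_graph(demo_edges, demo_scores))
```

Note that this simple sanitizer can map two different paths to the same ID (for example a/b.py and a_b.py), so a real implementation would need a collision-safe scheme.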
MIT License. Feel free to fork, modify, and dig up your own digital ruins.