open-deepwiki is a multi-language (Java, Python, TypeScript) codebase indexing + Graph-Enriched RAG API with a "DeepWiki" documentation generator.
It parses code using tree-sitter, indexes methods and file summaries into a persistent Chroma collection, builds a dependency graph, and serves retrieval + chat endpoints via FastAPI.
- Hybrid Indexing: Scans local directories or clones/pulls Git repositories (GitHub, GitLab).
- Polyglot Support: Parses Java, Python, and TypeScript projects. Indexes code blocks, file summaries, and project graphs.
- Graph-Enriched RAG: `POST /api/v1/ask` uses a retrieval strategy that combines vector search with call-graph traversal to find dependencies.
- DeepWiki Documentation: Generates a static HTML documentation site ("DeepWiki") containing:
- Project Overview: High-level architectural summary.
- Feature Narratives: Functional stories linked to technical implementations.
- Deep Dives: Technical details for each feature.
- Interactive Web UI: A Vite + Vue front-end to manage projects, chat with the codebase, and view generated docs.
- Python 3.10+
- Node.js 18+ (only for the web UI)
- git and a C/C++ compiler toolchain (needed to build the tree-sitter grammars)
```shell
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
```

Configuration is loaded from `open-deepwiki.yaml` (see `open-deepwiki.yaml.sample`) or environment variables.
- LLM Selection:
  - `chat_model`: Model for answering questions (e.g., `gpt-4o-mini`).
  - `embeddings_model`: Model for vectorizing code (e.g., `text-embedding-3-large`).
  - `summarization_model`: Model for generating documentation (e.g., `mistralai/mistral-large-latest` or `gpt-4o`).
- API Keys:
  - `OPENAI_API_KEY`: Shared key for OpenAI services.
  - Or specific keys: `embeddings_api_key`, `chat_api_key`, `summarization_api_key`.
- Indexing:
  - `codebase_dir`: Default root directory for local indexing.
  - `git_clone_directory`: Cache directory for Git repositories (default: `./git_cache`).
  - `git_access_token`: Optional Personal Access Token (PAT) for private repositories.
  - `project_name`: Name for the indexed project scope.
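As a rough illustration of how such settings could be resolved, here is a minimal, hypothetical loader in which environment variables override file values. The real project's loader, precedence rules, and parser may differ; the simplistic `key: value` line parsing below is a stand-in for a proper YAML parser.

```python
# Hypothetical sketch: resolve config from a YAML-like file, letting
# environment variables (UPPERCASE key names) take precedence.
# This is NOT the project's actual loader.
import os

def load_config(yaml_path: str, keys: list[str]) -> dict:
    """Read simple `key: value` pairs, then let env vars override them."""
    config: dict[str, str] = {}
    if os.path.exists(yaml_path):
        with open(yaml_path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and ":" in line:
                    key, _, value = line.partition(":")
                    config[key.strip()] = value.strip()
    for key in keys:
        # e.g. CHAT_MODEL in the environment overrides chat_model from the file
        env_value = os.environ.get(key.upper())
        if env_value is not None:
            config[key] = env_value
    return config
```

For example, with `CHAT_MODEL` exported in the shell, `load_config("open-deepwiki.yaml", ["chat_model"])` would return that value regardless of what the file says.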
```yaml
chat_model: gpt-4o-mini
embeddings_model: text-embedding-3-large
summarization_model: mistralai/devstral-2512:free
llm_api_base: https://api.openai.com/v1
llm_api_key: sk-....
```

Run the server:

```shell
./venv/bin/python app.py
```

Health check: `curl http://127.0.0.1:8000/api/v1/health`
```shell
cd front
npm install
npm run dev
```

The Vite dev server proxies `/api/*` to `http://127.0.0.1:8000`.
To test the SSO flow locally without an external provider:
- Start the Mock IdP:

  ```shell
  ./venv/bin/python mock_idp.py
  ```

  This server runs on port 8080.

- Configure `.env`: Ensure your `.env` contains the local SSO configuration:

  ```
  OIDC_DISCOVERY_URL=http://localhost:8080/.well-known/openid-configuration
  OIDC_CLIENT_ID=mock_client
  OIDC_CLIENT_SECRET=mock_secret
  ```

- Login: Go to `http://localhost:5173/login`, click "SSO Login", and you will be redirected through the mock flow.
The system strictly separates scanning from indexing using a "Clean Architecture" approach:
- Polyglot Parsing: Uses `ParserFactory` to select `JavaParser`, `PythonParser`, or `TypeScriptParser` based on file extension (`.java`, `.py`, `.ts`, `.tsx`).
- Graph Builder: Scans `import` and `call` statements to build a directed graph of dependencies in `project_graph.sqlite3`.
- Vector Store: Indexes code blocks (methods/classes) and optional file-level summaries into ChromaDB collections.
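A minimal sketch of the graph-builder idea: dependency edges stored in SQLite and queried per symbol. The table layout and example symbols below are illustrative assumptions, not the actual `project_graph.sqlite3` schema.

```python
# Illustrative sketch: a "calls"/"imports" edge table in SQLite.
# Schema and symbol names are assumptions for demonstration only.
import sqlite3

conn = sqlite3.connect(":memory:")  # the project persists this to project_graph.sqlite3
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("billing.Invoice.total", "billing.Tax.rate", "calls"),
        ("billing.Invoice.total", "billing.Discount.apply", "calls"),
        ("api.routes", "billing.Invoice", "imports"),
    ],
)

def outgoing_calls(symbol: str) -> list[str]:
    """Return the symbols this symbol calls (one-hop graph traversal)."""
    rows = conn.execute(
        "SELECT dst FROM edges WHERE src = ? AND kind = 'calls'", (symbol,)
    )
    return [dst for (dst,) in rows]
```

A single indexed table like this keeps one-hop lookups cheap, which matters because the retriever performs them per search hit.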
The retrieval engine (`GraphEnrichedRetriever`) goes beyond simple similarity search:
- Vector Search: Finds the top-k most semantically relevant code blocks.
- Graph Expansion: For each result, looks up its outgoing "calls" edges in the generated graph.
- Context Enrichment: Fetches the code/docs of the called dependencies and injects them into the LLM context.
- Result: The LLM sees not just `function A`, but also the signature/docs of `function B` that `A` calls.
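The enrichment steps above can be sketched as follows. The `CALL_GRAPH` and `DOCS` dictionaries are illustrative stand-ins for the SQLite graph and the ChromaDB document store, not the project's actual data structures.

```python
# Illustrative sketch of graph-enriched retrieval:
# vector hits -> one-hop "calls" expansion -> enriched LLM context.
CALL_GRAPH = {"function_a": ["function_b"]}  # stand-in for the dependency graph
DOCS = {
    "function_a": "def function_a(): ... orchestrates checkout",
    "function_b": "def function_b(tax_rate): ... computes totals",
}

def enrich_context(vector_hits: list[str]) -> list[str]:
    context: list[str] = []
    seen: set[str] = set()
    for hit in vector_hits:                            # 1. vector search results
        for symbol in [hit, *CALL_GRAPH.get(hit, [])]: # 2. one-hop graph expansion
            if symbol not in seen:
                seen.add(symbol)
                context.append(DOCS[symbol])           # 3. inject dependency docs
    return context
```

Deduplicating via `seen` matters in practice: several hits often call the same helper, and the context window should not pay for it twice.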
The agentic pipeline (core/documentation) generates a full static documentation site in 5 distinct stages:
- File Summarization: Generates technical summaries for every source file.
- Feature Detection: Clusters files into "Features" using a hybrid approach (Vector Clustering + LLM Refinement).
- Module Synthesis: Generates architectural "READMEs" for every directory, explaining its role in the system.
- Project Overview: Synthesizes a high-level "Executive Summary" of the entire codebase.
- Site Generation: Outputs a structured HTML site with Mermaid diagrams visualizing the architecture.
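The vector-clustering half of the Feature Detection stage can be illustrated with a toy sketch: assign each file's embedding to the nearest seed centroid by cosine similarity. In the real pipeline an LLM then refines and names the clusters; the vectors and seed names below are made up.

```python
# Toy sketch of vector clustering for feature detection.
# Embeddings and seed centroids are fabricated for illustration.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cluster_files(embeddings: dict[str, list[float]],
                  seeds: dict[str, list[float]]) -> dict[str, list[str]]:
    """Assign each file to the feature whose seed centroid it is closest to."""
    features: dict[str, list[str]] = {name: [] for name in seeds}
    for path, vec in embeddings.items():
        best = max(seeds, key=lambda name: cosine(vec, seeds[name]))
        features[best].append(path)
    return features
```

Real code embeddings live in a much higher-dimensional space, but the assignment rule is the same.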
See USAGE.md for endpoint list and curl examples.
A dedicated script, `evaluate_indexing.py`, is available to test the quality of feature detection and module synthesis against a controlled Java fixture project.
- Ensure the backend environment is active (`source venv/bin/activate`).
- Run the script:

  ```shell
  ./venv/bin/python evaluate_indexing.py
  ```
The script runs the `SemanticSummarizer` and `FeatureDetector` on a mini Java project located in `fixtures/mini-java-project`. It logs:
- Module Detection: Whether the `auth` and `billing` modules were correctly identified.
- Feature Detection: Whether "Authentication" and "Billing" features were extracted.
- Facet Analysis: Recommendations for "Deep Dives" (e.g., Security, Data Flow).
Pass/Fail Criteria:
- PASS: Core modules and features are found.
- WARN: Some elements might be missing or misclassified.
- FAIL: Critical components (like the Auth feature) were missed.
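A hedged sketch of what such grading logic might look like; the actual `evaluate_indexing.py` may weigh or name its criteria differently.

```python
# Hypothetical grading rule mirroring the PASS/WARN/FAIL criteria above.
# The critical/expected sets are assumptions for illustration.
def grade(found_features: set[str]) -> str:
    critical = {"Authentication"}               # missing any of these => FAIL
    expected = {"Authentication", "Billing"}    # all found => PASS
    if not critical <= found_features:
        return "FAIL"
    if expected <= found_features:
        return "PASS"
    return "WARN"
```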