open-deepwiki is a multi-language (Java, Python, TypeScript) codebase indexing + Graph-Enriched RAG API with a "DeepWiki" documentation generator.
It parses code using tree-sitter, indexes methods and file summaries into a persistent Chroma collection, builds a dependency graph, and serves retrieval + chat endpoints via FastAPI.
- Hybrid Indexing: Scans local directories or clones/pulls Git repositories (GitHub, GitLab).
- Polyglot Support: Parses Java, Python, and TypeScript projects. Indexes code blocks, file summaries, and project graphs.
- Graph-Enriched RAG: `POST /api/v1/ask` uses a retrieval strategy that combines vector search with call-graph traversal to find dependencies.
- DeepWiki Documentation: Generates a static HTML documentation site ("DeepWiki") containing:
- Project Overview: High-level architectural summary.
- Feature Narratives: Functional stories linked to technical implementations.
- Deep Dives: Technical details for each feature.
- Interactive Web UI: A Vite + Vue front-end to manage projects, chat with the codebase, and view generated docs.
- Python 3.10+
- Node.js 18+ (only for the web UI)
- git and a C/C++ compiler toolchain (needed to build the tree-sitter grammars)
```shell
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
```

Configuration is loaded from `open-deepwiki.yaml` (see `open-deepwiki.yaml.sample`) or environment variables.
- LLM Selection:
  - `chat_model`: Model for answering questions (e.g., `gpt-4o-mini`).
  - `embeddings_model`: Model for vectorizing code (e.g., `text-embedding-3-large`).
  - `summarization_model`: Model for generating documentation (e.g., `mistralai/mistral-large-latest` or `gpt-4o`).
- API Keys:
  - `OPENAI_API_KEY`: Shared key for OpenAI services.
  - Or specific keys: `embeddings_api_key`, `chat_api_key`, `summarization_api_key`.
- Indexing:
  - `codebase_dir`: Default root directory for local indexing.
  - `git_clone_directory`: Cache directory for Git repositories (default: `./git_cache`).
  - `git_access_token`: Optional Personal Access Token (PAT) for private repositories.
  - `project_name`: Name for the indexed project scope.
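As a rough illustration of how such settings could be resolved, here is a minimal, hypothetical loader in which environment variables override file values. The real project's loader, precedence rules, and parser may differ; the simplistic `key: value` line parsing below is a stand-in for a proper YAML parser.

```python
# Hypothetical sketch: resolve config from a YAML-like file, letting
# environment variables (UPPERCASE key names) take precedence.
# This is NOT the project's actual loader.
import os

def load_config(yaml_path: str, keys: list[str]) -> dict:
    """Read simple `key: value` pairs, then let env vars override them."""
    config: dict[str, str] = {}
    if os.path.exists(yaml_path):
        with open(yaml_path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and ":" in line:
                    key, _, value = line.partition(":")
                    config[key.strip()] = value.strip()
    for key in keys:
        # e.g. CHAT_MODEL in the environment overrides chat_model from the file
        env_value = os.environ.get(key.upper())
        if env_value is not None:
            config[key] = env_value
    return config
```

For example, with `CHAT_MODEL` exported in the shell, `load_config("open-deepwiki.yaml", ["chat_model"])` would return that value regardless of what the file says.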
```yaml
chat_model: gpt-4o-mini
embeddings_model: text-embedding-3-large
summarization_model: mistralai/devstral-2512:free
llm_api_base: https://api.openai.com/v1
llm_api_key: sk-....
```

Run the server:

```shell
./venv/bin/python app.py
```

Health check: `curl http://127.0.0.1:8000/api/v1/health`
```shell
cd front
npm install
npm run dev
```

The Vite dev server proxies `/api/*` to `http://127.0.0.1:8000`.
To test the SSO flow locally without an external provider:
- Start the Mock IdP:

  ```shell
  ./venv/bin/python mock_idp.py
  ```

  This server runs on port 8080.

- Configure `.env`: Ensure your `.env` contains the local SSO configuration:

  ```
  OIDC_DISCOVERY_URL=http://localhost:8080/.well-known/openid-configuration
  OIDC_CLIENT_ID=mock_client
  OIDC_CLIENT_SECRET=mock_secret
  ```

- Login: Go to `http://localhost:5173/login`, click "SSO Login", and you will be redirected through the mock flow.
The system strictly separates scanning from indexing using a "Clean Architecture" approach:
- Polyglot Parsing: Uses `ParserFactory` to select `JavaParser`, `PythonParser`, or `TypeScriptParser` based on file extension (`.java`, `.py`, `.ts`, `.tsx`).
- Graph Builder: Scans `import` and `call` statements to build a directed graph of dependencies in `project_graph.sqlite3`.
- Vector Store: Indexes code blocks (methods/classes) and optional file-level summaries into ChromaDB collections.
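A minimal sketch of the graph-builder idea: dependency edges stored in SQLite and queried per symbol. The table layout and example symbols below are illustrative assumptions, not the actual `project_graph.sqlite3` schema.

```python
# Illustrative sketch: a "calls"/"imports" edge table in SQLite.
# Schema and symbol names are assumptions for demonstration only.
import sqlite3

conn = sqlite3.connect(":memory:")  # the project persists this to project_graph.sqlite3
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("billing.Invoice.total", "billing.Tax.rate", "calls"),
        ("billing.Invoice.total", "billing.Discount.apply", "calls"),
        ("api.routes", "billing.Invoice", "imports"),
    ],
)

def outgoing_calls(symbol: str) -> list[str]:
    """Return the symbols this symbol calls (one-hop graph traversal)."""
    rows = conn.execute(
        "SELECT dst FROM edges WHERE src = ? AND kind = 'calls'", (symbol,)
    )
    return [dst for (dst,) in rows]
```

A single indexed table like this keeps one-hop lookups cheap, which matters because the retriever performs them per search hit.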
The retrieval engine (`GraphEnrichedRetriever`) goes beyond simple similarity search:
- Vector Search: Finds the top-k most semantically relevant code blocks.
- Graph Expansion: For each result, looks up its outgoing "calls" edges in the generated graph.
- Context Enrichment: Fetches the code/docs of the called dependencies and injects them into the LLM context.
- Result: The LLM sees not just `function A`, but also the signature/docs of `function B` that `A` calls.
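The enrichment steps above can be sketched as follows. The `CALL_GRAPH` and `DOCS` dictionaries are illustrative stand-ins for the SQLite graph and the ChromaDB document store, not the project's actual data structures.

```python
# Illustrative sketch of graph-enriched retrieval:
# vector hits -> one-hop "calls" expansion -> enriched LLM context.
CALL_GRAPH = {"function_a": ["function_b"]}  # stand-in for the dependency graph
DOCS = {
    "function_a": "def function_a(): ... orchestrates checkout",
    "function_b": "def function_b(tax_rate): ... computes totals",
}

def enrich_context(vector_hits: list[str]) -> list[str]:
    context: list[str] = []
    seen: set[str] = set()
    for hit in vector_hits:                            # 1. vector search results
        for symbol in [hit, *CALL_GRAPH.get(hit, [])]: # 2. one-hop graph expansion
            if symbol not in seen:
                seen.add(symbol)
                context.append(DOCS[symbol])           # 3. inject dependency docs
    return context
```

Deduplicating via `seen` matters in practice: several hits often call the same helper, and the context window should not pay for it twice.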
The agentic pipeline (core/documentation) generates a full static documentation site in 5 distinct stages:
- File Summarization: Generates technical summaries for every source file.
- Feature Detection: Clusters files into "Features" using a hybrid approach (Vector Clustering + LLM Refinement).
- Module Synthesis: Generates architectural "READMEs" for every directory, explaining its role in the system.
- Project Overview: Synthesizes a high-level "Executive Summary" of the entire codebase.
- Site Generation: Outputs a structured HTML site with Mermaid diagrams visualizing the architecture.
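The vector-clustering half of the Feature Detection stage can be illustrated with a toy sketch: assign each file's embedding to the nearest seed centroid by cosine similarity. In the real pipeline an LLM then refines and names the clusters; the vectors and seed names below are made up.

```python
# Toy sketch of vector clustering for feature detection.
# Embeddings and seed centroids are fabricated for illustration.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cluster_files(embeddings: dict[str, list[float]],
                  seeds: dict[str, list[float]]) -> dict[str, list[str]]:
    """Assign each file to the feature whose seed centroid it is closest to."""
    features: dict[str, list[str]] = {name: [] for name in seeds}
    for path, vec in embeddings.items():
        best = max(seeds, key=lambda name: cosine(vec, seeds[name]))
        features[best].append(path)
    return features
```

Real code embeddings live in a much higher-dimensional space, but the assignment rule is the same.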
See USAGE.md for endpoint list and curl examples.
A dedicated script, `evaluate_indexing.py`, is available to test the quality of feature detection and module synthesis against a controlled Java fixture project.
- Ensure the backend environment is active (`source venv/bin/activate`).
- Run the script:

  ```shell
  ./venv/bin/python evaluate_indexing.py
  ```
The script runs the `SemanticSummarizer` and `FeatureDetector` on a mini Java project located in `fixtures/mini-java-project`. It logs:
- Module Detection: Whether the `auth` and `billing` modules were correctly identified.
- Feature Detection: Whether "Authentication" and "Billing" features were extracted.
- Facet Analysis: Recommendations for "Deep Dives" (e.g., Security, Data Flow).
Pass/Fail Criteria:
- PASS: Core modules and features are found.
- WARN: Some elements might be missing or misclassified.
- FAIL: Critical components (like the Auth feature) were missed.
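A hedged sketch of what such grading logic might look like; the actual `evaluate_indexing.py` may weigh or name its criteria differently.

```python
# Hypothetical grading rule mirroring the PASS/WARN/FAIL criteria above.
# The critical/expected sets are assumptions for illustration.
def grade(found_features: set[str]) -> str:
    critical = {"Authentication"}               # missing any of these => FAIL
    expected = {"Authentication", "Billing"}    # all found => PASS
    if not critical <= found_features:
        return "FAIL"
    if expected <= found_features:
        return "PASS"
    return "WARN"
```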