cyberbobjr/open-deepwiki

open-deepwiki

open-deepwiki is a multi-language (Java, Python, TypeScript) codebase-indexing and Graph-Enriched RAG API with a "DeepWiki" documentation generator.

It parses code using tree-sitter, indexes methods and file summaries into a persistent Chroma collection, builds a dependency graph, and serves retrieval + chat endpoints via FastAPI.

What you get

  • Hybrid Indexing: Scans local directories or clones/pulls Git repositories (GitHub, GitLab).
  • Polyglot Support: Parses Java, Python, and TypeScript projects. Indexes code blocks, file summaries, and project graphs.
  • Graph-Enriched RAG: POST /api/v1/ask uses a retrieval strategy that combines vector search with call-graph traversal to find dependencies.
  • DeepWiki Documentation: Generates a static HTML documentation site ("DeepWiki") containing:
    • Project Overview: High-level architectural summary.
    • Feature Narratives: Functional stories linked to technical implementations.
    • Deep Dives: Technical details for each feature.
  • Interactive Web UI: A Vite + Vue front-end to manage projects, chat with the codebase, and view generated docs.
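As a sketch of how the POST /api/v1/ask endpoint could be called from Python: the field names in the payload below ("question", "project_name") are assumptions for illustration, not the documented schema (see USAGE.md for the real request shape).

```python
import json
import urllib.request

def build_ask_request(question: str, project_name: str,
                      base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build a POST request for /api/v1/ask. Payload field names are
    hypothetical; consult USAGE.md for the actual schema."""
    payload = json.dumps({"question": question,
                          "project_name": project_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_ask_request("Where is authentication handled?", "open-deepwiki")
    # urllib.request.urlopen(req) would send it to a running backend.
    print(req.full_url)
```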

Requirements

  • Python 3.10+
  • Node.js 18+ (only for the web UI)
  • Build toolchain for compiling the tree-sitter grammars (git plus a C/C++ compiler)

Setup

python3 -m venv venv
./venv/bin/pip install -r requirements.txt

Configuration

Configuration is loaded from open-deepwiki.yaml (see open-deepwiki.yaml.sample) or environment variables.

Key Settings

  • LLM Selection:

    • chat_model: Model for answering questions (e.g., gpt-4o-mini).
    • embeddings_model: Model for vectorizing code (e.g., text-embedding-3-large).
    • summarization_model: Model for generating documentation (e.g., mistralai/mistral-large-latest or gpt-4o).
  • API Keys:

    • OPENAI_API_KEY: Shared key for OpenAI services.
    • Or specific keys: embeddings_api_key, chat_api_key, summarization_api_key.
  • Indexing:

    • codebase_dir: Default root directory for local indexing.
    • git_clone_directory: Cache directory for Git repositories (default: ./git_cache).
    • git_access_token: Optional Personal Access Token (PAT) for private repositories.
    • project_name: Name for the indexed project scope.
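The layering of file settings and environment variables can be sketched as below. This is a minimal illustration, assuming environment variables override values from open-deepwiki.yaml and that env-var names are the uppercased setting keys; the project's actual precedence and naming may differ.

```python
import os

# Defaults taken from the settings documented above; the merge order
# (defaults < YAML file < environment) is an assumption for illustration.
DEFAULTS = {
    "chat_model": "gpt-4o-mini",
    "embeddings_model": "text-embedding-3-large",
    "git_clone_directory": "./git_cache",
}

def load_settings(yaml_values: dict, environ=os.environ) -> dict:
    settings = dict(DEFAULTS)
    settings.update(yaml_values)          # file values override defaults
    for key in settings:                  # env vars override file values
        env_val = environ.get(key.upper())
        if env_val is not None:
            settings[key] = env_val
    return settings
```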

Example open-deepwiki.yaml

chat_model: gpt-4o-mini
embeddings_model: text-embedding-3-large
summarization_model: mistralai/devstral-2512:free
llm_api_base: https://api.openai.com/v1
llm_api_key: sk-...

Run the backend

./venv/bin/python app.py

Health check: curl http://127.0.0.1:8000/api/v1/health

Run the web UI

cd front
npm install
npm run dev

The Vite dev server proxies /api/* to http://127.0.0.1:8000.

Run with Mock SSO (Optional)

To test the SSO flow locally without an external provider:

  1. Start the Mock IdP:

    ./venv/bin/python mock_idp.py

    This server runs on port 8080.

  2. Configure .env: Ensure your .env contains the local SSO configuration:

    OIDC_DISCOVERY_URL=http://localhost:8080/.well-known/openid-configuration
    OIDC_CLIENT_ID=mock_client
    OIDC_CLIENT_SECRET=mock_secret
  3. Login: Go to http://localhost:5173/login, click "SSO Login", and you will be redirected through the mock flow.

Core Capabilities

1. Hybrid Indexing Pipeline

The system strictly separates scanning from indexing using a "Clean Architecture" approach:

  • Polyglot Parsing: Uses ParserFactory to select JavaParser, PythonParser, or TypeScriptParser based on file extension (.java, .py, .ts, .tsx).
  • Graph Builder: Scans import and call statements to build a directed graph of dependencies in project_graph.sqlite3.
  • Vector Store: Indexes code blocks (methods/classes) and optional file-level summaries into ChromaDB collections.
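The extension-based dispatch behind ParserFactory can be sketched as follows. Class names mirror the ones named above, but their internals and the factory's actual interface are assumptions.

```python
import os

# Placeholder parser classes; the real ones wrap tree-sitter grammars.
class JavaParser: ...
class PythonParser: ...
class TypeScriptParser: ...

PARSER_BY_EXTENSION = {
    ".java": JavaParser,
    ".py": PythonParser,
    ".ts": TypeScriptParser,
    ".tsx": TypeScriptParser,
}

def parser_for(path: str):
    """Return a parser instance for a source file, or None if unsupported."""
    _, ext = os.path.splitext(path)
    cls = PARSER_BY_EXTENSION.get(ext)
    return cls() if cls else None
```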

2. Graph-Enriched RAG

The retrieval engine (GraphEnrichedRetriever) goes beyond simple similarity search:

  1. Vector Search: Finds the top-k most semantically relevant code blocks.
  2. Graph Expansion: For each result, looks up its outgoing "calls" edges in the generated graph.
  3. Context Enrichment: Fetches the code/docs of the called dependencies and injects them into the LLM context.
    • Result: The LLM sees not just function A, but also the signature/docs of function B that A calls.
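The three steps above can be sketched with toy data structures. The real GraphEnrichedRetriever reads edges from project_graph.sqlite3 and code from ChromaDB; here both are plain dicts for illustration.

```python
def enrich_with_callees(hits, calls_graph, code_by_symbol):
    """hits: ranked symbol names from vector search.
    calls_graph: symbol -> list of called symbols (outgoing "calls" edges).
    code_by_symbol: symbol -> source text.
    Returns an ordered, de-duplicated context of (symbol, code) pairs:
    each hit followed by the code of the functions it calls."""
    context, seen = [], set()
    for symbol in hits:
        for name in [symbol, *calls_graph.get(symbol, [])]:
            if name not in seen and name in code_by_symbol:
                seen.add(name)
                context.append((name, code_by_symbol[name]))
    return context
```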

3. "DeepWiki" Documentation Engine

The agentic pipeline (core/documentation) generates a full static documentation site in 5 distinct stages:

  1. File Summarization: Generates technical summaries for every source file.
  2. Feature Detection: Clusters files into "Features" using a hybrid approach (Vector Clustering + LLM Refinement).
  3. Module Synthesis: Generates architectural "READMEs" for every directory, explaining its role in the system.
  4. Project Overview: Synthesizes a high-level "Executive Summary" of the entire codebase.
  5. Site Generation: Outputs a structured HTML site with Mermaid diagrams visualizing the architecture.
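The five stages can be read as a simple sequential pipeline, sketched below with stub implementations. Every function body here is a placeholder; the real logic lives in core/documentation and each stage calls an LLM.

```python
def run_deepwiki_pipeline(files):
    """Illustrative orchestration of the five stages listed above."""
    summaries = {f: f"summary of {f}" for f in files}        # 1. file summarization
    features = {"all": sorted(summaries)}                    # 2. feature detection (stub)
    modules = {d: "module readme"                            # 3. module synthesis
               for d in {f.rsplit("/", 1)[0] for f in files}}
    overview = f"{len(files)} files, {len(features)} feature(s)"  # 4. project overview
    return {"overview": overview, "features": features,      # 5. site generation
            "modules": modules}
```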

API and examples

See USAGE.md for endpoint list and curl examples.

Semantic Indexing Evaluation

A dedicated script, evaluate_indexing.py, tests the quality of feature detection and module synthesis against a controlled Java fixture project.

Running the Evaluation

  1. Ensure the backend environment is active (source venv/bin/activate).

  2. Run the script:

    ./venv/bin/python evaluate_indexing.py

Interpreting Results

The script runs the SemanticSummarizer and FeatureDetector on a mini Java project located in fixtures/mini-java-project. It logs:

  • Module Detection: Whether the auth and billing modules were correctly identified.
  • Feature Detection: Whether "Authentication" and "Billing" features were extracted.
  • Facet Analysis: Recommendations for "Deep Dives" (e.g., Security, Data Flow).

Pass/Fail Criteria:

  • PASS: Core modules and features are found.
  • WARN: Some elements might be missing or misclassified.
  • FAIL: Critical components (like Auth feature) were missed.
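The grading logic can be sketched as set checks over the detected modules and features. The expected names come from the fixture described above; the exact grading rules in evaluate_indexing.py are assumed.

```python
# Expected names from fixtures/mini-java-project (per the README above);
# the grading thresholds below are an illustrative assumption.
EXPECTED_MODULES = {"auth", "billing"}
EXPECTED_FEATURES = {"Authentication", "Billing"}
CRITICAL_FEATURES = {"Authentication"}

def grade(found_modules, found_features):
    """FAIL if a critical feature is missed, PASS if everything expected
    was found, WARN otherwise."""
    if not CRITICAL_FEATURES <= set(found_features):
        return "FAIL"
    if (EXPECTED_MODULES <= set(found_modules)
            and EXPECTED_FEATURES <= set(found_features)):
        return "PASS"
    return "WARN"
```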
