create_pg_super_document

Table of Contents

  • Overview
  • Disclaimer
  • Prerequisites
  • Installation & Setup
  • Getting Started
  • Directory & File Structure
  • GNU GLOBAL Installation and Usage
  • Recommended Execution Flow
  • AI Documentation Generation
  • DuckDB Table Schema & Data Flow Overview
  • Generated Document Examples
  • Advanced Topics
  • Additional Resources

Overview

create_pg_super_document is a project that generates documentation for all symbols in the PostgreSQL codebase, then utilizes these symbol documents to create high-quality documentation for functionalities, components, and other elements using AI agents.

This repository provides scripts for generating symbol documentation and includes the generated data in multiple formats (Markdown files, DuckDB database files, JSON files, CSV files, etc.).

Additionally, the high-quality documents created using this data are also available in this repository!

Disclaimer

Documents are generated by LLM-based AI agents, so accuracy is not guaranteed. Please verify the generated documents against the actual source code.

Prerequisites

Assumed PostgreSQL Code Tree

The data included in this repository was generated from the PostgreSQL source tree at commit 92268b35d04c2de416279f187d12f264afa22614 (see "2. Obtain the PostgreSQL Source Tree" below).

Dependencies

  • Python 3.x (see the python_version file for the specific version)
  • GNU GLOBAL (gtags)
  • Python packages listed in requirements.txt

Installation & Setup

Install required packages:

pip install -r requirements.txt

(Install GNU GLOBAL first, as described below, if it is not yet installed.)

Getting Started

Before running the scripts in this repository, it is recommended to create the following directories (skip if they already exist):

mkdir -p data
mkdir -p output/temp
  • data/ ... For storing AI documentation generation batches, DBs, and various metadata
  • output/ ... Storage for generated and temporary files (e.g., symbol_references.csv)
  • output/temp/ ... Storage for temporary intermediate files and in-process results.
    Some scripts use output/temp as a working directory.

Directory & File Structure

  • misc/ ... Miscellaneous files (e.g., pre/post data processing scripts)
    • extract_readme_file_header_comments.py ... Extract README header comments
    • extract_symbol_references.py ... Extract symbol reference relationships
    • import_symbol_reference.py ... Import symbol reference information
    • process_symbol_definitions.py ... Process symbol definition information
    • filter_frequent_symbol_from_csv.py ... Filter frequent symbols in CSV
    • set_file_end_lines.py ... Set file end lines
    • update_symbol_types.py ... Update symbol type information
    • create_duckdb_index.py ... Create DuckDB index
    • update_markdown_links.py ... Update links in markdown files to create cross-references between generated documents
    • add_github_location_links.py ... Add GitHub links to the location sections in generated markdown documents
    • cleanup_link_backticks.py ... Clean up backticks surrounding markdown links in generated documents
  • requirements.txt ... Required packages
  • scripts/ ... Auxiliary scripts
    • prepare_cluster.py ... Symbol clustering and AI batch preparation (must be run before documentation generation)
    • orchestrator.py ... High-level coordination of AI documentation generation
    • mcp_tool.py ... Tool interface (Model Context Protocol) used by the AI agent to retrieve context and persist generated documents
  • ENTRY_POINTS.md ... Entry point explanations (AI generated)
  • CODE_TREE.md ... Overview of PostgreSQL's code tree structure (AI generated)

GNU GLOBAL Installation and Usage

In the initial stage of this project, we build symbol definition and reference indices from the PostgreSQL source tree using GNU GLOBAL (gtags). The resulting data is then imported and refined into DuckDB.
Follow the steps below (do this before running the Python scripts).

1. Install GNU GLOBAL

Common installation methods:

macOS (Homebrew)

brew install global

Debian / Ubuntu

sudo apt-get update
sudo apt-get install -y global exuberant-ctags

(If you already use universal-ctags instead of exuberant-ctags, that is fine.)

From Source

wget https://ftp.gnu.org/pub/gnu/global/global-latest.tar.gz
tar xf global-latest.tar.gz
cd global-*/
./configure --prefix=/usr/local
make -j$(nproc)
sudo make install
global --version

2. Obtain the PostgreSQL Source Tree

(If already cloned, skip.)

git clone https://github.com/postgres/postgres.git
cd postgres
# Checkout the specific commit used for the data included in this repository (optional)
git checkout 92268b35d04c2de416279f187d12f264afa22614

3. Create GNU GLOBAL Index Files

Run gtags at the root of the PostgreSQL source directory.

cd /path/to/postgres
gtags --gtagslabel=ctags

Generated key files:

  • GTAGS : Definitions database
  • GRTAGS : References database
  • GPATH : Path mapping

Incremental update after source changes:

gtags --incremental

If you manage multiple branches or trees on the same machine, you can isolate object directories:

export GTAGSOBJDIRPREFIX=$HOME/.gtags/objs
gtags

(Each source root will then get its own object directory automatically.)

4. Basic Usage of the global Command

global queries the index to locate symbol definitions and references.

Frequently used options:

  • Definitions: global -d SYMBOL
  • References (callers, usages): global -r SYMBOL
  • Cross reference with line numbers: global -x SYMBOL
  • Prefix completion: global -c PREFIX
  • Symbols not defined in GTAGS (referenced-only symbols): global -s SYMBOL
  • List symbols defined in a file: global -f path/to/file.c
  • Regex match across tags: global -a -e 'Exec[A-Za-z0-9_]+'
  • Detailed reference lines (used in this repo): global -rx SYMBOL

Examples:

# Definition of heap_insert
global -x heap_insert

# All places referencing ExecProcNode
global -rx ExecProcNode

# List all symbols in parse_expr.c
global -f src/backend/parser/parse_expr.c

5. Mapping to This Repository's Scripts

  • create_duckdb_index.py ... uses the GTAGS definitions database; prerequisite: run gtags in the PostgreSQL root beforehand
  • extract_symbol_references.py ... uses global -rx SYMBOL (requires GTAGS/GRTAGS/GPATH); prerequisite: same as above
  • filter_frequent_symbol_from_csv.py ... processes the CSV produced upstream; prerequisite: the upstream CSV exists
  • import_symbol_reference.py ... imports the filtered reference CSV; prerequisite: DuckDB DB + CSV
  • Other update/processing scripts ... operate on the DuckDB tables; prerequisite: create_duckdb_index.py completed

When running create_duckdb_index.py, pass the PostgreSQL source directory path as the argument. That directory must already contain the generated GTAGS files.

6. Troubleshooting

  • global: GTAGS not found. ... index not created; run gtags at the source root
  • Empty reference results ... macro expansion or an alternative symbol name; try related identifiers, adjust the label, or verify the commit
  • Outdated results ... source changed without reindexing; run gtags --incremental or regenerate
  • High memory usage ... repeated full-tree reindexing; limit the scope (invoke on a subset) or ensure adequate RAM

Recommended Execution Flow

The scripts in this project incrementally build and process symbol information from the PostgreSQL code tree into a DuckDB database, then generate documentation using AI agents.
The processing targets and recommended execution order for each script are as follows:

1. Extract Symbol Definition Information & Build DB

  • create_duckdb_index.py
    Based on the index output from GNU GLOBAL, creates and stores the symbol_definitions table in global_symbols.db.
    • Main columns:
      • id (primary key), symbol_name, file_path, line_num_start, line_num_end, line_content, contents
    • Must be run first to import all symbol information into the DB.
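
As a quick sanity check after this step, you can inspect the imported rows with DuckDB's Python API. The following is a minimal sketch: the database and table names come from this README, while the database path and the sample symbol name (heap_insert) are illustrative, so adjust them to your setup.

import duckdb

# Open the database produced by create_duckdb_index.py
# (adjust the path if your global_symbols.db lives elsewhere)
con = duckdb.connect("global_symbols.db", read_only=True)

# Count the imported symbol definitions
total = con.execute("SELECT COUNT(*) FROM symbol_definitions").fetchone()[0]
print(f"imported symbols: {total}")

# Look up one symbol by name (heap_insert is just an example)
rows = con.execute(
    "SELECT id, symbol_name, file_path, line_num_start "
    "FROM symbol_definitions WHERE symbol_name = ?",
    ["heap_insert"],
).fetchall()
print(rows)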

2. Set File End Lines

  • set_file_end_lines.py
    Sets the range (line_num_end) for each symbol definition within the file.
    • Target: symbol_definitions table (updates the line_num_end column)

3. Extract & Organize Symbol Reference Relationships

  • extract_symbol_references.py
    Uses global -rx for each symbol in symbol_definitions to output reference relationships as symbol_references.csv.

  • filter_frequent_symbol_from_csv.py
    Filters frequent symbols and unnecessary references to generate symbol_references_filtered.csv.

  • import_symbol_reference.py
    Imports symbol_references_filtered.csv into the symbol_reference table in global_symbols.db.

    • Columns:
      • from_node (source symbol ID), to_node (target symbol ID), line_num_in_from
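
For illustration, the core of the extraction step can be sketched as follows: invoke global -rx for each symbol and parse the ctags-x style output (tag name, line number, file path, line content). This is a minimal sketch, not the actual implementation of extract_symbol_references.py.

import subprocess

def references_to(symbol, source_root):
    """Run `global -rx SYMBOL` and yield (file_path, line_num, line_content)."""
    result = subprocess.run(
        ["global", "-rx", symbol],
        cwd=source_root,  # must contain the GTAGS/GRTAGS/GPATH files
        capture_output=True,
        text=True,
    )
    for line in result.stdout.splitlines():
        # ctags-x format: <tag> <line number> <path> <line content>
        parts = line.split(None, 3)
        if len(parts) == 4:
            _tag, line_num, path, content = parts
            yield path, int(line_num), content

# Example: list all references to ExecProcNode
for ref in references_to("ExecProcNode", "/path/to/postgres"):
    print(ref)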

4. Automatic Assignment of Symbol Type Information

  • update_symbol_types.py
    Adds a symbol_type column to the symbol_definitions table and automatically estimates and records each symbol's type (function, variable, type, etc.).

5. Additional Processing & Deduplication of Symbol Definitions

  • process_symbol_definitions.py
    Cleans up duplicate definitions and unnecessary data, outputs statistics, etc.

6. Symbol Clustering & Batch Preparation (Required Before AI Documentation Generation)

  • scripts/prepare_cluster.py
    Automatically clusters symbols based on dependencies and prepares batches (e.g., data/processing_batches.json) for AI documentation generation.
    Be sure to run this step.
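
After this step, the clusters that were created can be inspected in data/metadata.duckdb. This is a minimal sketch based on the clusters table schema documented below:

import duckdb

con = duckdb.connect("data/metadata.duckdb", read_only=True)
rows = con.execute(
    "SELECT cluster_id, cluster_type, layer, estimated_tokens "
    "FROM clusters ORDER BY layer, cluster_id LIMIT 10"
).fetchall()
for row in rows:
    print(row)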

7. Documentation Generation & AI Integration

  • scripts/orchestrator.py
    Orchestrates automatic documentation generation and management by AI (e.g., Claude Code) based on symbol, reference, and cluster information in DuckDB.
    • Also supports batch processing plans via data/processing_batches.json

Example Execution (Typical Flow)

# 0. (Prerequisite) Prepare GNU GLOBAL index inside PostgreSQL source tree
#    cd /path/to/postgres && gtags --gtagslabel=ctags

# 1. Import symbol definitions into DB
python misc/create_duckdb_index.py <source_directory>

# 2. Complete symbol range information
python misc/set_file_end_lines.py

# 3. Extract and filter reference relationships
python misc/extract_symbol_references.py
python misc/filter_frequent_symbol_from_csv.py

# 4. Import reference information into DB
python misc/import_symbol_reference.py

# 5. Assign symbol type information
python misc/update_symbol_types.py

# 6. Deduplication and final processing
python misc/process_symbol_definitions.py

# 7. Symbol clustering and batch generation (required)
python scripts/prepare_cluster.py

# 8. AI-based documentation generation (tool-assisted)
python scripts/orchestrator.py

# 9. Update internal links between generated documents
python misc/update_markdown_links.py

# 10. Add GitHub location links to the generated documents
python misc/add_github_location_links.py

# 11. Clean up backticks around markdown links
python misc/cleanup_link_backticks.py

AI Documentation Generation

Integration of Document Generation AI & original MCP tool (Roles of orchestrator.py / mcp_tool.py)

  • scripts/orchestrator.py
    The agent creates Markdown (.md) files in the output/temp directory via mcp_tool.py.
    orchestrator.py extracts the contents of these md files in output/temp and adds them to the documents table in global_symbols.db, enabling AI-based documentation for each symbol.

  • scripts/mcp_tool.py (Tool Interface for the AI Agent)
    mcp_tool.py implements the tool-side interface (e.g., Model Context Protocol style) that the AI agent invokes to obtain exactly the contextual information required to write a symbol's documentation. Its responsibilities typically include:

    • Fetching symbol definition records (name, file path, source snippet, start/end lines) from the DuckDB database.
    • Supplying reference relationships (incoming/outgoing dependencies) so the agent can describe usage patterns.
    • Providing cluster/batch metadata (e.g., related symbols processed together) to encourage consistent narrative across a group of symbols.
    • Returning any existing previously generated documents (for incremental refinement or avoidance of duplication).
    • Accepting newly generated Markdown content from the agent and writing it to output/temp/*.md so orchestrator.py can ingest it into the documents table.
    • (Optionally) Logging tool invocations or marking progress so that interrupted runs can resume.

    Conceptual interaction flow:

    1. Orchestrator selects next symbol (or batch) from planning data (data/processing_batches.json).
    2. AI agent requests context via exposed tool methods in mcp_tool.py (e.g., get_symbol_definition, get_references, get_cluster_context).
    3. AI agent generates a Markdown document and calls a write/persist method in mcp_tool.py to store it under output/temp/.
    4. Orchestrator later scans output/temp/ and imports finalized content into DuckDB.

    This separation keeps the agent stateless regarding storage details, while mcp_tool.py abstracts data retrieval and persistence. If you extend documentation scope (e.g., add performance notes, code metrics), you can augment mcp_tool.py with new retrieval endpoints without modifying the orchestration logic.
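
    A conceptual sketch of such tool methods is shown below. The method names follow the hypothetical examples above (get_symbol_definition, etc.) and do not claim to match the actual mcp_tool.py API:

import duckdb

def get_symbol_definition(symbol_name):
    """Hypothetical context tool: fetch one symbol's definition record."""
    con = duckdb.connect("global_symbols.db", read_only=True)
    row = con.execute(
        "SELECT symbol_name, file_path, line_num_start, line_num_end, contents "
        "FROM symbol_definitions WHERE symbol_name = ?",
        [symbol_name],
    ).fetchone()
    if row is None:
        return None
    keys = ["symbol_name", "file_path", "line_num_start", "line_num_end", "contents"]
    return dict(zip(keys, row))

def persist_document(symbol_name, markdown):
    """Hypothetical persistence tool: write generated Markdown to output/temp/."""
    with open(f"output/temp/{symbol_name}.md", "w") as f:
        f.write(markdown)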

  • Progress Management and Logging
    Progress of documentation generation/import and batch status logs by orchestrator.py are recorded in metadata.duckdb.

Example: Registration Flow from md Files under output/temp to the documents Table

  1. Markdown Generation by AI Agent
    • mcp_tool.py provides context; the agent creates md files for each symbol/file in output/temp/
  2. DB Registration by orchestrator.py
    • Searches for output/temp/*.md, extracts contents
    • Adds contents to the documents table in global_symbols.db
  3. Progress Logging
    • Records processing status and error information for each batch/file in metadata.duckdb

DuckDB Table Schema & Data Flow Overview

global_symbols.db

The main database containing symbol information from the PostgreSQL source code.

symbol_definitions table

Main table for symbol definitions. Base data for all scripts.

Column Type Description
id INTEGER PRIMARY KEY Unique identifier for each symbol
symbol_name VARCHAR NOT NULL Name of the symbol
file_path VARCHAR NOT NULL Path to the file containing the symbol
line_num_start INTEGER NOT NULL Starting line number of the symbol definition
line_num_end INTEGER NOT NULL DEFAULT 0 Ending line number of the symbol definition
line_content VARCHAR NOT NULL Content of the line where symbol is defined
contents VARCHAR DEFAULT '' Full content of the symbol definition
symbol_type VARCHAR Type of symbol (added by update_symbol_types.py)

symbol_reference table

Records reference relationships between symbols (created by import_symbol_reference.py).

Column Type Description
from_node INTEGER NOT NULL ID of the source symbol
to_node INTEGER NOT NULL ID of the target symbol
line_num_in_from INTEGER NOT NULL Line number in source file where reference occurs
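
Together with symbol_definitions, this table can answer questions such as "who calls this symbol?". The following is a sketch using the column names above; the sample symbol is illustrative:

import duckdb

con = duckdb.connect("global_symbols.db", read_only=True)
callers = con.execute("""
    SELECT src.symbol_name, src.file_path, r.line_num_in_from
    FROM symbol_reference r
    JOIN symbol_definitions src ON src.id = r.from_node
    JOIN symbol_definitions dst ON dst.id = r.to_node
    WHERE dst.symbol_name = ?
    ORDER BY src.file_path, r.line_num_in_from
""", ["ExecProcNode"]).fetchall()
for name, path, line in callers:
    print(f"{name} at {path}:{line}")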

data/metadata.duckdb

Manages metadata for AI documentation generation process (created by scripts/prepare_cluster.py).

symbols table

Copy of symbol information with additional metadata.

Column Type Description
id INTEGER PRIMARY KEY Symbol ID (same as in global_symbols.db)
symbol_name VARCHAR Name of the symbol
symbol_type VARCHAR Type of symbol (f=function, s=struct, v=variable)
file_path VARCHAR Path to the file containing the symbol
module VARCHAR Module/component the symbol belongs to
start_line INTEGER Starting line number
end_line INTEGER Ending line number
layer INTEGER Dependency layer (0=base, higher=dependent)
cluster_id INTEGER ID of the cluster this symbol belongs to
created_at TIMESTAMP When the record was created

dependencies table

Stores dependency relationships (created by scripts/prepare_cluster.py).

Column Type Description
from_node INTEGER Source symbol ID
to_node INTEGER Target symbol ID
PRIMARY KEY (from_node, to_node) Composite key ensuring unique relationships

clusters table

Stores information about symbol clusters for batch processing.

Column Type Description
cluster_id INTEGER PRIMARY KEY Unique cluster identifier
cluster_type VARCHAR Type of cluster
layer INTEGER Dependency layer
symbols JSON Array of symbol IDs in this cluster
estimated_tokens INTEGER Estimated token count for processing
created_at TIMESTAMP When the cluster was created

data/documents.duckdb

Stores generated documentation (created by scripts/orchestrator.py).

documents table

Contains AI-generated documentation for each symbol.

Column Type Description
symbol_id INTEGER PRIMARY KEY Symbol ID
symbol_name VARCHAR Name of the symbol
symbol_type VARCHAR Type of symbol
layer INTEGER Dependency layer
content TEXT Full markdown documentation
summary TEXT Brief summary of the symbol
dependencies JSON List of dependencies
related_symbols JSON List of related symbols
created_at TIMESTAMP When document was created
updated_at TIMESTAMP When document was last updated
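
A generated document can be pulled back out of the database with a few lines. This is a minimal sketch based on the schema above; the sample symbol name is illustrative:

import duckdb

con = duckdb.connect("data/documents.duckdb", read_only=True)
row = con.execute(
    "SELECT symbol_name, content FROM documents WHERE symbol_name = ?",
    ["heap_insert"],
).fetchone()
if row is not None:
    name, content = row
    with open(f"{name}.md", "w") as f:
        f.write(content)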

processing_log table

Tracks the progress of documentation generation batches.

Column Type Description
batch_id INTEGER PRIMARY KEY Unique batch identifier
symbol_ids JSON Array of symbol IDs in this batch
status VARCHAR Processing status (pending/processing/completed/failed)
started_at TIMESTAMP When processing started
completed_at TIMESTAMP When processing completed
error_message TEXT Error message if failed
processed_count INTEGER Number of symbols processed
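
This table also makes it easy to find batches that need to be re-run. The following is a sketch based on the schema above; note that some parts of this README attribute progress logging to metadata.duckdb, so adjust the database path if needed:

import duckdb

con = duckdb.connect("data/documents.duckdb", read_only=True)
failed = con.execute(
    "SELECT batch_id, error_message FROM processing_log WHERE status = 'failed'"
).fetchall()
for batch_id, message in failed:
    print(f"batch {batch_id} failed: {message}")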

Generated Document Examples

PostgreSQL Codebase Navigation

  • ENTRY_POINTS.md : Lists the entry-point functions of each PostgreSQL process (AI generated)
  • CODE_TREE.md : Overview of PostgreSQL's code tree structure (AI generated)

Advanced Topics

Topic-Specific Documentation Generation: WAL System (an example of ongoing experimentation)

The comprehensive PostgreSQL WAL (Write-Ahead Logging) system documentation in topic_specific_generated_docs/about_wal/ was generated using an AI-powered multi-agent orchestration approach. This section describes the generation process and mechanisms involved.

Overview of Generation Process

The documentation was created by orchestrating three specialized AI subagents through Claude Code's interactive mode, leveraging an MCP (Model Context Protocol) server for PostgreSQL symbol information access.

Key Components

1. MCP Server Setup (any-script-mcp)

The MCP server provides programmatic access to PostgreSQL symbol information through specialized tools:

  • Installation: Clone any-script-mcp into the project directory
  • Configuration: Run add_any_script_mcp_to_cc.sh (in the cloned any-script-mcp repository directory) to register the MCP server
  • Settings: User-scope registration, with configuration stored at ~/.config/any-script-mcp/config.yaml
  • Tools Directory: Custom tool configurations in any-script-mcp/config.yaml
    • This file is the same as ~/.config/any-script-mcp/config.yaml

The MCP server exposes PostgreSQL-specific tools:

  • pg_symbol_source(symbol) - Retrieve source code for symbols
  • pg_symbol_overview(symbol) - Get concise symbol overviews
  • pg_symbol_document(symbol) - Fetch detailed documentation
  • pg_references_from(symbol) - List symbols referenced by a given symbol
  • pg_references_to(symbol) - List symbols that reference a given symbol

2. AI Subagents (.claude/agents/)

Three specialized subagents handle different aspects of documentation generation:

  1. architecture-analyzer - Analyzes PostgreSQL codebase structure and builds dependency maps
  2. detail-documenter - Creates detailed technical documentation for individual components
  3. integration-optimizer - Integrates component documentation into cohesive final documents

Each subagent is defined with specific capabilities and prompts in the .claude/agents/ directory.

3. Orchestration Prompt (prompts/generate_document_about_wal.md)

The main orchestration prompt coordinates the three-stage documentation generation process:

Stage 1: Architecture Analysis

  • Analyzes WAL subsystem starting from key entry points (XLogInsert, WalSndLoop, WalReceiverMain, StartupXLOG, etc.)
  • Builds comprehensive dependency maps with depth-3 traversal
  • Generates architecture_map.json, key_symbols.txt, and initial_outline.md

Stage 2: Detailed Documentation

  • Documents all symbols with importance scores > 0.8
  • Creates component-specific documentation files
  • Generates minimum 5 technical diagrams (Mermaid format)
  • Focuses on critical areas like XLogRecord structure, LSN management, and replication mechanics

Stage 3: Integration and Optimization

  • Merges component documentation into unified documents
  • Eliminates redundancy while maintaining clarity
  • Adds cross-references and navigation aids
  • Produces final deliverables including:
    • wal_complete_documentation.md - Comprehensive technical guide
    • wal_quick_reference.md - 2-page developer reference
    • wal_api_reference.md - API signatures catalog
    • quality_report.md - Coverage metrics and analysis

Generation Steps

  1. Prerequisites: Ensure MCP server is configured and Claude Code has access to subagent definitions

  2. Execution:

    • Open Claude Code in interactive mode
    • Copy the content from prompts/generate_document_about_wal.md
    • Paste as a prompt in Claude Code
    • The orchestrator automatically executes all three stages sequentially
  3. Output: Generated documentation is saved to the project root directory...

Generated Documentation Structure

The final WAL documentation includes:

  • Core Documentation: comprehensive technical guide covering all WAL components
  • Component Files: Separate documentation for generation, writing, replication, and recovery subsystems
  • Technical Diagrams: 5+ Mermaid diagrams illustrating architecture and data flows
  • Reference Materials: Quick reference guide and API documentation

Benefits of This Approach

  • Agents Can Use Pre-Generated Knowledge: Pre-generated knowledge reduces on-the-fly context consumption, which may make the output better :)
  • Systematic Coverage: Ensures all critical symbols and paths are documented
  • Consistency: Standardized terminology and structure across all documentation
  • Depth and Breadth: Combines high-level architecture views with detailed implementation specifics
  • Maintainability: Modular generation allows for incremental updates
  • Quality Assurance: Built-in validation and coverage metrics

This multi-agent approach demonstrates how AI can be effectively orchestrated to generate comprehensive, professional-quality technical documentation for complex systems like PostgreSQL's WAL subsystem.

Additional Resources

Other Resources for Understanding the PostgreSQL Code Tree (not my work)
