create_pg_super_document

Table of Contents

  • Overview
  • Disclaimer
  • Prerequisites
  • Installation & Setup
  • Getting Started
  • Directory & File Structure
  • GNU GLOBAL Installation and Usage
  • Recommended Execution Flow
  • AI Documentation Generation
  • DuckDB Table Schema & Data Flow Overview
  • Generated Document Examples
  • Advanced Topics
  • Additional Resources

Overview

create_pg_super_document is a project that generates documentation for all symbols in the PostgreSQL codebase, then utilizes these symbol documents to create high-quality documentation for functionalities, components, and other elements using AI agents.

This repository provides scripts for generating symbol documentation and includes the generated data in multiple formats (Markdown files, DuckDB database files, JSON files, CSV files, etc.).

Additionally, the high-quality documents created using this data are also available in this repository!

Disclaimer

Documents are generated by LLM-based AI agents, so accuracy is not guaranteed. Please verify the generated documents against the actual source code.

Prerequisites

Assumed PostgreSQL Code Tree

The data included in this repository was generated from the PostgreSQL source tree at commit 92268b35d04c2de416279f187d12f264afa22614 (see "2. Obtain the PostgreSQL Source Tree" below).

Dependencies

  • Python 3.x (see the python_version file for the specific version)
  • GNU GLOBAL (gtags)
  • Python packages listed in requirements.txt

Installation & Setup

Install required packages:

pip install -r requirements.txt

(Install GNU GLOBAL first, as described below, if it is not yet installed.)

Getting Started

Before running the scripts in this repository, it is recommended to create the following directories (skip if they already exist):

mkdir -p data
mkdir -p output/temp
  • data/ ... For storing AI documentation generation batches, DBs, and various metadata
  • output/ ... Storage for generated and temporary files (e.g., symbol_references.csv)
  • output/temp/ ... Storage for temporary intermediate files and in-process results.
    Some scripts use output/temp as a working directory.

Directory & File Structure

  • misc/ ... Miscellaneous files (e.g., pre/post data processing scripts)
    • extract_readme_file_header_comments.py ... Extract README header comments
    • extract_symbol_references.py ... Extract symbol reference relationships
    • import_symbol_reference.py ... Import symbol reference information
    • process_symbol_definitions.py ... Process symbol definition information
    • filter_frequent_symbol_from_csv.py ... Filter frequent symbols in CSV
    • set_file_end_lines.py ... Set file end lines
    • update_symbol_types.py ... Update symbol type information
    • create_duckdb_index.py ... Create DuckDB index
    • update_markdown_links.py ... Update links in markdown files to create cross-references between generated documents
    • add_github_location_links.py ... Add GitHub links to the location sections in generated markdown documents
    • cleanup_link_backticks.py ... Clean up backticks surrounding markdown links in generated documents
  • requirements.txt ... Required packages
  • scripts/ ... Auxiliary scripts
    • prepare_cluster.py ... Symbol clustering and AI batch preparation (must be run before documentation generation)
    • orchestrator.py ... High-level coordination of AI documentation generation
    • mcp_tool.py ... Tool interface (Model Context Protocol) used by the AI agent to retrieve context and persist generated documents
  • ENTRY_POINTS.md ... Entry point explanations (AI generated)
  • CODE_TREE.md ... Overview of PostgreSQL's code tree structure (AI generated)

GNU GLOBAL Installation and Usage

In the initial stage of this project, we build symbol definition and reference indices from the PostgreSQL source tree using GNU GLOBAL (gtags). The resulting data is then imported and refined into DuckDB.
Follow the steps below (do this before running the Python scripts).

1. Install GNU GLOBAL

Common installation methods:

macOS (Homebrew)

brew install global

Debian / Ubuntu

sudo apt-get update
sudo apt-get install -y global exuberant-ctags

(If you already use universal-ctags instead of exuberant-ctags, that is fine.)

From Source

wget https://ftp.gnu.org/pub/gnu/global/global-latest.tar.gz
tar xf global-latest.tar.gz
cd global-*/
./configure --prefix=/usr/local
make -j$(nproc)
sudo make install
global --version

2. Obtain the PostgreSQL Source Tree

(If already cloned, skip.)

git clone https://github.com/postgres/postgres.git
cd postgres
# Checkout the specific commit used for the data included in this repository (optional)
git checkout 92268b35d04c2de416279f187d12f264afa22614

3. Create GNU GLOBAL Index Files

Run gtags at the root of the PostgreSQL source directory.

cd /path/to/postgres
gtags --gtagslabel=ctags

Generated key files:

  • GTAGS : Definitions database
  • GRTAGS : References database
  • GPATH : Path mapping

Incremental update after source changes:

gtags --incremental

If you manage multiple branches or trees on the same machine, you can isolate object directories:

export GTAGSOBJDIRPREFIX=$HOME/.gtags/objs
gtags

(Each source root will then get its own object directory automatically.)

4. Basic Usage of the global Command

global queries the index to locate symbol definitions and references.

Frequently used options:

  • Definitions: global -d SYMBOL
  • References (callers, usages): global -r SYMBOL
  • Cross reference with line numbers: global -x SYMBOL
  • Prefix completion: global -c PREFIX
  • Symbols not defined in GTAGS (referenced-only symbols): global -s SYMBOL
  • List symbols defined in a file: global -f path/to/file.c
  • Regex match across tags: global -a -e 'Exec[A-Za-z0-9_]+'
  • Detailed reference lines (used in this repo): global -rx SYMBOL

Examples:

# Definition of heap_insert
global -x heap_insert

# All places referencing ExecProcNode
global -rx ExecProcNode

# List all symbols in parse_expr.c
global -f src/backend/parser/parse_expr.c

5. Mapping to This Repository's Scripts

  • create_duckdb_index.py ... uses the GTAGS definitions database; prerequisite: run gtags in the PostgreSQL root beforehand
  • extract_symbol_references.py ... uses global -rx SYMBOL (requires GTAGS/GRTAGS/GPATH); prerequisite: same as above
  • filter_frequent_symbol_from_csv.py ... processes the CSV produced upstream; prerequisite: the upstream CSV exists
  • import_symbol_reference.py ... imports the filtered reference CSV; prerequisite: DuckDB DB + CSV
  • Other update/processing scripts ... operate on the DuckDB tables; prerequisite: create_duckdb_index.py completed

When running create_duckdb_index.py, pass the PostgreSQL source directory path as the argument. That directory must already contain the generated GTAGS files.

6. Troubleshooting

  • global: GTAGS not found. ... index not created; run gtags at the source root
  • Empty reference results ... macro expansion or an alternative symbol name; try related identifiers, adjust the label, or verify the commit
  • Outdated results ... source changed without reindexing; run gtags --incremental or regenerate
  • High memory usage ... repeated full-tree reindexing; limit the scope (invoke on a subset) or ensure adequate RAM

Recommended Execution Flow

The scripts in this project incrementally build and process symbol information from the PostgreSQL code tree into a DuckDB database, then generate documentation using AI agents.
The processing targets and recommended execution order for each script are as follows:

1. Extract Symbol Definition Information & Build DB

  • create_duckdb_index.py
    Based on the index output from GNU GLOBAL, creates and stores the symbol_definitions table in global_symbols.db.
    • Main columns:
      • id (primary key), symbol_name, file_path, line_num_start, line_num_end, line_content, contents
    • Must be run first to import all symbol information into the DB.
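
As a quick sanity check after this step, you can inspect the imported rows with DuckDB's Python API. The following is a minimal sketch: the database and table names come from this README, while the database path and the sample symbol name (heap_insert) are illustrative, so adjust them to your setup.

import duckdb

# Open the database produced by create_duckdb_index.py
# (adjust the path if your global_symbols.db lives elsewhere)
con = duckdb.connect("global_symbols.db", read_only=True)

# Count the imported symbol definitions
total = con.execute("SELECT COUNT(*) FROM symbol_definitions").fetchone()[0]
print(f"imported symbols: {total}")

# Look up one symbol by name (heap_insert is just an example)
rows = con.execute(
    "SELECT id, symbol_name, file_path, line_num_start "
    "FROM symbol_definitions WHERE symbol_name = ?",
    ["heap_insert"],
).fetchall()
print(rows)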

2. Set File End Lines

  • set_file_end_lines.py
    Sets the range (line_num_end) for each symbol definition within the file.
    • Target: symbol_definitions table (updates the line_num_end column)

3. Extract & Organize Symbol Reference Relationships

  • extract_symbol_references.py
    Uses global -rx for each symbol in symbol_definitions to output reference relationships as symbol_references.csv.

  • filter_frequent_symbol_from_csv.py
    Filters frequent symbols and unnecessary references to generate symbol_references_filtered.csv.

  • import_symbol_reference.py
    Imports symbol_references_filtered.csv into the symbol_reference table in global_symbols.db.

    • Columns:
      • from_node (source symbol ID), to_node (target symbol ID), line_num_in_from
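
For illustration, the core of the extraction step can be sketched as follows: invoke global -rx for each symbol and parse the ctags-x style output (tag name, line number, file path, line content). This is a minimal sketch, not the actual implementation of extract_symbol_references.py.

import subprocess

def references_to(symbol, source_root):
    """Run `global -rx SYMBOL` and yield (file_path, line_num, line_content)."""
    result = subprocess.run(
        ["global", "-rx", symbol],
        cwd=source_root,  # must contain the GTAGS/GRTAGS/GPATH files
        capture_output=True,
        text=True,
    )
    for line in result.stdout.splitlines():
        # ctags-x format: <tag> <line number> <path> <line content>
        parts = line.split(None, 3)
        if len(parts) == 4:
            _tag, line_num, path, content = parts
            yield path, int(line_num), content

# Example: list all references to ExecProcNode
for ref in references_to("ExecProcNode", "/path/to/postgres"):
    print(ref)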

4. Automatic Assignment of Symbol Type Information

  • update_symbol_types.py
    Adds a symbol_type column to the symbol_definitions table and automatically estimates and records each symbol's type (function, variable, type, etc.).

5. Additional Processing & Deduplication of Symbol Definitions

  • process_symbol_definitions.py
    Cleans up duplicate definitions and unnecessary data, outputs statistics, etc.

6. Symbol Clustering & Batch Preparation (Required Before AI Documentation Generation)

  • scripts/prepare_cluster.py
    Automatically clusters symbols based on dependencies and prepares batches (e.g., data/processing_batches.json) for AI documentation generation.
    Be sure to run this step.
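
After this step, the clusters that were created can be inspected in data/metadata.duckdb. This is a minimal sketch based on the clusters table schema documented below:

import duckdb

con = duckdb.connect("data/metadata.duckdb", read_only=True)
rows = con.execute(
    "SELECT cluster_id, cluster_type, layer, estimated_tokens "
    "FROM clusters ORDER BY layer, cluster_id LIMIT 10"
).fetchall()
for row in rows:
    print(row)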

7. Documentation Generation & AI Integration

  • scripts/orchestrator.py
    Orchestrates automatic documentation generation and management by AI (e.g., Claude Code) based on symbol, reference, and cluster information in DuckDB.
    • Also supports batch processing plans via data/processing_batches.json

Example Execution (Typical Flow)

# 0. (Prerequisite) Prepare GNU GLOBAL index inside PostgreSQL source tree
#    cd /path/to/postgres && gtags --gtagslabel=ctags

# 1. Import symbol definitions into DB
python misc/create_duckdb_index.py <source_directory>

# 2. Complete symbol range information
python misc/set_file_end_lines.py

# 3. Extract and filter reference relationships
python misc/extract_symbol_references.py
python misc/filter_frequent_symbol_from_csv.py

# 4. Import reference information into DB
python misc/import_symbol_reference.py

# 5. Assign symbol type information
python misc/update_symbol_types.py

# 6. Deduplication and final processing
python misc/process_symbol_definitions.py

# 7. Symbol clustering and batch generation (required)
python scripts/prepare_cluster.py

# 8. AI-based documentation generation (tool-assisted)
python scripts/orchestrator.py

# 9. Update internal links between generated documents
python misc/update_markdown_links.py

# 10. Add GitHub location links to the generated documents
python misc/add_github_location_links.py

# 11. Clean up backticks around markdown links
python misc/cleanup_link_backticks.py

AI Documentation Generation

Integration of Document Generation AI & original MCP tool (Roles of orchestrator.py / mcp_tool.py)

  • scripts/orchestrator.py
    The agent creates Markdown (.md) files in the output/temp directory via mcp_tool.py.
    orchestrator.py extracts the contents of these md files in output/temp and adds them to the documents table in global_symbols.db, enabling AI-based documentation for each symbol.

  • scripts/mcp_tool.py (Tool Interface for the AI Agent)
    mcp_tool.py implements the tool-side interface (e.g., Model Context Protocol style) that the AI agent invokes to obtain exactly the contextual information required to write a symbol's documentation. Its responsibilities typically include:

    • Fetching symbol definition records (name, file path, source snippet, start/end lines) from the DuckDB database.
    • Supplying reference relationships (incoming/outgoing dependencies) so the agent can describe usage patterns.
    • Providing cluster/batch metadata (e.g., related symbols processed together) to encourage consistent narrative across a group of symbols.
    • Returning any existing previously generated documents (for incremental refinement or avoidance of duplication).
    • Accepting newly generated Markdown content from the agent and writing it to output/temp/*.md so orchestrator.py can ingest it into the documents table.
    • (Optionally) Logging tool invocations or marking progress so that interrupted runs can resume.

    Conceptual interaction flow:

    1. Orchestrator selects next symbol (or batch) from planning data (data/processing_batches.json).
    2. AI agent requests context via exposed tool methods in mcp_tool.py (e.g., get_symbol_definition, get_references, get_cluster_context).
    3. AI agent generates a Markdown document and calls a write/persist method in mcp_tool.py to store it under output/temp/.
    4. Orchestrator later scans output/temp/ and imports finalized content into DuckDB.

    This separation keeps the agent stateless regarding storage details, while mcp_tool.py abstracts data retrieval and persistence. If you extend documentation scope (e.g., add performance notes, code metrics), you can augment mcp_tool.py with new retrieval endpoints without modifying the orchestration logic.
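
    A conceptual sketch of such tool methods is shown below. The method names follow the hypothetical examples above (get_symbol_definition, etc.) and do not claim to match the actual mcp_tool.py API:

import duckdb

def get_symbol_definition(symbol_name):
    """Hypothetical context tool: fetch one symbol's definition record."""
    con = duckdb.connect("global_symbols.db", read_only=True)
    row = con.execute(
        "SELECT symbol_name, file_path, line_num_start, line_num_end, contents "
        "FROM symbol_definitions WHERE symbol_name = ?",
        [symbol_name],
    ).fetchone()
    if row is None:
        return None
    keys = ["symbol_name", "file_path", "line_num_start", "line_num_end", "contents"]
    return dict(zip(keys, row))

def persist_document(symbol_name, markdown):
    """Hypothetical persistence tool: write generated Markdown to output/temp/."""
    with open(f"output/temp/{symbol_name}.md", "w") as f:
        f.write(markdown)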

  • Progress Management and Logging
    Progress of documentation generation/import and batch status logs by orchestrator.py are recorded in metadata.duckdb.

Example: Registration Flow from md Files under output/temp to the documents Table

  1. Markdown Generation by AI Agent
    • mcp_tool.py provides context; the agent creates md files for each symbol/file in output/temp/
  2. DB Registration by orchestrator.py
    • Searches for output/temp/*.md, extracts contents
    • Adds contents to the documents table in global_symbols.db
  3. Progress Logging
    • Records processing status and error information for each batch/file in metadata.duckdb

DuckDB Table Schema & Data Flow Overview

global_symbols.db

The main database containing symbol information from the PostgreSQL source code.

symbol_definitions table

Main table for symbol definitions. Base data for all scripts.

Column Type Description
id INTEGER PRIMARY KEY Unique identifier for each symbol
symbol_name VARCHAR NOT NULL Name of the symbol
file_path VARCHAR NOT NULL Path to the file containing the symbol
line_num_start INTEGER NOT NULL Starting line number of the symbol definition
line_num_end INTEGER NOT NULL DEFAULT 0 Ending line number of the symbol definition
line_content VARCHAR NOT NULL Content of the line where symbol is defined
contents VARCHAR DEFAULT '' Full content of the symbol definition
symbol_type VARCHAR Type of symbol (added by update_symbol_types.py)

symbol_reference table

Records reference relationships between symbols (created by import_symbol_reference.py).

Column Type Description
from_node INTEGER NOT NULL ID of the source symbol
to_node INTEGER NOT NULL ID of the target symbol
line_num_in_from INTEGER NOT NULL Line number in source file where reference occurs
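
Together with symbol_definitions, this table can answer questions such as "who calls this symbol?". The following is a sketch using the column names above; the sample symbol is illustrative:

import duckdb

con = duckdb.connect("global_symbols.db", read_only=True)
callers = con.execute("""
    SELECT src.symbol_name, src.file_path, r.line_num_in_from
    FROM symbol_reference r
    JOIN symbol_definitions src ON src.id = r.from_node
    JOIN symbol_definitions dst ON dst.id = r.to_node
    WHERE dst.symbol_name = ?
    ORDER BY src.file_path, r.line_num_in_from
""", ["ExecProcNode"]).fetchall()
for name, path, line in callers:
    print(f"{name} at {path}:{line}")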

data/metadata.duckdb

Manages metadata for AI documentation generation process (created by scripts/prepare_cluster.py).

symbols table

Copy of symbol information with additional metadata.

Column Type Description
id INTEGER PRIMARY KEY Symbol ID (same as in global_symbols.db)
symbol_name VARCHAR Name of the symbol
symbol_type VARCHAR Type of symbol (f=function, s=struct, v=variable)
file_path VARCHAR Path to the file containing the symbol
module VARCHAR Module/component the symbol belongs to
start_line INTEGER Starting line number
end_line INTEGER Ending line number
layer INTEGER Dependency layer (0=base, higher=dependent)
cluster_id INTEGER ID of the cluster this symbol belongs to
created_at TIMESTAMP When the record was created

dependencies table

Stores dependency relationships (created by scripts/prepare_cluster.py).

Column Type Description
from_node INTEGER Source symbol ID
to_node INTEGER Target symbol ID
PRIMARY KEY (from_node, to_node) Composite key ensuring unique relationships

clusters table

Stores information about symbol clusters for batch processing.

Column Type Description
cluster_id INTEGER PRIMARY KEY Unique cluster identifier
cluster_type VARCHAR Type of cluster
layer INTEGER Dependency layer
symbols JSON Array of symbol IDs in this cluster
estimated_tokens INTEGER Estimated token count for processing
created_at TIMESTAMP When the cluster was created

data/documents.duckdb

Stores generated documentation (created by scripts/orchestrator.py).

documents table

Contains AI-generated documentation for each symbol.

Column Type Description
symbol_id INTEGER PRIMARY KEY Symbol ID
symbol_name VARCHAR Name of the symbol
symbol_type VARCHAR Type of symbol
layer INTEGER Dependency layer
content TEXT Full markdown documentation
summary TEXT Brief summary of the symbol
dependencies JSON List of dependencies
related_symbols JSON List of related symbols
created_at TIMESTAMP When document was created
updated_at TIMESTAMP When document was last updated
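
A generated document can be pulled back out of the database with a few lines. This is a minimal sketch based on the schema above; the sample symbol name is illustrative:

import duckdb

con = duckdb.connect("data/documents.duckdb", read_only=True)
row = con.execute(
    "SELECT symbol_name, content FROM documents WHERE symbol_name = ?",
    ["heap_insert"],
).fetchone()
if row is not None:
    name, content = row
    with open(f"{name}.md", "w") as f:
        f.write(content)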

processing_log table

Tracks the progress of documentation generation batches.

Column Type Description
batch_id INTEGER PRIMARY KEY Unique batch identifier
symbol_ids JSON Array of symbol IDs in this batch
status VARCHAR Processing status (pending/processing/completed/failed)
started_at TIMESTAMP When processing started
completed_at TIMESTAMP When processing completed
error_message TEXT Error message if failed
processed_count INTEGER Number of symbols processed
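
This table also makes it easy to find batches that need to be re-run. The following is a sketch based on the schema above; note that some parts of this README attribute progress logging to metadata.duckdb, so adjust the database path if needed:

import duckdb

con = duckdb.connect("data/documents.duckdb", read_only=True)
failed = con.execute(
    "SELECT batch_id, error_message FROM processing_log WHERE status = 'failed'"
).fetchall()
for batch_id, message in failed:
    print(f"batch {batch_id} failed: {message}")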

Generated Document Examples

PostgreSQL Codebase Navigation

  • ENTRY_POINTS.md : Lists the entry-point functions of each PostgreSQL process (AI generated)
  • CODE_TREE.md : Overview of PostgreSQL's code tree structure (AI generated)

Advanced Topics

Topic-Specific Documentation Generation: WAL System (an example of ongoing experimentation)

The comprehensive PostgreSQL WAL (Write-Ahead Logging) system documentation in topic_specific_generated_docs/about_wal/ was generated using an AI-powered multi-agent orchestration approach. This section describes the generation process and mechanisms involved.

Overview of Generation Process

The documentation was created by orchestrating three specialized AI subagents through Claude Code's interactive mode, leveraging an MCP (Model Context Protocol) server for PostgreSQL symbol information access.

Key Components

1. MCP Server Setup (any-script-mcp)

The MCP server provides programmatic access to PostgreSQL symbol information through specialized tools:

  • Installation: Clone any-script-mcp into the project directory
  • Configuration: Run add_any_script_mcp_to_cc.sh (in the cloned any-script-mcp repository directory) to register the MCP server
  • Settings: User-scope registration, with configuration stored at ~/.config/any-script-mcp/config.yaml
  • Tools Directory: Custom tool configurations in any-script-mcp/config.yaml
    • This file is the same as ~/.config/any-script-mcp/config.yaml

The MCP server exposes PostgreSQL-specific tools:

  • pg_symbol_source(symbol) - Retrieve source code for symbols
  • pg_symbol_overview(symbol) - Get concise symbol overviews
  • pg_symbol_document(symbol) - Fetch detailed documentation
  • pg_references_from(symbol) - List symbols referenced by a given symbol
  • pg_references_to(symbol) - List symbols that reference a given symbol

2. AI Subagents (.claude/agents/)

Three specialized subagents handle different aspects of documentation generation:

  1. architecture-analyzer - Analyzes PostgreSQL codebase structure and builds dependency maps
  2. detail-documenter - Creates detailed technical documentation for individual components
  3. integration-optimizer - Integrates component documentation into cohesive final documents

Each subagent is defined with specific capabilities and prompts in the .claude/agents/ directory.

3. Orchestration Prompt (prompts/generate_document_about_wal.md)

The main orchestration prompt coordinates the three-stage documentation generation process:

Stage 1: Architecture Analysis

  • Analyzes WAL subsystem starting from key entry points (XLogInsert, WalSndLoop, WalReceiverMain, StartupXLOG, etc.)
  • Builds comprehensive dependency maps with depth-3 traversal
  • Generates architecture_map.json, key_symbols.txt, and initial_outline.md

Stage 2: Detailed Documentation

  • Documents all symbols with importance scores > 0.8
  • Creates component-specific documentation files
  • Generates minimum 5 technical diagrams (Mermaid format)
  • Focuses on critical areas like XLogRecord structure, LSN management, and replication mechanics

Stage 3: Integration and Optimization

  • Merges component documentation into unified documents
  • Eliminates redundancy while maintaining clarity
  • Adds cross-references and navigation aids
  • Produces final deliverables including:
    • wal_complete_documentation.md - Comprehensive technical guide
    • wal_quick_reference.md - 2-page developer reference
    • wal_api_reference.md - API signatures catalog
    • quality_report.md - Coverage metrics and analysis

Generation Steps

  1. Prerequisites: Ensure MCP server is configured and Claude Code has access to subagent definitions

  2. Execution:

    • Open Claude Code in interactive mode
    • Copy the content from prompts/generate_document_about_wal.md
    • Paste as a prompt in Claude Code
    • The orchestrator automatically executes all three stages sequentially
  3. Output: Generated documentation is saved to the project root directory...

Generated Documentation Structure

The final WAL documentation includes:

  • Core Documentation: comprehensive technical guide covering all WAL components
  • Component Files: Separate documentation for generation, writing, replication, and recovery subsystems
  • Technical Diagrams: 5+ Mermaid diagrams illustrating architecture and data flows
  • Reference Materials: Quick reference guide and API documentation

Benefits of This Approach

  • Agents Can Use Pre-Generated Knowledge: Pre-generated knowledge reduces on-the-fly context consumption, which may make the output better :)
  • Systematic Coverage: Ensures all critical symbols and paths are documented
  • Consistency: Standardized terminology and structure across all documentation
  • Depth and Breadth: Combines high-level architecture views with detailed implementation specifics
  • Maintainability: Modular generation allows for incremental updates
  • Quality Assurance: Built-in validation and coverage metrics

This multi-agent approach demonstrates how AI can be effectively orchestrated to generate comprehensive, professional-quality technical documentation for complex systems like PostgreSQL's WAL subsystem.

Additional Resources

Other Resources for Understanding the PostgreSQL Code Tree (not my work)
