- Overview
- Disclaimer
- Prerequisites
- Installation & Setup
- Getting Started
- Directory & File Structure
- GNU GLOBAL Installation and Usage
- Recommended Execution Flow
- Example Execution
- AI Documentation Generation
- DuckDB Table Schema & Data Flow Overview
- Generated Document Examples
- PostgreSQL Codebase Navigation
- Advanced Topics
- Additional Resources
create_pg_super_document is a project that generates documentation for all symbols in the PostgreSQL codebase, then utilizes these symbol documents to create high-quality documentation for functionalities, components, and other elements using AI agents.
This repository provides scripts for generating symbol documentation and includes the generated data in multiple formats (Markdown files, DuckDB database files, JSON files, CSV files, etc.).
Additionally, the high-quality documents created using this data are also available here!
Documents are generated by LLM-based AI agents, so accuracy is not guaranteed. Please verify the generated documents against the actual source code.
- https://github.com/postgres/postgres/tree/92268b35d04c2de416279f187d12f264afa22614
- Other code trees should also work, but the DB and CSV files included in this repository correspond to this code tree.
- Python 3.x (see the `python_version` file for the specific version)
- GNU GLOBAL (gtags)
- Python packages listed in `requirements.txt`
Install required packages:

```bash
pip install -r requirements.txt
```

(Install GNU GLOBAL first, as described below, if it is not yet installed.)
Before running the scripts in this repository, it is recommended to create the following directories (skip if they already exist):

```bash
mkdir -p data
mkdir -p output/temp
```

- `data/` ... Storage for AI documentation generation batches, DBs, and various metadata
- `output/` ... Storage for generated and temporary files (e.g., `symbol_references.csv`)
- `output/temp/` ... Storage for temporary intermediate files and in-process results. Some scripts use `output/temp` as a working directory.
- `misc/` ... Miscellaneous files (e.g., pre/post data processing scripts)
  - `extract_readme_file_header_comments.py` ... Extract README header comments
  - `extract_symbol_references.py` ... Extract symbol reference relationships
  - `import_symbol_reference.py` ... Import symbol reference information
  - `process_symbol_definitions.py` ... Process symbol definition information
  - `filter_frequent_symbol_from_csv.py` ... Filter frequent symbols in the CSV
  - `set_file_end_lines.py` ... Set file end lines
  - `update_symbol_types.py` ... Update symbol type information
  - `create_duckdb_index.py` ... Create the DuckDB index
  - `update_markdown_links.py` ... Update links in markdown files to create cross-references between generated documents
  - `add_github_location_links.py` ... Add GitHub links to the location sections in generated markdown documents
  - `cleanup_link_backticks.py` ... Clean up backticks surrounding markdown links in generated documents
- `requirements.txt` ... Required packages
- `scripts/` ... Auxiliary scripts
  - `prepare_cluster.py` ... Symbol clustering and AI batch preparation (must be run before documentation generation)
  - `orchestrator.py` ... High-level coordination of AI documentation generation
  - `mcp_tool.py` ... Tool interface (Model Context Protocol) used by the AI agent to retrieve context and persist generated documents
- `ENTRY_POINTS.md` ... Entry point explanations (AI generated)
- `CODE_TREE.md` ... Overview of PostgreSQL's code tree structure (AI generated)
In the initial stage of this project, we build symbol definition and reference indices from the PostgreSQL source tree using GNU GLOBAL (gtags). The resulting data is then imported into DuckDB and refined there.
Follow the steps below (do this before running the Python scripts).
Common installation methods:

macOS (Homebrew):

```bash
brew install global
```

Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install -y global exuberant-ctags
```

(If you already use universal-ctags instead of exuberant-ctags, that is fine.)
Or build from source:

```bash
wget https://ftp.gnu.org/pub/gnu/global/global-latest.tar.gz
tar xf global-latest.tar.gz
cd global-*/
./configure --prefix=/usr/local
make -j$(nproc)
sudo make install
global --version
```
Clone the PostgreSQL source (skip if already cloned):

```bash
git clone https://github.com/postgres/postgres.git
cd postgres
# Check out the specific commit used for the data included in this repository (optional)
git checkout 92268b35d04c2de416279f187d12f264afa22614
```

Run gtags at the root of the PostgreSQL source directory:
```bash
cd /path/to/postgres
gtags --gtagslabel=ctags
```

Generated key files:

- `GTAGS`: Definitions database
- `GRTAGS`: References database
- `GPATH`: Path mapping
Incremental update after source changes:

```bash
gtags --incremental
```

If you manage multiple branches or trees on the same machine, you can isolate object directories:

```bash
export GTAGSOBJDIRPREFIX=$HOME/.gtags/objs
gtags
```

(Each source root will then get its own object directory automatically.)
global queries the index to locate symbol definitions and references.
Frequently used options:
- Definitions: `global -d SYMBOL`
- References (callers, usages): `global -r SYMBOL`
- Cross reference with line numbers: `global -x SYMBOL`
- Partial/substring match: `global -c PATTERN` or `global -s PATTERN`
- List symbols defined in a file: `global -f path/to/file.c`
- Regex match across tags: `global -a -e 'Exec[A-Za-z0-9_]+'`
- Detailed reference lines (used in this repo): `global -rx SYMBOL`
Examples:
```bash
# Definition of heap_insert
global -x heap_insert

# All places referencing ExecProcNode
global -rx ExecProcNode

# List all symbols in parse_expr.c
global -f src/backend/parser/parse_expr.c
```

| Script | Uses GNU GLOBAL Output | Prerequisite |
|---|---|---|
| `create_duckdb_index.py` | `GTAGS` (definitions) | Run `gtags` in the PostgreSQL root beforehand |
| `extract_symbol_references.py` | `global -rx SYMBOL`, requiring `GTAGS`/`GRTAGS`/`GPATH` | Same |
| `filter_frequent_symbol_from_csv.py` | Processes CSV produced upstream | Upstream CSV exists |
| `import_symbol_reference.py` | Filtered reference CSV | DuckDB DB + CSV |
| Other update/processing scripts | DuckDB tables | `create_duckdb_index.py` completed |
When running `create_duckdb_index.py`, pass the PostgreSQL source directory path as the argument. That directory must already contain the generated `GTAGS` files.
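As an illustration of how the reference data is gathered (a conceptual sketch only, not the actual `extract_symbol_references.py`), `global -rx` can be driven from Python and its ctags-x style output parsed line by line:

```python
# Conceptual sketch: collect reference locations for a symbol via GNU GLOBAL.
import subprocess


def find_references(symbol, source_root):
    """Run `global -rx SYMBOL` in an indexed source tree and parse the output."""
    result = subprocess.run(
        ["global", "-rx", symbol],
        cwd=source_root,            # must contain GTAGS/GRTAGS/GPATH
        capture_output=True,
        text=True,
        check=False,
    )
    references = []
    for line in result.stdout.splitlines():
        # ctags-x style line: "<symbol> <line-number> <path> <source line>"
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue
        references.append({
            "symbol": parts[0],
            "line_num": int(parts[1]),
            "file_path": parts[2],
            "line_content": parts[3] if len(parts) > 3 else "",
        })
    return references


# Example: all places referencing ExecProcNode
# for ref in find_references("ExecProcNode", "/path/to/postgres"):
#     print(ref["file_path"], ref["line_num"])
```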
| Symptom | Possible Cause | Resolution |
|---|---|---|
| `global: GTAGS not found.` | Index not created | Run `gtags` at the source root |
| Reference results empty | Macro expansion / alternative symbol name | Try related identifiers / adjust label / verify commit |
| Results outdated | Source changed without reindexing | Run `gtags --incremental` or regenerate |
| High memory usage | Full tree reindexed repeatedly | Limit scope (invoke on a subset), or ensure adequate RAM |
The scripts in this project incrementally build and process symbol information from the PostgreSQL code tree into a DuckDB database, then generate documentation using AI agents.
The processing targets and recommended execution order for each script are as follows:
- `create_duckdb_index.py`
  Based on the index output from GNU GLOBAL, creates and populates the `symbol_definitions` table in `global_symbols.db`.
  - Main columns: `id` (primary key), `symbol_name`, `file_path`, `line_num_start`, `line_num_end`, `line_content`, `contents`
  - Must be run first to import all symbol information into the DB.
- `set_file_end_lines.py`
  Sets the range (`line_num_end`) of each symbol definition within its file.
  - Target: `symbol_definitions` table (updates the `line_num_end` column)
- `extract_symbol_references.py`
  Runs `global -rx` for each symbol in `symbol_definitions` and outputs the reference relationships as `symbol_references.csv`.
- `filter_frequent_symbol_from_csv.py`
  Filters out frequent symbols and unnecessary references to generate `symbol_references_filtered.csv`.
- `import_symbol_reference.py`
  Imports `symbol_references_filtered.csv` into the `symbol_reference` table in `global_symbols.db`.
  - Columns: `from_node` (source symbol ID), `to_node` (target symbol ID), `line_num_in_from`
- `update_symbol_types.py`
  Adds a `symbol_type` column to the `symbol_definitions` table and automatically estimates and records each symbol's type (function/variable/type, etc.).
- `process_symbol_definitions.py`
  Cleans up duplicate definitions and unnecessary data, outputs statistics, etc.
- `scripts/prepare_cluster.py`
  Automatically clusters symbols based on dependencies and prepares batches (e.g., `data/processing_batches.json`) for AI documentation generation. Be sure to run this step.
- `scripts/orchestrator.py`
  Orchestrates automatic documentation generation and management by AI (e.g., Claude Code) based on the symbol, reference, and cluster information in DuckDB.
  - Also supports batch processing plans via `data/processing_batches.json`
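After the pipeline has run, the resulting database can be sanity-checked directly with the `duckdb` Python package (a minimal sketch, assuming the default `global_symbols.db` in the working directory):

```python
# Minimal sanity check of the pipeline output.
import duckdb

con = duckdb.connect("global_symbols.db", read_only=True)

# How many symbol definitions were imported?
print(con.execute("SELECT COUNT(*) FROM symbol_definitions").fetchone())

# How many reference edges were imported?
print(con.execute("SELECT COUNT(*) FROM symbol_reference").fetchone())

# Which symbol types were assigned?
print(con.execute(
    "SELECT symbol_type, COUNT(*) FROM symbol_definitions GROUP BY symbol_type"
).fetchall())

con.close()
```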
```bash
# 0. (Prerequisite) Prepare GNU GLOBAL index inside PostgreSQL source tree
# cd /path/to/postgres && gtags --gtagslabel=ctags

# 1. Import symbol definitions into DB
python misc/create_duckdb_index.py <source_directory>

# 2. Complete symbol range information
python misc/set_file_end_lines.py

# 3. Extract and filter reference relationships
python misc/extract_symbol_references.py
python misc/filter_frequent_symbol_from_csv.py

# 4. Import reference information into DB
python misc/import_symbol_reference.py

# 5. Assign symbol type information
python misc/update_symbol_types.py

# 6. Deduplication and final processing
python misc/process_symbol_definitions.py

# 7. Symbol clustering and batch generation (required)
python misc/scripts/prepare_cluster.py

# 8. AI-based documentation generation (tool-assisted)
python scripts/orchestrator.py

# 9. Update internal links between generated documents
python misc/update_markdown_links.py

# 10. Add GitHub location links to the generated documents
python misc/add_github_location_links.py

# 11. Clean up backticks around markdown links
python misc/cleanup_link_backticks.py
```

- `scripts/orchestrator.py`
  The agent creates Markdown (.md) files in the `output/temp` directory via `mcp_tool.py`. `orchestrator.py` extracts the contents of these md files in `output/temp` and adds them to the `documents` table in `global_symbols.db`, enabling AI-based documentation for each symbol.
- `scripts/mcp_tool.py` (Tool Interface for the AI Agent)
  `mcp_tool.py` implements the tool-side interface (e.g., Model Context Protocol style) that the AI agent invokes to obtain exactly the contextual information required to write a symbol's documentation. Its responsibilities typically include:
  - Fetching symbol definition records (name, file path, source snippet, start/end lines) from the DuckDB database.
  - Supplying reference relationships (incoming/outgoing dependencies) so the agent can describe usage patterns.
  - Providing cluster/batch metadata (e.g., related symbols processed together) to encourage a consistent narrative across a group of symbols.
  - Returning any previously generated documents (for incremental refinement or to avoid duplication).
  - Accepting newly generated Markdown content from the agent and writing it to `output/temp/*.md` so `orchestrator.py` can ingest it into the `documents` table.
  - (Optionally) Logging tool invocations or marking progress so that interrupted runs can resume.

  Conceptual interaction flow:
  - The orchestrator selects the next symbol (or batch) from the planning data (`data/processing_batches.json`).
  - The AI agent requests context via the tool methods exposed by `mcp_tool.py` (e.g., get_symbol_definition, get_references, get_cluster_context).
  - The AI agent generates a Markdown document and calls a write/persist method in `mcp_tool.py` to store it under `output/temp/`.
  - The orchestrator later scans `output/temp/` and imports the finalized content into DuckDB.

  This separation keeps the agent stateless regarding storage details, while `mcp_tool.py` abstracts data retrieval and persistence. If you extend the documentation scope (e.g., add performance notes or code metrics), you can augment `mcp_tool.py` with new retrieval endpoints without modifying the orchestration logic. (An illustrative sketch of such a tool method appears after this list.)
- Progress Management and Logging
  The progress of documentation generation/import and the batch status logs from `orchestrator.py` are recorded in `metadata.duckdb`.
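To make the tool layer concrete, here is a purely illustrative Python sketch of what retrieval methods such as get_symbol_definition / get_references could look like against `global_symbols.db`; the real `mcp_tool.py` may differ:

```python
# Illustrative sketch only -- not the actual mcp_tool.py implementation.
import duckdb

DB_PATH = "global_symbols.db"  # assumption: default database location


def get_symbol_definition(symbol_name):
    """Return one definition record for the given symbol, or None."""
    con = duckdb.connect(DB_PATH, read_only=True)
    try:
        row = con.execute(
            "SELECT id, symbol_name, file_path, line_num_start, line_num_end, contents "
            "FROM symbol_definitions WHERE symbol_name = ? LIMIT 1",
            [symbol_name],
        ).fetchone()
    finally:
        con.close()
    if row is None:
        return None
    keys = ("id", "symbol_name", "file_path", "line_num_start", "line_num_end", "contents")
    return dict(zip(keys, row))


def get_references(symbol_id):
    """Return (symbol_name, file_path, line) tuples for symbols referencing symbol_id."""
    con = duckdb.connect(DB_PATH, read_only=True)
    try:
        return con.execute(
            "SELECT d.symbol_name, d.file_path, r.line_num_in_from "
            "FROM symbol_reference AS r "
            "JOIN symbol_definitions AS d ON d.id = r.from_node "
            "WHERE r.to_node = ?",
            [symbol_id],
        ).fetchall()
    finally:
        con.close()
```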
- Markdown Generation by the AI Agent
  `mcp_tool.py` provides context; the agent creates md files for each symbol/file in `output/temp/`.
- DB Registration by orchestrator.py
  - Searches for `output/temp/*.md` and extracts the contents
  - Adds the contents to the `documents` table in `global_symbols.db`
  - (A rough sketch of this step is shown below.)
- Progress Logging
  - Records processing status and error information for each batch/file in `metadata.duckdb`
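A rough sketch of the DB-registration step, assuming each generated file is named `output/temp/<symbol_name>.md` (the real `orchestrator.py` is more involved):

```python
# Rough sketch only -- not the actual orchestrator.py logic.
from pathlib import Path

import duckdb

con = duckdb.connect("global_symbols.db")

for md_file in Path("output/temp").glob("*.md"):
    symbol_name = md_file.stem              # assumption: file named after the symbol
    content = md_file.read_text(encoding="utf-8")

    # Look up the symbol ID in the definitions table.
    row = con.execute(
        "SELECT id FROM symbol_definitions WHERE symbol_name = ? LIMIT 1",
        [symbol_name],
    ).fetchone()
    if row is None:
        continue  # no matching symbol; the real script would log this

    # Register (or refresh) the document for this symbol.
    con.execute("DELETE FROM documents WHERE symbol_id = ?", [row[0]])
    con.execute(
        "INSERT INTO documents (symbol_id, symbol_name, content) VALUES (?, ?, ?)",
        [row[0], symbol_name, content],
    )

con.close()
```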
The main database containing symbol information from the PostgreSQL source code.
Main table for symbol definitions. Base data for all scripts.
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER PRIMARY KEY | Unique identifier for each symbol |
| `symbol_name` | VARCHAR NOT NULL | Name of the symbol |
| `file_path` | VARCHAR NOT NULL | Path to the file containing the symbol |
| `line_num_start` | INTEGER NOT NULL | Starting line number of the symbol definition |
| `line_num_end` | INTEGER NOT NULL DEFAULT 0 | Ending line number of the symbol definition |
| `line_content` | VARCHAR NOT NULL | Content of the line where the symbol is defined |
| `contents` | VARCHAR DEFAULT '' | Full content of the symbol definition |
| `symbol_type` | VARCHAR | Type of symbol (added by `update_symbol_types.py`) |
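For example, the full source text stored in `contents` can be pulled straight from this table (a minimal sketch using the `duckdb` Python package; `ExecProcNode` is just a symbol taken from the examples above):

```python
# Minimal sketch: print the stored source of one symbol from symbol_definitions.
import duckdb

con = duckdb.connect("global_symbols.db", read_only=True)
row = con.execute(
    "SELECT file_path, line_num_start, line_num_end, contents "
    "FROM symbol_definitions WHERE symbol_name = ? LIMIT 1",
    ["ExecProcNode"],
).fetchone()
if row:
    file_path, start, end, contents = row
    print(f"{file_path}:{start}-{end}")
    print(contents)
con.close()
```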
Records reference relationships between symbols (created by import_symbol_reference.py).
| Column | Type | Description |
|---|---|---|
| `from_node` | INTEGER NOT NULL | ID of the source symbol |
| `to_node` | INTEGER NOT NULL | ID of the target symbol |
| `line_num_in_from` | INTEGER NOT NULL | Line number in the source file where the reference occurs |
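As a sketch of how these columns are typically joined back to `symbol_definitions` (assuming the default `global_symbols.db`), the following lists the symbols a given function references:

```python
# Minimal sketch: list the symbols that ExecProcNode references (outgoing edges).
import duckdb

con = duckdb.connect("global_symbols.db", read_only=True)
rows = con.execute(
    """
    SELECT callee.symbol_name, r.line_num_in_from
    FROM symbol_reference AS r
    JOIN symbol_definitions AS caller ON caller.id = r.from_node
    JOIN symbol_definitions AS callee ON callee.id = r.to_node
    WHERE caller.symbol_name = ?
    ORDER BY r.line_num_in_from
    """,
    ["ExecProcNode"],
).fetchall()
for name, line in rows:
    print(line, name)
con.close()
```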
Manages metadata for AI documentation generation process (created by scripts/prepare_cluster.py).
Copy of symbol information with additional metadata.
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER PRIMARY KEY | Symbol ID (same as in global_symbols.db) |
| `symbol_name` | VARCHAR | Name of the symbol |
| `symbol_type` | VARCHAR | Type of symbol (f=function, s=struct, v=variable) |
| `file_path` | VARCHAR | Path to the file containing the symbol |
| `module` | VARCHAR | Module/component the symbol belongs to |
| `start_line` | INTEGER | Starting line number |
| `end_line` | INTEGER | Ending line number |
| `layer` | INTEGER | Dependency layer (0=base, higher=dependent) |
| `cluster_id` | INTEGER | ID of the cluster this symbol belongs to |
| `created_at` | TIMESTAMP | When the record was created |
Stores dependency relationships (created by scripts/prepare_cluster.py).
| Column | Type | Description |
|---|---|---|
| `from_node` | INTEGER | Source symbol ID |
| `to_node` | INTEGER | Target symbol ID |
| PRIMARY KEY | (from_node, to_node) | Composite key ensuring unique relationships |
Stores information about symbol clusters for batch processing.
| Column | Type | Description |
|---|---|---|
| `cluster_id` | INTEGER PRIMARY KEY | Unique cluster identifier |
| `cluster_type` | VARCHAR | Type of cluster |
| `layer` | INTEGER | Dependency layer |
| `symbols` | JSON | Array of symbol IDs in this cluster |
| `estimated_tokens` | INTEGER | Estimated token count for processing |
| `created_at` | TIMESTAMP | When the cluster was created |
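As an illustration (assuming this table is named `clusters` and lives in `metadata.duckdb`, which is created by `scripts/prepare_cluster.py`; check the actual table name in your database), the planned work per dependency layer can be summarized like this:

```python
# Sketch only: summarize clusters per dependency layer.
# Assumption: the table is named "clusters" and is stored in metadata.duckdb.
import duckdb

con = duckdb.connect("metadata.duckdb", read_only=True)
for layer, n, tokens in con.execute(
    "SELECT layer, COUNT(*), SUM(estimated_tokens) "
    "FROM clusters GROUP BY layer ORDER BY layer"
).fetchall():
    print(f"layer {layer}: {n} clusters, ~{tokens} tokens")
con.close()
```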
Stores generated documentation (created by scripts/orchestrator.py).
Contains AI-generated documentation for each symbol.
| Column | Type | Description |
|---|---|---|
| `symbol_id` | INTEGER PRIMARY KEY | Symbol ID |
| `symbol_name` | VARCHAR | Name of the symbol |
| `symbol_type` | VARCHAR | Type of symbol |
| `layer` | INTEGER | Dependency layer |
| `content` | TEXT | Full markdown documentation |
| `summary` | TEXT | Brief summary of the symbol |
| `dependencies` | JSON | List of dependencies |
| `related_symbols` | JSON | List of related symbols |
| `created_at` | TIMESTAMP | When the document was created |
| `updated_at` | TIMESTAMP | When the document was last updated |
Tracks the progress of documentation generation batches.
| Column | Type | Description |
|---|---|---|
| `batch_id` | INTEGER PRIMARY KEY | Unique batch identifier |
| `symbol_ids` | JSON | Array of symbol IDs in this batch |
| `status` | VARCHAR | Processing status (pending/processing/completed/failed) |
| `started_at` | TIMESTAMP | When processing started |
| `completed_at` | TIMESTAMP | When processing completed |
| `error_message` | TEXT | Error message if failed |
| `processed_count` | INTEGER | Number of symbols processed |
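To monitor a long-running generation pass, the batch status breakdown can be queried directly (a sketch only; the table name below is a placeholder, since the actual name should be checked in `metadata.duckdb`):

```python
# Sketch for checking batch progress. "batch_progress" is a placeholder name;
# list the actual tables first and adjust the query accordingly.
import duckdb

con = duckdb.connect("metadata.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())   # discover the actual table names
print(con.execute(
    "SELECT status, COUNT(*) FROM batch_progress GROUP BY status"  # placeholder
).fetchall())
con.close()
```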
- ENTRY_POINTS.md : Lists the entry-point functions of each PostgreSQL process
- CODE_TREE.md : Overview of PostgreSQL's code tree structure (AI generated)
The comprehensive PostgreSQL WAL (Write-Ahead Logging) system documentation in topic_specific_generated_docs/about_wal/ was generated using an AI-powered multi-agent orchestration approach. This section describes the generation process and mechanisms involved.
The documentation was created by orchestrating three specialized AI subagents through Claude Code's interactive mode, leveraging an MCP (Model Context Protocol) server for PostgreSQL symbol information access.
The MCP server provides programmatic access to PostgreSQL symbol information through specialized tools:
- Installation: Clone any-script-mcp into the project directory
- Configuration: Run `../any-script-mcp/add_any_script_mcp_to_cc.sh` in the cloned any-script-mcp repo dir to register the MCP server
- Settings: User-scope registration with configuration stored at `~/.config/any-script-mcp/config.yaml`
- Tools Directory: Custom tool configurations in `any-script-mcp/config.yaml`
  - This file is the same as `~/.config/any-script-mcp/config.yaml`
The MCP server exposes PostgreSQL-specific tools:
- `pg_symbol_source(symbol)` - Retrieve source code for symbols
- `pg_symbol_overview(symbol)` - Get concise symbol overviews
- `pg_symbol_document(symbol)` - Fetch detailed documentation
- `pg_references_from(symbol)` - List symbols referenced by a given symbol
- `pg_references_to(symbol)` - List symbols that reference a given symbol
Three specialized subagents handle different aspects of documentation generation:
- architecture-analyzer - Analyzes PostgreSQL codebase structure and builds dependency maps
- detail-documenter - Creates detailed technical documentation for individual components
- integration-optimizer - Integrates component documentation into cohesive final documents
Each subagent is defined with specific capabilities and prompts in the `.claude/agents/` directory.
The main orchestration prompt coordinates the three-stage documentation generation process:
Stage 1: Architecture Analysis
- Analyzes WAL subsystem starting from key entry points (XLogInsert, WalSndLoop, WalReceiverMain, StartupXLOG, etc.)
- Builds comprehensive dependency maps with depth-3 traversal
- Generates `architecture_map.json`, `key_symbols.txt`, and `initial_outline.md`
Stage 2: Detailed Documentation
- Documents all symbols with importance scores > 0.8
- Creates component-specific documentation files
- Generates minimum 5 technical diagrams (Mermaid format)
- Focuses on critical areas like XLogRecord structure, LSN management, and replication mechanics
Stage 3: Integration and Optimization
- Merges component documentation into unified documents
- Eliminates redundancy while maintaining clarity
- Adds cross-references and navigation aids
- Produces final deliverables including:
  - `wal_complete_documentation.md` - Comprehensive technical guide
  - `wal_quick_reference.md` - 2-page developer reference
  - `wal_api_reference.md` - API signatures catalog
  - `quality_report.md` - Coverage metrics and analysis
- Prerequisites: Ensure the MCP server is configured and Claude Code has access to the subagent definitions
- Execution:
  - Open Claude Code in interactive mode
  - Copy the content from `prompts/generate_document_about_wal.md`
  - Paste it as a prompt in Claude Code
  - The orchestrator automatically executes all three stages sequentially
- Output: Generated documentation is saved to the project root directory...
The final WAL documentation includes:
- Core Documentation: comprehensive technical guide covering all WAL components
- Component Files: Separate documentation for generation, writing, replication, and recovery subsystems
- Technical Diagrams: 5+ Mermaid diagrams illustrating architecture and data flows
- Reference Materials: Quick reference guide and API documentation
- Agents Can Use Pre-Generated Knowledge: Drawing on pre-generated knowledge, instead of consuming context at generation time, may make the output better, maybe :)
- Systematic Coverage: Ensures all critical symbols and paths are documented
- Consistency: Standardized terminology and structure across all documentation
- Depth and Breadth: Combines high-level architecture views with detailed implementation specifics
- Maintainability: Modular generation allows for incremental updates
- Quality Assurance: Built-in validation and coverage metrics
This multi-agent approach demonstrates how AI can be effectively orchestrated to generate comprehensive, professional-quality technical documentation for complex systems like PostgreSQL's WAL subsystem.