Closed
Labels
component:markdown-parser, type:feature
Description
Implementation of the Markdown Parser (GFM) according to specification 04_markdown_parser.adoc.
Scope
The MarkdownParser is a lightweight component for parsing GitHub Flavored Markdown. It is NOT a full GFM parser.
What it does:
- Extract document structure (headings → hierarchical sections)
- Identify addressable elements (code blocks, tables, images)
- Parse YAML frontmatter metadata
- Map folder hierarchy to document structure
- Track source file + line numbers for all elements
What it does NOT do:
- Render HTML
- Parse inline formatting (bold, italic, inline links)
- Analyze table contents
- Support Setext headings, footnotes, or math blocks
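As an illustration of the "addressable elements with line numbers" idea in the scope above, here is a minimal sketch that collects fenced code blocks together with their 1-based start and end lines. `CodeBlock` and `extract_code_blocks` are hypothetical names used only for this example; the real parser would reuse `Element` and `SourceLocation` from models.py. The opening-fence regex is the `CODE_FENCE_OPEN` pattern quoted later in this issue. Unclosed blocks are silently dropped here, which is an assumption and not something the issue specifies.

```python
import re
from dataclasses import dataclass

# Pattern quoted from the spec section of this issue
CODE_FENCE_OPEN = re.compile(r'^(`{3,}|~{3,})(\w*)?$')


@dataclass
class CodeBlock:
    """Hypothetical stand-in for Element + SourceLocation."""
    language: str
    content: str
    start_line: int  # line of the opening fence (1-based)
    end_line: int    # line of the closing fence (1-based)


def extract_code_blocks(text: str) -> list[CodeBlock]:
    blocks: list[CodeBlock] = []
    fence = None          # marker of the currently open fence, if any
    language = ""
    start = 0
    body: list[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if fence is None:
            m = CODE_FENCE_OPEN.match(stripped)
            if m:
                fence, language = m.group(1), m.group(2) or ""
                start, body = lineno, []
        elif stripped.startswith(fence) and set(stripped) == {fence[0]}:
            # Closing fence: same character, at least as long as the opener
            blocks.append(CodeBlock(language, "\n".join(body), start, lineno))
            fence = None
        else:
            body.append(line)
    return blocks
```

Keeping the opening marker in `fence` means a `~~~` line inside a backtick-fenced block is treated as content, not as a closing fence.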
Implementation Tasks
Core Parsing
- Heading Extraction (AC-MD-01): `#` to `######` (ATX-style only)
- YAML Frontmatter (AC-MD-02): `---` block at file start
- Code Block Extraction (AC-MD-03): Fenced blocks with language
- Table Recognition (AC-MD-04): GFM pipe-tables (structure only)
- Image Extraction: `![alt](path)` pattern
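One way the heading task could be approached is sketched below: ATX headings are matched line by line with the `HEADING_PATTERN` quoted later in this issue, lines inside fenced code blocks are skipped, and a stack turns the flat heading list into a hierarchy. `SketchSection` and `extract_sections` are hypothetical names for this example only; the actual implementation is expected to reuse `Section` from models.py, whose fields are not shown in this issue.

```python
import re
from dataclasses import dataclass, field

HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+?)(?:\s+#*)?$')
FENCE_PATTERN = re.compile(r'^(`{3,}|~{3,})')


@dataclass
class SketchSection:
    """Hypothetical stand-in for Section from models.py."""
    title: str
    level: int
    line: int
    children: list["SketchSection"] = field(default_factory=list)


def extract_sections(text: str) -> list[SketchSection]:
    roots: list[SketchSection] = []
    stack: list[SketchSection] = []   # chain of currently open ancestors
    fence = None                      # marker of the open code fence, if any
    for lineno, line in enumerate(text.splitlines(), start=1):
        fence_match = FENCE_PATTERN.match(line)
        if fence_match:
            marker = fence_match.group(1)
            if fence is None:
                fence = marker
            elif (marker[0] == fence[0] and len(marker) >= len(fence)
                  and line.strip() == marker):
                fence = None
            continue
        if fence is not None:
            continue  # '#' lines inside code blocks are not headings
        m = HEADING_PATTERN.match(line)
        if not m:
            continue
        section = SketchSection(m.group(2), len(m.group(1)), lineno)
        # Close every open section at the same or a deeper level
        while stack and stack[-1].level >= section.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(section)
        stack.append(section)
    return roots
```

The stack approach also tolerates skipped levels (for example a `###` directly under a `#`): the deeper heading simply becomes a child of the nearest shallower one.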
Folder-as-Document
- Folder Scanning (AC-MD-05): Recursive directory traversal
- Sorting (AC-MD-06):
  - `index.md` / `README.md` always first
  - Numeric prefixes: `01_`, `02_`, ... `10_`, `11_` (natural sort)
  - Alphabetic fallback
- Hierarchy Mapping: Folder depth → section level offset
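The ordering and depth rules above might be sketched as a sort key plus a small depth helper. `sort_key` and `level_offset` are hypothetical names. Three details are assumptions not stated in the issue: index files are matched case-insensitively, files without a numeric prefix sort after prefixed ones, and the level offset equals the number of folders between the root and the file.

```python
import re
from pathlib import Path

INDEX_NAMES = {"index.md", "readme.md"}
NUMERIC_PREFIX = re.compile(r'^(\d+)_')


def sort_key(path: Path) -> tuple[int, int, str]:
    """Index files first, then numeric prefixes by value, then alphabetic."""
    name = path.name.lower()
    if name in INDEX_NAMES:
        return (0, 0, name)
    m = NUMERIC_PREFIX.match(name)
    if m:
        # Compare the prefix as an integer so 2_ sorts before 10_
        return (1, int(m.group(1)), name)
    return (2, 0, name)


def level_offset(root: Path, file_path: Path) -> int:
    """Number of folders between root and the file (0 for files in root)."""
    return len(file_path.relative_to(root).parts) - 1
```

A call such as `sorted(folder.iterdir(), key=sort_key)` would then give the order for a single directory level.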
Data Models
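```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any

# Import path is an assumption; the models live in src/mcp_server/models.py
from mcp_server.models import Element, Section


@dataclass
class MarkdownDocument:
    file_path: Path
    frontmatter: dict[str, Any]
    title: str
    sections: list[Section]  # Reuse from models.py
    elements: list[Element]  # Reuse from models.py


@dataclass
class FolderDocument:
    root_path: Path
    documents: list[MarkdownDocument]
    structure: list[Section]  # Combined hierarchy
```
Interface Methods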
```python
class MarkdownParser:
    def parse_file(self, file_path: Path) -> MarkdownDocument: ...
    def parse_folder(self, folder_path: Path) -> FolderDocument: ...
    def get_section(self, doc: MarkdownDocument, path: str) -> Section | None: ...
    def get_elements(self, doc: MarkdownDocument, element_type: str | None = None) -> list[Element]: ...
```
Acceptance Criteria
- AC-MD-01: Heading extraction with correct hierarchy
- AC-MD-02: YAML frontmatter parsing (strings, numbers, lists, nested objects)
- AC-MD-03: Fenced code blocks with language detection
- AC-MD-04: Table detection with column/row count
- AC-MD-05: Folder structure correctly mapped
- AC-MD-06: Numeric prefix sorting (1, 2, 10 not 1, 10, 2)
Regex Patterns (from spec)
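```python
# Line-oriented patterns: apply per line, or compile with re.MULTILINE
HEADING_PATTERN = r'^(#{1,6})\s+(.+?)(?:\s+#*)?$'
CODE_FENCE_OPEN = r'^(`{3,}|~{3,})(\w*)?$'
TABLE_ROW_PATTERN = r'^\|(.+)\|$'

# Whole-file pattern: needs re.DOTALL so that (.*?) can span multiple lines
FRONTMATTER_PATTERN = r'^---\s*\n(.*?)\n---\s*\n'

# Can match anywhere within a line
IMAGE_PATTERN = r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)'
```
Dependencies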
- PyYAML or ruamel.yaml for frontmatter parsing
- Reuse `Section`, `Element`, `SourceLocation` from `models.py`
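Since the choice between PyYAML and ruamel.yaml is left open, a frontmatter helper could take the YAML loader as a parameter. The sketch below is one possible shape, not the specified design: `split_frontmatter` and its `load` parameter are hypothetical, and the default falls back to PyYAML's `yaml.safe_load`, imported lazily. It also returns the number of lines consumed so that line numbers computed on the body can be mapped back to the original file.

```python
import re
from typing import Any, Callable, Optional

# Pattern from the spec section of this issue, compiled with DOTALL
FRONTMATTER_PATTERN = re.compile(r'^---\s*\n(.*?)\n---\s*\n', re.DOTALL)


def split_frontmatter(
    text: str,
    load: Optional[Callable[[str], Any]] = None,
) -> tuple[dict[str, Any], str, int]:
    """Return (metadata, body, line_offset) for a Markdown source string."""
    if load is None:
        import yaml  # PyYAML, only needed when no loader is supplied
        load = yaml.safe_load
    m = FRONTMATTER_PATTERN.match(text)
    if not m:
        return {}, text, 0
    data = load(m.group(1))
    if not isinstance(data, dict):
        data = {}  # assumption: non-mapping frontmatter is treated as empty
    # Lines consumed by the frontmatter block, including both '---' lines
    line_offset = text.count("\n", 0, m.end())
    return data, text[m.end():], line_offset
```

With this shape, line `n` of the returned body corresponds to line `n + line_offset` of the original file.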
References
- src/docs/spec/04_markdown_parser.adoc
- src/mcp_server/models.py (shared data models)
- src/mcp_server/asciidoc_parser.py (reference implementation)