Markdown Parser: Core Implementation #4

@raifdmueller

Description

Implementation of the Markdown Parser (GFM) according to specification 04_markdown_parser.adoc.

Scope

The MarkdownParser is a lightweight component that extracts document structure from GitHub Flavored Markdown (GFM) files. It is NOT a full GFM parser.

What it does:

  • Extract document structure (headings → hierarchical sections)
  • Identify addressable elements (code blocks, tables, images)
  • Parse YAML frontmatter metadata
  • Map folder hierarchy to document structure
  • Track source file + line numbers for all elements

What it does NOT do:

  • Render HTML
  • Parse inline formatting (bold, italic, inline links)
  • Analyze table contents
  • Support Setext headings, footnotes, or math blocks

Implementation Tasks

Core Parsing

  • Heading Extraction (AC-MD-01): # to ###### (ATX-style only)
  • YAML Frontmatter (AC-MD-02): --- block at file start
  • Code Block Extraction (AC-MD-03): Fenced blocks with language
  • Table Recognition (AC-MD-04): GFM pipe-tables (structure only)
  • Image Extraction: ![alt](src "title") pattern
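To make the heading and code-block tasks concrete, here is a minimal sketch of heading extraction that records line numbers and ignores `#` lines inside fenced code blocks. It reuses HEADING_PATTERN and CODE_FENCE_OPEN as listed in this issue; the function name `extract_headings`, the tuple return shape, and the fence-closing rule (same character, at least as long as the opening fence) are assumptions for illustration, not the spec's design.

```python
import re

# Patterns copied from the "Regex Patterns (from spec)" section of this issue.
HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+?)(?:\s+#*)?$')
CODE_FENCE_OPEN = re.compile(r'^(`{3,}|~{3,})(\w*)?$')


def extract_headings(text: str) -> list[tuple[int, int, str]]:
    """Return (line_number, level, title) for ATX headings outside code fences.

    Hypothetical helper: the real parser would build Section objects instead.
    """
    headings = []
    fence = None  # opening fence string while inside a fenced block, else None
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if fence is None:
            fence_match = CODE_FENCE_OPEN.match(stripped)
            if fence_match:
                fence = fence_match.group(1)
                continue
            heading_match = HEADING_PATTERN.match(stripped)
            if heading_match:
                level = len(heading_match.group(1))
                headings.append((line_no, level, heading_match.group(2)))
        elif set(stripped) == {fence[0]} and len(stripped) >= len(fence):
            # Assumed closing rule: only fence characters, at least as many
            # as the opening fence.
            fence = None
    return headings
```

Matching line by line avoids the need for `re.MULTILINE` and gives the 1-based line number for free, which is what source tracking requires.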

Folder-as-Document

  • Folder Scanning (AC-MD-05): Recursive directory traversal
  • Sorting (AC-MD-06):
    • index.md / README.md always first
    • Numeric prefixes: 01_, 02_, ... 10_, 11_ (natural sort)
    • Alphabetic fallback
  • Hierarchy Mapping: Folder depth → section level offset
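The sorting rules above can be sketched as a sort key. This is only one possible reading: the issue does not say whether numerically prefixed files come before unprefixed ones, or how index.md and README.md rank against each other, so the ordering used here (index files, then numeric prefixes, then everything else alphabetically) and the name `folder_sort_key` are assumptions.

```python
import re
from pathlib import Path

INDEX_NAMES = {"index.md", "readme.md"}
NUMERIC_PREFIX = re.compile(r'^(\d+)_')  # e.g. 01_, 02_, 10_


def folder_sort_key(path: Path) -> tuple[int, int, str]:
    """Hypothetical sort key for files within one folder."""
    name = path.name.lower()
    if name in INDEX_NAMES:
        return (0, 0, name)  # index.md / README.md always first
    match = NUMERIC_PREFIX.match(name)
    if match:
        # Compare the prefix as an integer so 2_ sorts before 10_.
        return (1, int(match.group(1)), name)
    return (2, 0, name)  # alphabetic fallback
```

Usage would look like `sorted(folder.iterdir(), key=folder_sort_key)`. Converting the prefix with `int()` is what turns a plain string sort into the natural sort that AC-MD-06 asks for.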

Data Models

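from dataclasses import dataclass
from pathlib import Path
from typing import Any

# Assumes this module sits next to models.py in src/mcp_server/.
from .models import Element, Section

@dataclass
class MarkdownDocument:
    file_path: Path
    frontmatter: dict[str, Any]
    title: str
    sections: list[Section]  # Reuse from models.py
    elements: list[Element]  # Reuse from models.py

@dataclass
class FolderDocument:
    root_path: Path
    documents: list[MarkdownDocument]
    structure: list[Section]  # Combined hierarchy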

Interface Methods

class MarkdownParser:
    def parse_file(self, file_path: Path) -> MarkdownDocument: ...
    def parse_folder(self, folder_path: Path) -> FolderDocument: ...
    def get_section(self, doc: MarkdownDocument, path: str) -> Section | None: ...
    def get_elements(self, doc: MarkdownDocument, element_type: str | None = None) -> list[Element]: ...

Acceptance Criteria

  • AC-MD-01: Heading extraction with correct hierarchy
  • AC-MD-02: YAML frontmatter parsing (strings, numbers, lists, nested objects)
  • AC-MD-03: Fenced code blocks with language detection
  • AC-MD-04: Table detection with column/row count
  • AC-MD-05: Folder structure correctly mapped
  • AC-MD-06: Numeric prefix sorting (1, 2, 10 not 1, 10, 2)

Regex Patterns (from spec)

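# Line-anchored patterns (^...$): match line by line, or compile with re.MULTILINE.
HEADING_PATTERN = r'^(#{1,6})\s+(.+?)(?:\s+#*)?$'
CODE_FENCE_OPEN = r'^(`{3,}|~{3,})(\w*)?$'
# Needs re.DOTALL so that .*? can span multiple frontmatter lines.
FRONTMATTER_PATTERN = r'^---\s*\n(.*?)\n---\s*\n'
IMAGE_PATTERN = r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)'
TABLE_ROW_PATTERN = r'^\|(.+)\|$'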
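A short sketch of how FRONTMATTER_PATTERN and IMAGE_PATTERN could be applied while keeping line numbers relative to the original file. The helper names and return shapes are hypothetical. The raw YAML block is returned as text here; handing it to a YAML library (for example PyYAML's `yaml.safe_load`) is left out so the sketch has no third-party dependency.

```python
import re

# Patterns from this issue; re.DOTALL lets .*? span several frontmatter lines.
FRONTMATTER_PATTERN = re.compile(r'^---\s*\n(.*?)\n---\s*\n', re.DOTALL)
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)')


def split_frontmatter(text: str) -> tuple[str, str, int]:
    """Return (raw_yaml, body, line_offset); offset is 0 without frontmatter."""
    match = FRONTMATTER_PATTERN.match(text)
    if match is None:
        return "", text, 0
    # Number of lines consumed by the frontmatter block, so body line
    # numbers can be shifted back to file line numbers.
    line_offset = text.count("\n", 0, match.end())
    return match.group(1), text[match.end():], line_offset


def extract_images(body: str, line_offset: int = 0) -> list[tuple]:
    """Return (file_line, alt, src, title_or_None) for each image in body."""
    images = []
    for line_no, line in enumerate(body.splitlines(), start=1 + line_offset):
        for match in IMAGE_PATTERN.finditer(line):
            images.append((line_no, match.group(1), match.group(2), match.group(3)))
    return images
```

Carrying `line_offset` through is one way to satisfy the source-tracking requirement once the frontmatter has been cut off the front of the file.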

Dependencies

  • PyYAML or ruamel.yaml for frontmatter parsing
  • Reuse Section, Element, SourceLocation from models.py

References

  • src/docs/spec/04_markdown_parser.adoc
  • src/mcp_server/models.py (shared data models)
  • src/mcp_server/asciidoc_parser.py (reference implementation)
