
PDD-CLI Bug: Generates Data Models Without Extraction/Parsing Logic #585

@jiaminc-cmu

Description


PDD-CLI generates data model classes with field definitions, but it does not implement the logic that extracts and populates those fields from actual data sources. The models exist, yet they stay empty because the source data is never parsed.

Why this matters: Generated models are unusable: every field remains `None` or a default value because nothing populates them from the source data.

Concrete Example

For a contact management system:

# PDD generated model (INCOMPLETE):
# models/contact.py
from typing import List, Optional

from pydantic import BaseModel

class Contact(BaseModel):
    email: str
    name: str
    company: Optional[str] = None
    labels: List[str] = []

But no extraction logic:

# handlers/create_contact.py
def create_contact(issue_body: str) -> Contact:
    # PDD generated this - but how to extract fields?
    contact = Contact(email="???", name="???")  # ← No parsing implemented!
    return contact

What went wrong: PDD defined the model structure but didn't implement the parsing logic to extract `email`, `name`, and `company` from the GitHub issue body format.

Impact: All contacts are created with placeholder data; the actual data from issues is never extracted.

Why PDD Makes This Mistake

PDD-CLI currently:

  • Generates data structures (models) separately from data pipelines (extraction)
  • Defines "what" without implementing "how"
  • Assumes extraction logic will be added later

But it should:

  1. Generate complete data pipeline: parse → validate → transform → store
  2. Implement extraction logic for defined fields
  3. Handle parsing failures gracefully
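The parse → validate → transform → store pipeline above can be sketched as plain composable functions. This is a minimal illustration, not PDD's actual output: the function names, the `Email:`/`Name:`/`Company:` line format, and the use of a plain dict (instead of the pydantic model) are assumptions made to keep the sketch self-contained.

```python
import re
from typing import Optional

def parse(issue_body: str) -> dict:
    """Parse stage: pull raw field strings out of the issue body.
    Assumes 'Key: value' lines, which is illustrative only."""
    fields = {}
    for key in ("Email", "Name", "Company"):
        m = re.search(rf"{key}:\s*(.+)", issue_body)
        fields[key.lower()] = m.group(1).strip() if m else None
    return fields

def validate(fields: dict) -> dict:
    """Validate stage: fail fast instead of storing placeholder data."""
    if not fields.get("email") or not fields.get("name"):
        raise ValueError("Missing required fields: email and name")
    return fields

def transform(fields: dict) -> dict:
    """Transform stage: normalize values before storage."""
    fields["email"] = fields["email"].lower()
    return fields

def store(fields: dict, db: list) -> dict:
    """Store stage: append to a stand-in datastore (a list here)."""
    db.append(fields)
    return fields
```

Usage would then be a single composition, e.g. `store(transform(validate(parse(body))), db)`, so a model can never reach storage without having been parsed and validated first.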

How to Prevent This in PDD-CLI

What PDD should do differently:

  1. Generate complete data pipeline:

    import re

    def parse_contact_from_issue(issue_body: str) -> Contact:
        """Extract contact fields from a GitHub issue body."""
        
        # Extract email
        email_match = re.search(r'Email:\s*(\S+@\S+)', issue_body)
        email = email_match.group(1) if email_match else None
        
        # Extract name
        name_match = re.search(r'Name:\s*(.+)', issue_body)
        name = name_match.group(1).strip() if name_match else None
        
        # Extract company
        company_match = re.search(r'Company:\s*(.+)', issue_body)
        company = company_match.group(1).strip() if company_match else None
        
        if not email or not name:
            raise ValueError("Missing required fields")
        
        return Contact(email=email, name=name, company=company)
  2. Generate validation and error handling: Handle malformed input gracefully.

  3. Generate tests for extraction: Ensure parsing works correctly.
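Point 3 could look like the tests below. This is a hedged sketch: the parser is a condensed, self-contained copy of the one sketched above (returning a dict so the example runs without pydantic), and the sample issue text is invented for illustration.

```python
import re

def parse_contact_from_issue(issue_body: str) -> dict:
    """Condensed copy of the parser sketched above, kept inline so
    the tests are self-contained."""
    email_m = re.search(r"Email:\s*(\S+@\S+)", issue_body)
    name_m = re.search(r"Name:\s*(.+)", issue_body)
    if not email_m or not name_m:
        raise ValueError("Missing required fields")
    return {"email": email_m.group(1), "name": name_m.group(1).strip()}

def test_parses_well_formed_issue():
    # A well-formed issue body yields fully populated fields.
    body = "Name: Grace Hopper\nEmail: grace@example.com"
    contact = parse_contact_from_issue(body)
    assert contact["email"] == "grace@example.com"
    assert contact["name"] == "Grace Hopper"

def test_rejects_malformed_issue():
    # Malformed input raises instead of silently producing placeholders.
    try:
        parse_contact_from_issue("no structured fields here")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for malformed input")
```

Tests like these catch exactly the failure mode described in this issue: a model class that type-checks but is never populated from real data.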

Example improvement:

Current: "Create Contact model"
       → Generate Contact class
       → No extraction logic
       → Fields never populated

Improved: "Create Contact model"
        → Generate Contact class
        → Generate parse_contact_from_issue()
        → Generate validation logic
        → Generate tests with sample data
        → Complete, working pipeline

Severity

P1 - High Priority

  • Frequency: Medium - affects data-driven features
  • Impact: High - features non-functional (models never populated)
  • Detectability: High - obvious when data remains empty
  • Prevention cost: Medium - requires understanding data format and generating parsing logic

Category

incomplete-implementation

Related Issues


For Contributors: Discovered when the Contact model existed but GitHub issue data was never extracted into it; manual parsing logic was added in commit 34a651d5.
