PDD-CLI Bug: Generates Test Data Without Proper CSV Escaping #577

@jiaminc-cmu

PDD-CLI generates CSV test data in which fields containing commas are left unquoted, which breaks CSV parsing. When a single value is itself a comma-separated list (such as tags or labels), PDD writes it without surrounding quotes.

Why this matters: The data rows no longer match the header, so tests that read the file fail or run against malformed records.

Concrete Example

For a test that creates GitHub issues with labels:

# PDD generated test data (WRONG):
# test_data.csv
email,name,labels
user1@example.com,John Doe,attendee,vip
user2@example.com,Jane Smith,speaker,sponsor

CSV parser reads this as:

# Row 1 has 4 fields instead of 3!
['user1@example.com', 'John Doe', 'attendee', 'vip']  # ← Extra field!

Correct format:

# Should generate (CORRECT):
# test_data.csv
email,name,labels
user1@example.com,John Doe,"attendee,vip"
user2@example.com,Jane Smith,"speaker,sponsor"

What went wrong: PDD generated labels as attendee,vip without quotes. The CSV parser treats the comma as a field delimiter, splitting the row into 4 fields instead of 3.

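Impact: Every data row has more values than the header has columns. Python's csv.DictReader does not raise on this; it keeps only the first label under labels and collects the surplus values in a list under the key None, so tests receive malformed records.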
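The parsing behaviour can be reproduced with Python's standard csv module. This is a minimal sketch, assuming the malformed data is held in an in-memory string rather than read from test_data.csv; the variable names are illustrative only.

```python
import csv
import io

# The malformed data from the example above, kept in memory for illustration.
malformed = (
    "email,name,labels\n"
    "user1@example.com,John Doe,attendee,vip\n"
)

# csv.reader yields four fields for the data row, one more than the header.
rows = list(csv.reader(io.StringIO(malformed)))
fields = rows[1]

# csv.DictReader does not raise. Surplus values are collected in a list
# stored under the restkey, which defaults to None.
record = next(csv.DictReader(io.StringIO(malformed)))
labels = record["labels"]   # only the first label survives
overflow = record[None]     # the remaining values end up here
```

A test that reads record["labels"] would see only attendee, which is why the failure can surface as a wrong assertion rather than a parse error.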

Why PDD Makes This Mistake

PDD-CLI currently:

  • Generates CSV as plain text
  • Doesn't quote fields containing special characters
  • Doesn't use proper CSV writing libraries

But it should:

  1. Use csv.DictWriter or equivalent to handle escaping
  2. Always quote fields containing commas, quotes, or newlines, and double any quote character inside a quoted field
  3. Follow RFC 4180 CSV spec
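As a rough illustration of these three points, the sketch below writes rows with Python's csv.writer in its default dialect, which quotes only the fields that need it. The sample values (a name with embedded quotes, a label with an embedded newline) are made up for the example and do not come from the original test data.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: minimal quoting, "\r\n" row endings

writer.writerow(["email", "name", "labels"])
# Embedded quotes are doubled and the whole field is wrapped in quotes.
writer.writerow(["user1@example.com", 'John "JD" Doe', "attendee,vip"])
# A field containing a newline is quoted so it stays a single field.
writer.writerow(["user2@example.com", "Jane Smith", "speaker\nsponsor"])

output = buf.getvalue()

# Reading the text back returns the original values unchanged.
parsed = list(csv.reader(io.StringIO(output, newline="")))
```

Fields without special characters, such as the email addresses, are left unquoted, which matches the corrected file shown earlier.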

How to Prevent This in PDD-CLI

What PDD should do differently:

  1. Use CSV libraries for generation:

    import csv
    
    with open('test_data.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['email', 'name', 'labels'])
        writer.writeheader()
        writer.writerow({
            'email': 'user1@example.com',
            'name': 'John Doe',
            'labels': 'attendee,vip'  # Library handles quoting
        })

  2. Manual generation: always quote fields that contain commas:

    email,name,labels
    user1@example.com,John Doe,"attendee,vip"
    
  3. Validate generated CSV: Parse it back to ensure it works.
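One possible shape for the validation step is a round-trip check that parses the generated text and reports rows whose field count does not match the header. The helper name validate_csv and its return format are hypothetical, not part of PDD-CLI.

```python
import csv
import io

def validate_csv(text, expected_fields):
    """Hypothetical helper: parse `text` back and return a list of problems."""
    problems = []
    reader = csv.DictReader(io.StringIO(text, newline=""))

    # The header must match the expected column names exactly.
    if reader.fieldnames != list(expected_fields):
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    for row_no, row in enumerate(reader, start=1):
        # Surplus values are stored under the key None (the default restkey).
        if None in row:
            problems.append(f"data row {row_no}: extra fields {row[None]}")
        # Missing values are filled with None (the default restval).
        missing = [k for k, v in row.items() if k is not None and v is None]
        if missing:
            problems.append(f"data row {row_no}: missing fields {missing}")
    return problems

bad = "email,name,labels\nuser1@example.com,John Doe,attendee,vip\n"
good = 'email,name,labels\nuser1@example.com,John Doe,"attendee,vip"\n'

bad_problems = validate_csv(bad, ["email", "name", "labels"])
good_problems = validate_csv(good, ["email", "name", "labels"])
```

Here bad_problems contains one entry for the unquoted labels row and good_problems is empty, so a generator could refuse to emit the file whenever the list is non-empty.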

Example improvement:

Current: Generate CSV as string concatenation
       → labels = "attendee,vip" (no quotes)
       → CSV broken (4 fields instead of 3)

Improved: Generate CSV using csv.DictWriter
        → Automatic quoting for fields with commas
        → Valid CSV produced

Severity

P2 - Medium Priority

  • Frequency: Low - only affects CSV test data generation
  • Impact: Test data parsing failures
  • Detectability: High - immediate CSV parsing errors
  • Prevention cost: Low - use CSV libraries

Category

test-environment

Related Issues


For Contributors: Discovered in backend/tests/test_crm_github.py where GitHub issue labels CSV was malformed, fixed in commit 34a651d5.
