Extend Dataset with Increased Diversity and Complexity #37

@hanneshapke

Background

The current dataset (~20,000 samples) provides a solid foundation for PII detection, but has limited diversity across several critical dimensions. To build a robust, production-ready model that generalizes well to real-world scenarios, we need to significantly expand dataset coverage.

Current Limitations

Languages:

  • ✅ Active: 6 languages (English, German, French, Spanish, Dutch, Danish)
  • ❌ Inactive: 30+ languages commented out in code
  • Impact: Limited multilingual performance, Western European bias

Co-reference Complexity:

  • ✅ Simple: 1-2 entity clusters, direct pronouns
  • ❌ Missing: Multi-entity chains, cross-sentence references, ambiguous references
  • Impact: Fails on complex conversational text

PII Types:

  • ✅ Coverage: ~20 PII types (names, emails, addresses, IDs)
  • ❌ Missing: Rare PII types, domain-specific identifiers, compound entities
  • Impact: Misses edge cases in specialized domains

Scenarios:

  • ✅ Coverage: Emails, forms, medical records, social posts
  • ❌ Missing: Legal documents, technical support, contracts, chat logs, forums
  • Impact: Lower accuracy on underrepresented text types

Writing Styles:

  • ✅ Coverage: Formal business, medical, casual social
  • ❌ Missing: Technical, legal, academic, slang, code-switched text
  • Impact: Style-dependent performance degradation

Enhancement Goals

Goal 1: Expand Language Coverage (6 → 30+ languages)

Priority Tier 1: Major World Languages

  • 🇬🇧 English (current)
  • 🇪🇸 Spanish (current)
  • 🇫🇷 French (current)
  • 🇩🇪 German (current)
  • 🇨🇳 Chinese (Simplified & Traditional) ⭐ NEW
  • 🇯🇵 Japanese ⭐ NEW
  • 🇮🇳 Hindi ⭐ NEW
  • 🇸🇦 Arabic ⭐ NEW
  • 🇵🇹 Portuguese (Brazil & Portugal) ⭐ NEW
  • 🇷🇺 Russian ⭐ NEW
  • 🇰🇷 Korean ⭐ NEW

Priority Tier 2: Regional Languages

  • 🇮🇹 Italian
  • 🇳🇱 Dutch (current)
  • 🇵🇱 Polish
  • 🇹🇷 Turkish
  • 🇸🇪 Swedish
  • 🇳🇴 Norwegian
  • 🇩🇰 Danish (current)
  • 🇫🇮 Finnish
  • 🇬🇷 Greek
  • 🇷🇴 Romanian
  • 🇺🇦 Ukrainian
  • 🇨🇿 Czech
  • 🇮🇩 Indonesian
  • 🇻🇳 Vietnamese
  • 🇹🇭 Thai
  • 🇵🇭 Tagalog
  • 🇰🇪 Swahili
  • 🇿🇦 Afrikaans

Goal 2: Increase Co-reference Complexity

Current:

"John Smith called. He said his email is john@email.com."
Clusters: [John Smith, He, his] → Simple 1-cluster

Target - Multi-Entity Chains:

"Sarah met Tom at the office. She told him about the project. 
They discussed it with their manager, Dr. Chen, who approved 
their proposal. Sarah's team will work with Tom's department."

Clusters:
- Cluster 1: Sarah, She, Sarah's
- Cluster 2: Tom, him, Tom's
- Cluster 3: Dr. Chen, their manager, who
- Cluster 4: the project, it, their proposal
- Cluster 5: Sarah's team, their (ambiguous)

Target - Cross-Sentence References:

"Contact: john.doe@company.com
Department: Engineering
Manager: Sarah Thompson

John has been with the company for 5 years. His expertise
in machine learning makes him an excellent candidate. We
recommend scheduling an interview with him and his team lead."

Clusters spanning multiple sentences

Target - Ambiguous References:

"The customer called about their order. The representative 
said they would check the status. They both agreed to follow up."

Ambiguous "they" - requires context resolution

Goal 3: Expand PII Types

Current Coverage (~20 types):

  • Names: FIRSTNAME, SURNAME
  • Contact: EMAIL, PHONENUMBER, URL
  • Location: ADDRESS, CITY, STATE, ZIP, COUNTRY, STREET, BUILDINGNUM
  • IDs: SSN, DRIVER_LICENSE, PASSPORT, NATIONAL_ID, IDCARDNUM, TAXNUM
  • Financial: IBAN
  • Other: DATEOFBIRTH, AGE, COMPANYNAME

New PII Types to Add:

Identity:

  • MIDDLENAME
  • NICKNAME / ALIAS
  • USERNAME (social media handles)
  • FULL_NAME (compound entity)
  • TITLE (Dr., Prof., Mr., Mrs.)

Contact Extended:

  • FAX_NUMBER
  • MOBILE_NUMBER (separate from landline)
  • EXTENSION (phone extension)
  • SOCIAL_MEDIA_HANDLE (@username)
  • SKYPE_ID / ZOOM_ID

Location Extended:

  • APARTMENT_NUMBER
  • FLOOR_NUMBER
  • POSTAL_CODE (international)
  • LANDMARK
  • GPS_COORDINATES

IDs Extended:

  • MEDICAL_RECORD_NUMBER
  • PATIENT_ID
  • EMPLOYEE_ID
  • STUDENT_ID
  • INSURANCE_POLICY_NUMBER
  • MEMBERSHIP_NUMBER
  • VEHICLE_VIN
  • LICENSE_PLATE (extended beyond LICENSEPLATENUM)

Financial Extended:

  • CREDIT_CARD_NUMBER
  • BANK_ACCOUNT_NUMBER
  • ROUTING_NUMBER
  • SWIFT_CODE
  • CRYPTO_WALLET_ADDRESS
  • PAYMENT_CARD_CVV

Technical and Biometric:

  • IP_ADDRESS
  • MAC_ADDRESS
  • DEVICE_ID / IMEI
  • BIOMETRIC_ID

Temporal:

  • TIME (specific time references)
  • TIMESTAMP
  • AGE_RANGE (instead of exact age)

Professional:

  • JOB_TITLE
  • EMPLOYER
  • SALARY / COMPENSATION
  • WORK_EMAIL (vs personal)

Medical:

  • MEDICAL_CONDITION
  • PRESCRIPTION_NUMBER
  • HEALTHCARE_PROVIDER
  • INSURANCE_GROUP_NUMBER

Goal 4: Diversify Scenarios and Contexts

Current Scenarios:

  • Business emails
  • Medical records
  • Customer service forms
  • Social media posts

New Scenarios to Add:

Professional:

  • Legal contracts and NDAs
  • HR documents (resumes, performance reviews)
  • Technical support tickets
  • Meeting transcripts / notes
  • Project proposals
  • Sales communications

Consumer:

  • E-commerce orders and receipts
  • Shipping/tracking notifications
  • Product reviews with PII
  • Customer complaints
  • Subscription cancellations
  • Warranty claims

Financial:

  • Bank statements
  • Loan applications
  • Tax documents
  • Investment portfolios
  • Insurance claims
  • Payment disputes

Healthcare:

  • Patient intake forms
  • Lab results
  • Prescription orders
  • Insurance authorizations
  • Medical histories
  • Appointment schedules

Educational:

  • Student registration forms
  • Transcripts and grades
  • Recommendation letters
  • Scholarship applications
  • Course evaluations

Government:

  • Visa applications
  • Permit requests
  • Voter registration
  • Census forms
  • Benefits applications

Social/Personal:

  • Dating profiles
  • Forum posts with personal info
  • Chat conversations
  • Blog comments
  • Personal introductions

Technical:

  • Bug reports with user data
  • API logs with PII
  • Database dumps (sanitized examples)
  • Error messages with personal info
  • System logs

Goal 5: Increase Stylistic Diversity

Formality Levels:

  • Very Formal: Legal, government, academic
  • Formal: Business, medical, official
  • Semi-Formal: Professional emails, reports
  • Informal: Social media, blogs, forums
  • Very Informal: Chat, SMS, slang

Writing Styles:

  • Technical documentation
  • Legal language
  • Academic papers
  • Journalistic articles
  • Marketing copy
  • Personal narratives
  • Instructional text
  • Conversational dialogue

Tonal Variations:

  • Neutral/objective
  • Friendly/warm
  • Urgent/pressing
  • Apologetic
  • Assertive
  • Empathetic

Text Formats:

  • Structured forms (key: value)
  • Free-form narratives
  • Bullet points / lists
  • Tables
  • Mixed formats
  • Code snippets with PII
  • JSON/XML with personal data

Goal 6: Add Cross-Cultural Variations

Naming Conventions:

  • Western: First Last
  • Eastern: Last First (Chinese, Korean, Japanese)
  • Hispanic: Multiple surnames (María García López)
  • Arabic: Long patronymic chains (Ahmed bin Mohammed bin...)
  • Single names (mononyms, e.g., common in Indonesia)
  • Generational suffixes as part of name (Jr., Sr., III)

Address Formats:

  • US: Street, City, State ZIP
  • UK: Street, Town, County, Postcode
  • German: Street followed by house number, PLZ City
  • Japanese: Prefecture, City, Ward, Block
  • Arabic-speaking regions: often descriptive landmarks instead of street names

Date Formats:

  • US: MM/DD/YYYY
  • European: DD/MM/YYYY
  • ISO: YYYY-MM-DD
  • Written: January 15, 2024 / 15 January 2024
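
To make the date conventions above concrete, here is a minimal sketch that renders one date in each numeric format using only the standard library. The `DATE_FORMATS` mapping and the `render_date` helper are hypothetical names for illustration, not part of the existing codebase. The written forms are left out because month names depend on the locale.

```python
from datetime import date

# Hypothetical mapping from convention name to a strftime pattern.
DATE_FORMATS = {
    "us": "%m/%d/%Y",        # MM/DD/YYYY
    "european": "%d/%m/%Y",  # DD/MM/YYYY
    "iso": "%Y-%m-%d",       # YYYY-MM-DD
}


def render_date(d: date, convention: str) -> str:
    """Format a date according to one of the conventions in DATE_FORMATS."""
    return d.strftime(DATE_FORMATS[convention])


if __name__ == "__main__":
    sample_date = date(2024, 1, 15)
    for name in DATE_FORMATS:
        print(name, render_date(sample_date, name))
```

Generating the same underlying date in several surface forms is one way to teach the model that DATEOFBIRTH does not have a single fixed shape.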

Phone Number Formats:

  • US: (555) 123-4567
  • International: +1-555-123-4567
  • UK: 020 7946 0958
  • Various regional formats

Implementation Plan

Phase 1: Language Expansion

File: model/dataset/training_set.py

Enable commented-out languages progressively:

def get_languages(self, language_count: int = 10, seed: int | None = 42, is_testing: bool = False) -> list[str]:
    # Tier 1: currently active languages, then major world languages (add first)
    tier1_languages = (
        "English", "German", "French", "Spanish", "Dutch", "Danish",
        "Chinese (Simplified)", "Japanese", "Hindi", "Arabic",  # NEW
        "Portuguese", "Russian", "Korean",  # NEW
    )
    
    # Tier 2: Regional languages
    tier2_languages = (
        "Italian", "Polish", "Turkish", "Swedish",
        "Norwegian", "Finnish", "Greek", "Romanian",
        "Ukrainian", "Czech", "Indonesian", "Afrikaans",
        "Vietnamese", "Thai", "Tagalog", "Swahili", 
    )
    
    all_languages = tier1_languages + tier2_languages
    
    # Always return a list so both branches have the same type
    if is_testing:
        return [all_languages[0]]
    return list(all_languages[:language_count])

Phase 2: Enhanced Prompt Templates

File: model/dataset/prompts.py

Add scenario-specific prompts:

SCENARIO_TEMPLATES = {
    "legal_contract": """
Generate a sample legal contract excerpt containing PII such as:
- Party names (full legal names with titles)
- Addresses (complete with apartment/suite numbers)
- Contact information
- Identification numbers
- Dates and signatures
Include complex co-reference chains with legal terminology.
""",
    
    "technical_support": """
Generate a technical support ticket containing:
- Customer name and contact info
- Account numbers / user IDs
- Device identifiers (IP, MAC address)
- Timestamps and system details
Include conversational back-and-forth with pronouns.
""",
    
    "multi_party_email": """
Generate an email thread with 3+ participants where:
- Multiple people are CC'd with names and emails
- People refer to each other with pronouns
- Contains complex co-reference chains
- Includes meeting times, locations, phone numbers
""",
    
    # Add 20+ more scenario templates...
}
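
One possible way to use such templates is to cycle through them by sample index so that every scenario receives a roughly equal share of samples. The sketch below assumes that approach. `pick_scenario` is a hypothetical helper, and the two template strings are shortened stand-ins for the full prompts.

```python
# Shortened stand-ins for the full templates (assumption for illustration).
SCENARIO_TEMPLATES = {
    "legal_contract": "Generate a sample legal contract excerpt containing PII.",
    "technical_support": "Generate a technical support ticket containing PII.",
}


def pick_scenario(sample_index: int, templates: dict[str, str]) -> tuple[str, str]:
    """Return (scenario name, prompt text), cycling through scenarios in sorted order."""
    names = sorted(templates)
    name = names[sample_index % len(names)]
    return name, templates[name].strip()


if __name__ == "__main__":
    for i in range(4):
        print(i, pick_scenario(i, SCENARIO_TEMPLATES)[0])
```

Round-robin selection keeps the scenario distribution balanced by construction. Weighted sampling would be an alternative if some domains should be over-represented.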

Phase 3: Extended Label Configuration

File: model/dataset/label_utils.py

Add new PII labels:

LABEL_DESCRIPTIONS: ClassVar[dict[str, str]] = {
    # Existing labels...
    
    # NEW Identity labels
    "MIDDLENAME": "middle name",
    "NICKNAME": "nickname or alias",
    "USERNAME": "username or handle",
    "TITLE": "personal title (Dr., Prof., etc.)",
    
    # NEW Contact labels
    "MOBILE_NUMBER": "mobile phone number",
    "SOCIAL_HANDLE": "social media handle",
    
    # NEW Financial labels
    "CREDIT_CARD": "credit card number",
    "BANK_ACCOUNT": "bank account number",
    "CRYPTO_WALLET": "cryptocurrency wallet address",
    
    # NEW Technical labels
    "IP_ADDRESS": "IP address",
    "MAC_ADDRESS": "MAC address",
    "DEVICE_ID": "device identifier",
    
    # NEW Medical labels
    "MEDICAL_RECORD": "medical record number",
    "INSURANCE_ID": "insurance policy number",
    
    # ... (add all new labels)
}
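
When labels are added, generated samples can drift out of sync with the label mapping. A small check like the following could flag labels that appear in a sample's `privacy_mask` but are missing from the mapping. This is a sketch only: `unknown_labels` is a hypothetical helper, and the mapping here contains just two of the entries listed above.

```python
# Two entries copied from the label mapping above; the real mapping is larger.
LABEL_DESCRIPTIONS = {
    "IP_ADDRESS": "IP address",
    "MAC_ADDRESS": "MAC address",
}


def unknown_labels(sample: dict, known: dict[str, str]) -> set[str]:
    """Return labels used in the sample's privacy_mask that are not in `known`."""
    used = {entity["label"] for entity in sample.get("privacy_mask", [])}
    return used - set(known)


if __name__ == "__main__":
    sample = {
        "text": "IP: 192.168.1.100. Crypto wallet: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.",
        "privacy_mask": [
            {"value": "192.168.1.100", "label": "IP_ADDRESS"},
            {"value": "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa", "label": "CRYPTO_WALLET"},
        ],
    }
    print(unknown_labels(sample, LABEL_DESCRIPTIONS))
```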

Phase 4: Co-reference Complexity

File: model/dataset/prompts.py

Add co-reference complexity parameter:

def build_generation_prompt(
    labels: dict[str, str],
    languages: list[str],
    coref_complexity: str = "simple",  # NEW: simple, medium, complex
    sample_index: int = 0
) -> str:
    complexity_instructions = {
        "simple": "1-2 entities with direct pronoun references",
        "medium": "2-3 entities with cross-sentence references",
        "complex": "3+ entities with ambiguous references and nested clusters",
    }
    
    # Include in prompt...
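
How the instruction ends up in the prompt is left open above. As one possible sketch, the selected level could be turned into a single prompt line, with an explicit error for unknown levels. The `coref_instruction` helper and the exact wording of the returned line are assumptions for illustration.

```python
COMPLEXITY_INSTRUCTIONS = {
    "simple": "1-2 entities with direct pronoun references",
    "medium": "2-3 entities with cross-sentence references",
    "complex": "3+ entities with ambiguous references and nested clusters",
}


def coref_instruction(coref_complexity: str = "simple") -> str:
    """Return a one-line prompt instruction for the requested complexity level."""
    try:
        detail = COMPLEXITY_INSTRUCTIONS[coref_complexity]
    except KeyError:
        valid = ", ".join(COMPLEXITY_INSTRUCTIONS)
        raise ValueError(f"coref_complexity must be one of: {valid}") from None
    return f"Co-reference complexity: {detail}."


if __name__ == "__main__":
    print(coref_instruction("complex"))
```

Failing loudly on an unknown level avoids silently generating samples that are recorded under a complexity bucket that does not exist.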

Phase 5: Validation and Quality Control

File: model/dataset/validation.py (new)

Create validation script:

"""
Validate dataset diversity and quality.

Checks:
- Language distribution balance
- PII type coverage
- Co-reference complexity levels
- Scenario representation
- Text length distribution
- Entity density
"""

from collections import Counter


def validate_dataset_diversity(dataset_path: str) -> dict:
    """Generate diversity report."""
    
    stats = {
        "languages": Counter(),
        "pii_types": Counter(),
        "scenarios": Counter(),
        "coref_complexity": {"simple": 0, "medium": 0, "complex": 0},
        "text_lengths": [],
        "entity_densities": [],
    }
    
    # Analyze all samples...
    
    # Check for gaps
    total = sum(stats["languages"].values())  # samples counted by language
    warnings = []
    if stats["languages"]["English"] > 0.3 * total:
        warnings.append("English over-represented (>30%)")
    
    # ... more checks
    
    return {"stats": stats, "warnings": warnings}
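
The over-representation check can be generalized from English to every language. The sketch below works on in-memory samples and assumes each sample carries a `language` field, which is a hypothetical schema detail. `language_warnings` and `max_share` are illustrative names as well.

```python
from collections import Counter


def language_warnings(samples: list[dict], max_share: float = 0.3) -> list[str]:
    """Warn about every language whose share of samples exceeds max_share."""
    counts = Counter(sample["language"] for sample in samples)
    total = sum(counts.values())
    warnings = []
    for language, count in counts.most_common():
        if total and count > max_share * total:
            share = count / total
            warnings.append(f"{language} over-represented ({share:.0%} > {max_share:.0%})")
    return warnings


if __name__ == "__main__":
    demo = (
        [{"language": "English"}] * 5
        + [{"language": "German"}] * 2
        + [{"language": "French"}] * 2
        + [{"language": "Dutch"}]
    )
    print(language_warnings(demo))
```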

Phase 6: Makefile Integration

File: Makefile

Add dataset extension commands:

# Dataset extension targets
.PHONY: dataset-extend-tier1 dataset-extend-scenarios dataset-validate dataset-balance

dataset-extend-tier1:  ## Generate Tier 1 language samples (major languages)
	@echo "Generating Tier 1 language samples..."
	python -m model.dataset.training_set \
		--num_samples 5000 \
		--languages tier1 \
		--coref-complexity all

dataset-extend-scenarios:  ## Generate diverse scenario samples
	@echo "Generating scenario-diverse samples..."
	python -m model.dataset.training_set \
		--num_samples 3000 \
		--scenarios legal,technical,healthcare,financial

dataset-validate:  ## Validate dataset diversity
	python model/dataset/validation.py \
		--input model/dataset/reviewed_samples \
		--report docs/dataset_diversity_report.md

dataset-balance:  ## Balance dataset across dimensions
	python model/dataset/balance.py \
		--input model/dataset/reviewed_samples \
		--target-distribution docs/target_distribution.json

Success Criteria

Language Coverage

  • 30+ languages supported in generation
  • Each tier 1 language has 1000+ samples
  • Non-Latin scripts properly handled (Chinese, Arabic, etc.)
  • Language distribution visualization in dataset card

Co-reference Complexity

  • 30% simple (1-2 entities, direct references)
  • 50% medium (2-3 entities, cross-sentence)
  • 20% complex (3+ entities, ambiguous)
  • Validation script measures complexity levels
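
To hit the 30/50/20 split by construction rather than by chance, complexity levels could be pre-assigned to sample slots before generation. This is a minimal sketch under that assumption. `TARGET_SHARES` and `assign_complexity` are hypothetical names.

```python
import random

# Target shares taken from the success criteria above.
TARGET_SHARES = {"simple": 0.3, "medium": 0.5, "complex": 0.2}


def assign_complexity(num_samples: int, seed: int = 42) -> list[str]:
    """Return a shuffled list of complexity levels that matches the target shares."""
    levels: list[str] = []
    for level, share in TARGET_SHARES.items():
        levels.extend([level] * round(num_samples * share))
    # Rounding can leave the list short or long: pad with "medium", then trim from the end.
    levels = (levels + ["medium"] * num_samples)[:num_samples]
    random.Random(seed).shuffle(levels)
    return levels


if __name__ == "__main__":
    print(assign_complexity(10))
```

Using a seeded `random.Random` instance keeps the assignment reproducible without touching the global random state.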

PII Type Coverage

  • 50+ PII types labeled
  • All new PII types added to label mappings
  • Each PII type appears in 100+ samples
  • Rare PII types (crypto wallets, biometrics) covered

Scenario Diversity

  • 25+ distinct scenario types
  • Each scenario represented in 200+ samples
  • Balanced distribution across domains
  • Technical, legal, medical scenarios well-covered

Style Diversity

  • All 5 formality levels represented
  • 10+ writing styles included
  • Mixed format samples (forms, narratives, dialogue)
  • Cultural variations in naming/addressing

Quality Metrics

Documentation

  • Dataset card updated with diversity metrics
  • Example samples for each new category
  • Known limitations documented
  • Contribution guidelines for new data

Integration with Existing Issues

Issue #31: LabelStudio

Use LabelStudio to review new diverse samples:

  • Focus on complex co-reference examples
  • Validate non-English samples
  • Review rare PII types

Issue #32: Dataset Statistics

Generate diversity metrics:

  • Language distribution charts
  • PII type coverage heatmap
  • Co-reference complexity histogram
  • Scenario representation breakdown

Issue #36: ML Pipeline

Integrate diversity generation into pipeline:

@step
def generate_diverse_dataset(self):
    # Generate tier 1 languages
    # Generate complex co-references
    # Generate rare scenarios
    # Validate diversity
    ...  # placeholder so the stub parses; steps not yet implemented

Example Additions

Complex Co-reference Example

{
  "text": "Sarah Martinez scheduled a meeting with Dr. Chen and Tom Wilson. She sent them the agenda via email. Dr. Chen confirmed his attendance, but Tom's assistant said he might be late. Sarah's presentation will cover the project that she and Tom have been working on together. Their manager, Dr. Chen, will evaluate their progress.",
  "privacy_mask": [
    {"value": "Sarah", "label": "FIRSTNAME"},
    {"value": "Martinez", "label": "SURNAME"},
    {"value": "Dr.", "label": "TITLE"},
    {"value": "Chen", "label": "SURNAME"},
    {"value": "Tom", "label": "FIRSTNAME"},
    {"value": "Wilson", "label": "SURNAME"}
  ],
  "coreferences": [
    {
      "cluster_id": 1,
      "mentions": ["Sarah Martinez", "She", "Sarah's", "she"],
      "entity_type": "person"
    },
    {
      "cluster_id": 2,
      "mentions": ["Dr. Chen", "his", "Their manager", "Dr. Chen"],
      "entity_type": "person"
    },
    {
      "cluster_id": 3,
      "mentions": ["Tom Wilson", "Tom's", "he", "Tom"],
      "entity_type": "person"
    },
    {
      "cluster_id": 4,
      "mentions": ["the project", "that"],
      "entity_type": "object"
    }
  ]
}

Multi-language Example (Japanese)

{
  "text": "山田太郎(yamada.taro@example.jp)は東京都渋谷区1-2-3に住んでいます。電話番号は03-1234-5678です。",
  "privacy_mask": [
    {"value": "山田太郎", "label": "FULL_NAME"},
    {"value": "yamada.taro@example.jp", "label": "EMAIL"},
    {"value": "東京都", "label": "STATE"},
    {"value": "渋谷区", "label": "CITY"},
    {"value": "1-2-3", "label": "BUILDINGNUM"},
    {"value": "03-1234-5678", "label": "PHONENUMBER"}
  ],
  "coreferences": []
}

New PII Types Example

{
  "text": "Device MAC: 00:1B:44:11:3A:B7, IP: 192.168.1.100. Crypto wallet: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. Medical Record: MRN-789456.",
  "privacy_mask": [
    {"value": "00:1B:44:11:3A:B7", "label": "MAC_ADDRESS"},
    {"value": "192.168.1.100", "label": "IP_ADDRESS"},
    {"value": "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa", "label": "CRYPTO_WALLET"},
    {"value": "MRN-789456", "label": "MEDICAL_RECORD"}
  ],
  "coreferences": []
}
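
Hand-written or generated samples in this format are easy to get subtly wrong, for example by listing a mention that never occurs in the text. A basic consistency check could look like the sketch below. `check_sample` is a hypothetical helper. It uses plain substring matching, so it would not catch a mention that only appears inside a longer word.

```python
def check_sample(sample: dict) -> list[str]:
    """Return problems where annotated strings do not occur in the sample text."""
    text = sample["text"]
    problems = []
    for entity in sample.get("privacy_mask", []):
        if entity["value"] not in text:
            problems.append(f"privacy_mask value not in text: {entity['value']!r}")
    for cluster in sample.get("coreferences", []):
        for mention in cluster["mentions"]:
            if mention not in text:
                problems.append(
                    f"cluster {cluster['cluster_id']} mention not in text: {mention!r}"
                )
    return problems


if __name__ == "__main__":
    sample = {
        "text": "John Smith called. He said his email is john@email.com.",
        "privacy_mask": [{"value": "John", "label": "FIRSTNAME"}],
        "coreferences": [
            {"cluster_id": 1, "mentions": ["John Smith", "He", "his", "him"]},
        ],
    }
    print(check_sample(sample))
```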

Timeline

Phase 1 (Week 1-2): Language Expansion

  • Enable tier 1 languages
  • Generate 5000+ samples across 13 languages
  • Validate non-Latin scripts

Phase 2 (Week 3-4): Co-reference Complexity

  • Implement complexity parameters
  • Generate complex examples
  • Test co-reference detection

Phase 3 (Week 5-6): PII Type Extension

  • Add 30+ new PII types
  • Update label mappings
  • Generate samples with rare PII

Phase 4 (Week 7-8): Scenario Diversification

  • Create scenario templates
  • Generate domain-specific samples
  • Validate scenario coverage

Phase 5 (Week 9-10): Quality & Validation

  • Run diversity validation
  • Manual review with LabelStudio
  • Balance dataset
  • Update documentation

Future Enhancements

  1. Adversarial Examples: Deliberately challenging cases
  2. Synthetic Errors: Typos, formatting issues, OCR artifacts
  3. Multi-modal Data: Screenshots with PII, scanned documents
  4. Temporal Evolution: Different time periods, evolving formats
  5. Domain-Specific Expansion: Healthcare, finance, legal specialists

Notes

Target: Grow from 20k → 50k+ samples with 3x diversity improvement

This enhancement will significantly improve model robustness, reduce biases, and enable better real-world performance across languages, domains, and use cases.

Complexity: Medium-High
Impact: Very High (foundational for model quality)
Dependencies: Issues #31 (review), #32 (stats), #36 (pipeline)
