# Extend Dataset with Increased Diversity and Complexity

## Background

The current dataset (~20,000 samples) provides a solid foundation for PII detection, but has **limited diversity** across several critical dimensions. To build a robust, production-ready model that generalizes well to real-world scenarios, we need to significantly expand dataset coverage.

### Current Limitations

**Languages:**
- ✅ Active: 6 languages (English, German, French, Spanish, Dutch, Danish)
- ❌ Inactive: 30+ languages commented out in code
- **Impact:** Limited multilingual performance, Western European bias

**Co-reference Complexity:**
- ✅ Simple: 1-2 entity clusters, direct pronouns
- ❌ Missing: Multi-entity chains, cross-sentence references, ambiguous references
- **Impact:** Fails on complex conversational text

**PII Types:**
- ✅ Coverage: ~20 PII types (names, emails, addresses, IDs)
- ❌ Missing: Rare PII types, domain-specific identifiers, compound entities
- **Impact:** Misses edge cases in specialized domains

**Scenarios:**
- ✅ Coverage: Emails, forms, medical records, social posts
- ❌ Missing: Legal documents, technical support, contracts, chat logs, forums
- **Impact:** Lower accuracy on underrepresented text types

**Writing Styles:**
- ✅ Coverage: Formal business, medical, casual social
- ❌ Missing: Technical, legal, academic, slang, code-switched text
- **Impact:** Style-dependent performance degradation

---

## Enhancement Goals

### Goal 1: Expand Language Coverage (6 → 30+ languages)

**Priority Tier 1: Major World Languages**
- 🇬🇧 English (current)
- 🇪🇸 Spanish (current) 
- 🇫🇷 French (current)
- 🇩🇪 German (current)
- 🇨🇳 Chinese (Simplified & Traditional) ⭐ NEW
- 🇯🇵 Japanese ⭐ NEW
- 🇮🇳 Hindi ⭐ NEW
- 🇸🇦 Arabic ⭐ NEW
- 🇵🇹 Portuguese (Brazil & Portugal) ⭐ NEW
- 🇷🇺 Russian ⭐ NEW
- 🇰🇷 Korean ⭐ NEW

**Priority Tier 2: Regional Languages**
- 🇮🇹 Italian
- 🇳🇱 Dutch (current)
- 🇵🇱 Polish
- 🇹🇷 Turkish
- 🇸🇪 Swedish
- 🇳🇴 Norwegian
- 🇩🇰 Danish (current)
- 🇫🇮 Finnish
- 🇬🇷 Greek
- 🇷🇴 Romanian
- 🇺🇦 Ukrainian
- 🇨🇿 Czech
- 🇮🇩 Indonesian
- 🇻🇳 Vietnamese
- 🇹🇭 Thai
- 🇵🇭 Tagalog
- 🇰🇪 Swahili
- 🇿🇦 Afrikaans

### Goal 2: Increase Co-reference Complexity

**Current:**
```
"John Smith called. He said his email is john@email.com."
Clusters: [John Smith, He, his] → Simple 1-cluster
```

**Target - Multi-Entity Chains:**
```
"Sarah met Tom at the office. She told him about the project. 
They discussed it with their manager, Dr. Chen, who approved 
their proposal. Sarah's team will work with Tom's department."

Clusters:
- Cluster 1: Sarah, She, Sarah's
- Cluster 2: Tom, him, Tom's
- Cluster 3: Dr. Chen, their manager, who
- Cluster 4: the project, it, their proposal
- Cluster 5: They, their, their (Sarah and Tom jointly; could also be read as including Dr. Chen)
```

**Target - Cross-Sentence References:**
```
"Contact: john.doe@company.com
Department: Engineering
Manager: Sarah Thompson

John has been with the company for 5 years. His expertise
in machine learning makes him an excellent candidate. We
recommend scheduling an interview with him and his team lead."

Cluster: John, His, him, him, his (spans three sentences and links back to the Contact field)
```

**Target - Ambiguous References:**
```
"The customer called about their order. The representative 
said they would check the status. They both agreed to follow up."

Ambiguous "they" - requires context resolution
```

### Goal 3: Expand PII Types

**Current Coverage (~20 types; 22 listed here):**
- Names: FIRSTNAME, SURNAME
- Contact: EMAIL, PHONENUMBER, URL
- Location: ADDRESS, CITY, STATE, ZIP, COUNTRY, STREET, BUILDINGNUM
- IDs: SSN, DRIVER_LICENSE, PASSPORT, NATIONAL_ID, IDCARDNUM, TAXNUM
- Financial: IBAN
- Other: DATEOFBIRTH, AGE, COMPANYNAME

**New PII Types to Add:**

**Identity:**
- MIDDLENAME
- NICKNAME / ALIAS
- USERNAME (social media handles)
- FULL_NAME (compound entity)
- TITLE (Dr., Prof., Mr., Mrs.)

**Contact Extended:**
- FAX_NUMBER
- MOBILE_NUMBER (separate from landline)
- EXTENSION (phone extension)
- SOCIAL_MEDIA_HANDLE (@username)
- SKYPE_ID / ZOOM_ID

**Location Extended:**
- APARTMENT_NUMBER
- FLOOR_NUMBER
- POSTAL_CODE (international)
- LANDMARK
- GPS_COORDINATES

**IDs Extended:**
- MEDICAL_RECORD_NUMBER
- PATIENT_ID
- EMPLOYEE_ID
- STUDENT_ID
- INSURANCE_POLICY_NUMBER
- MEMBERSHIP_NUMBER
- VEHICLE_VIN
- LICENSE_PLATE (extended beyond LICENSEPLATENUM)

**Financial Extended:**
- CREDIT_CARD_NUMBER
- BANK_ACCOUNT_NUMBER
- ROUTING_NUMBER
- SWIFT_CODE
- CRYPTO_WALLET_ADDRESS
- PAYMENT_CARD_CVV

**Technical / Biometric:**
- IP_ADDRESS
- MAC_ADDRESS
- DEVICE_ID / IMEI
- BIOMETRIC_ID

**Temporal:**
- TIME (specific time references)
- TIMESTAMP
- AGE_RANGE (instead of exact age)

**Professional:**
- JOB_TITLE
- EMPLOYER
- SALARY / COMPENSATION
- WORK_EMAIL (vs personal)

**Medical:**
- MEDICAL_CONDITION
- PRESCRIPTION_NUMBER
- HEALTHCARE_PROVIDER
- INSURANCE_GROUP_NUMBER

### Goal 4: Diversify Scenarios and Contexts

**Current Scenarios:**
- Business emails
- Medical records
- Customer service forms
- Social media posts

**New Scenarios to Add:**

**Professional:**
- Legal contracts and NDAs
- HR documents (resumes, performance reviews)
- Technical support tickets
- Meeting transcripts / notes
- Project proposals
- Sales communications

**Consumer:**
- E-commerce orders and receipts
- Shipping/tracking notifications
- Product reviews with PII
- Customer complaints
- Subscription cancellations
- Warranty claims

**Financial:**
- Bank statements
- Loan applications
- Tax documents
- Investment portfolios
- Insurance claims
- Payment disputes

**Healthcare:**
- Patient intake forms
- Lab results
- Prescription orders
- Insurance authorizations
- Medical histories
- Appointment schedules

**Educational:**
- Student registration forms
- Transcripts and grades
- Recommendation letters
- Scholarship applications
- Course evaluations

**Government:**
- Visa applications
- Permit requests
- Voter registration
- Census forms
- Benefits applications

**Social/Personal:**
- Dating profiles
- Forum posts with personal info
- Chat conversations
- Blog comments
- Personal introductions

**Technical:**
- Bug reports with user data
- API logs with PII
- Database dumps (sanitized examples)
- Error messages with personal info
- System logs

### Goal 5: Increase Stylistic Diversity

**Formality Levels:**
- Very Formal: Legal, government, academic
- Formal: Business, medical, official
- Semi-Formal: Professional emails, reports
- Informal: Social media, blogs, forums
- Very Informal: Chat, SMS, slang

**Writing Styles:**
- Technical documentation
- Legal language
- Academic papers
- Journalistic articles
- Marketing copy
- Personal narratives
- Instructional text
- Conversational dialogue

**Tonal Variations:**
- Neutral/objective
- Friendly/warm
- Urgent/pressing
- Apologetic
- Assertive
- Empathetic

**Text Formats:**
- Structured forms (key: value)
- Free-form narratives
- Bullet points / lists
- Tables
- Mixed formats
- Code snippets with PII
- JSON/XML with personal data

### Goal 6: Add Cross-Cultural Variations

**Naming Conventions:**
- Western: First Last
- Eastern: Last First (Chinese, Korean, Japanese)
- Hispanic: Multiple surnames (María García López)
- Arabic: Long patronymic chains (Ahmed bin Mohammed bin...)
- Single names (Indonesian, Thai)
- Generational suffixes as part of name (Jr., Sr., III)

**Address Formats:**
- US: Street, City, State ZIP
- UK: Street, Town, County, Postcode
- German: Street Number (house number follows the street name), PLZ City
- Japanese: Prefecture, City, Ward, Block
- Arabic-speaking regions: often descriptive landmarks instead of street names

**Date Formats:**
- US: MM/DD/YYYY
- European: DD/MM/YYYY
- ISO: YYYY-MM-DD
- Written: January 15, 2024 / 15 January 2024
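
To make the date variations above concrete, here is a minimal sketch that renders one date in each listed format. The helper name `render_date_variants` and the dictionary keys are illustrative assumptions, not part of the existing codebase, and the written forms rely on the default (English) locale for month names.

```python
from datetime import date


def render_date_variants(d: date) -> dict[str, str]:
    """Render one date in each of the formats listed above (hypothetical helper)."""
    month = d.strftime("%B")  # English month name under the default C locale
    return {
        "us": d.strftime("%m/%d/%Y"),
        "european": d.strftime("%d/%m/%Y"),
        "iso": d.isoformat(),
        "written_us": f"{month} {d.day}, {d.year}",
        "written_eu": f"{d.day} {month} {d.year}",
    }


variants = render_date_variants(date(2024, 1, 15))
for name, text in variants.items():
    print(f"{name}: {text}")
```

Generating several renderings of the same underlying date would let one sample family cover all formats without changing the labeled entity.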

**Phone Number Formats:**
- US: (555) 123-4567
- International: +1-555-123-4567
- UK: 020 7946 0958
- Various regional formats

---

## Implementation Plan

### Phase 1: Language Expansion

**File:** `model/dataset/training_set.py`

Enable commented-out languages progressively:

```python
def get_languages(
    self, language_count: int = 10, seed: int | None = 42, is_testing: bool = False
) -> list[str]:
    # Tier 1: currently active languages first, then the new major world languages
    tier1_languages = (
        "English", "German", "French", "Spanish", "Dutch", "Danish",
        "Chinese (Simplified)", "Japanese", "Hindi", "Arabic",  # NEW
        "Portuguese", "Russian", "Korean",  # NEW
    )

    # Tier 2: Regional languages
    tier2_languages = (
        "Italian", "Polish", "Turkish", "Swedish",
        "Norwegian", "Finnish", "Greek", "Romanian",
        "Ukrainian", "Czech", "Indonesian", "Afrikaans",
        "Vietnamese", "Thai", "Tagalog", "Swahili",
    )

    all_languages = tier1_languages + tier2_languages

    # NOTE: `seed` is not used in this excerpt.
    if is_testing:
        return [all_languages[0]]
    # Return a list in both branches (slicing a tuple would return a tuple).
    return list(all_languages[:language_count])
```

### Phase 2: Enhanced Prompt Templates

**File:** `model/dataset/prompts.py`

Add scenario-specific prompts:

```python
SCENARIO_TEMPLATES = {
    "legal_contract": """
Generate a sample legal contract excerpt containing PII such as:
- Party names (full legal names with titles)
- Addresses (complete with apartment/suite numbers)
- Contact information
- Identification numbers
- Dates and signatures
Include complex co-reference chains with legal terminology.
""",
    
    "technical_support": """
Generate a technical support ticket containing:
- Customer name and contact info
- Account numbers / user IDs
- Device identifiers (IP, MAC address)
- Timestamps and system details
Include conversational back-and-forth with pronouns.
""",
    
    "multi_party_email": """
Generate an email thread with 3+ participants where:
- Multiple people are CC'd with names and emails
- People refer to each other with pronouns
- Contains complex co-reference chains
- Includes meeting times, locations, phone numbers
""",
    
    # Add 20+ more scenario templates...
}
```
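
One way to spread samples evenly across the templates above is a simple round-robin on the sample index. The sketch below is only an illustration: `pick_scenario` is a hypothetical helper, and the template bodies are shortened stand-ins for the real `SCENARIO_TEMPLATES` entries.

```python
from collections import Counter

# Shortened stand-ins for the SCENARIO_TEMPLATES entries shown above.
SCENARIO_TEMPLATES = {
    "legal_contract": "Generate a sample legal contract excerpt ...",
    "technical_support": "Generate a technical support ticket ...",
    "multi_party_email": "Generate an email thread with 3+ participants ...",
}


def pick_scenario(sample_index: int, templates: dict[str, str]) -> tuple[str, str]:
    """Hypothetical helper: cycle through scenarios so each gets an equal share."""
    names = sorted(templates)  # sorted for a stable order across runs
    name = names[sample_index % len(names)]
    return name, templates[name].strip()


assignments = [pick_scenario(i, SCENARIO_TEMPLATES)[0] for i in range(6)]
print(Counter(assignments))
```

Round-robin keeps the scenario distribution balanced by construction; weighted sampling would be an alternative if some scenarios should be over-represented.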

### Phase 3: Extended Label Configuration

**File:** `model/dataset/label_utils.py`

Add new PII labels:

```python
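# Class-level attribute; needs `from typing import ClassVar` at module top.
# Keys below use shortened forms of some Goal 3 names
# (e.g. CREDIT_CARD for CREDIT_CARD_NUMBER).
LABEL_DESCRIPTIONS: ClassVar[dict[str, str]] = {
    # Existing labels...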
    
    # NEW Identity labels
    "MIDDLENAME": "middle name",
    "NICKNAME": "nickname or alias",
    "USERNAME": "username or handle",
    "TITLE": "personal title (Dr., Prof., etc.)",
    
    # NEW Contact labels
    "MOBILE_NUMBER": "mobile phone number",
    "SOCIAL_HANDLE": "social media handle",
    
    # NEW Financial labels
    "CREDIT_CARD": "credit card number",
    "BANK_ACCOUNT": "bank account number",
    "CRYPTO_WALLET": "cryptocurrency wallet address",
    
    # NEW Technical labels
    "IP_ADDRESS": "IP address",
    "MAC_ADDRESS": "MAC address",
    "DEVICE_ID": "device identifier",
    
    # NEW Medical labels
    "MEDICAL_RECORD": "medical record number",
    "INSURANCE_ID": "insurance policy number",
    
    # ... (add all new labels)
}
```
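
When labels are added incrementally, generated samples can reference labels that are not yet in the label map. A minimal sketch of a consistency check follows; `find_unknown_labels` and the `KNOWN_LABELS` subset are assumptions for illustration, and the sample follows the `privacy_mask` structure used in the examples in this issue.

```python
# Illustrative subset of label keys; the real set would come from LABEL_DESCRIPTIONS.
KNOWN_LABELS = {"FIRSTNAME", "SURNAME", "EMAIL", "MAC_ADDRESS", "IP_ADDRESS"}


def find_unknown_labels(sample: dict, known_labels: set[str]) -> list[str]:
    """Return labels used in privacy_mask that are missing from the label map."""
    used = {entity["label"] for entity in sample.get("privacy_mask", [])}
    return sorted(used - known_labels)


sample = {
    "text": "Device MAC: 00:1B:44:11:3A:B7. Crypto wallet: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.",
    "privacy_mask": [
        {"value": "00:1B:44:11:3A:B7", "label": "MAC_ADDRESS"},
        {"value": "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa", "label": "CRYPTO_WALLET"},
    ],
}

unknown = find_unknown_labels(sample, KNOWN_LABELS)
print(unknown)
```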

### Phase 4: Co-reference Complexity

**File:** `model/dataset/prompts.py`

Add co-reference complexity parameter:

```python
def build_generation_prompt(
    labels: dict[str, str],
    languages: list[str],
    coref_complexity: str = "simple",  # NEW: simple, medium, complex
    sample_index: int = 0
) -> str:
    complexity_instructions = {
        "simple": "1-2 entities with direct pronoun references",
        "medium": "2-3 entities with cross-sentence references",
        "complex": "3+ entities with ambiguous references and nested clusters",
    }
    
    # Include in prompt...
```
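
To decide which complexity level each generated sample should request, one option is seeded weighted sampling against the 30/50/20 split given under Success Criteria. This is a sketch under that assumption; `sample_complexities` and `COMPLEXITY_WEIGHTS` are hypothetical names.

```python
import random
from collections import Counter

# Target shares taken from the Success Criteria section of this issue.
COMPLEXITY_WEIGHTS = {"simple": 0.3, "medium": 0.5, "complex": 0.2}


def sample_complexities(n: int, seed: int = 42) -> list[str]:
    """Draw n complexity levels with the target shares, reproducibly."""
    rng = random.Random(seed)  # local RNG so global random state is untouched
    levels = list(COMPLEXITY_WEIGHTS)
    weights = list(COMPLEXITY_WEIGHTS.values())
    return rng.choices(levels, weights=weights, k=n)


counts = Counter(sample_complexities(10_000))
for level in COMPLEXITY_WEIGHTS:
    print(level, counts[level] / 10_000)
```

Each drawn level could then be passed as `coref_complexity` to `build_generation_prompt`.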

### Phase 5: Validation and Quality Control

**File:** `model/dataset/validation.py` (new)

Create validation script:

```python
"""
Validate dataset diversity and quality.

Checks:
- Language distribution balance
- PII type coverage
- Co-reference complexity levels
- Scenario representation
- Text length distribution
- Entity density
"""

from collections import Counter


def validate_dataset_diversity(dataset_path: str) -> dict:
    """Generate diversity report."""

    stats = {
        "languages": Counter(),
        "pii_types": Counter(),
        "scenarios": Counter(),
        "coref_complexity": {"simple": 0, "medium": 0, "complex": 0},
        "text_lengths": [],
        "entity_densities": [],
    }

    # Analyze all samples...

    # Check for gaps
    total = sum(stats["languages"].values())  # number of samples with a language tag
    warnings = []
    if total and stats["languages"]["English"] > 0.3 * total:
        warnings.append("English over-represented (>30%)")

    # ... more checks

    return {"stats": stats, "warnings": warnings}
```
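
The English check above generalizes to any counted dimension. A minimal sketch, assuming the 30% threshold applies per category; `find_dominant_categories` is a hypothetical helper and the counts are made-up illustration values, not dataset statistics.

```python
from collections import Counter


def find_dominant_categories(counts: Counter, max_share: float = 0.30) -> list[str]:
    """Return categories whose share of the total exceeds max_share."""
    total = sum(counts.values())
    if total == 0:
        return []  # nothing counted yet, so nothing can dominate
    return sorted(name for name, n in counts.items() if n / total > max_share)


# Made-up counts for illustration only.
language_counts = Counter({"English": 450, "German": 200, "French": 200, "Japanese": 150})
dominant = find_dominant_categories(language_counts)
print(dominant)
```

The same function could be applied to the `pii_types` and `scenarios` counters to build the `warnings` list.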

### Phase 6: Makefile Integration

**File:** `Makefile`

Add dataset extension commands:

```makefile
# Dataset extension targets
.PHONY: dataset-extend-tier1 dataset-extend-scenarios dataset-validate dataset-balance

dataset-extend-tier1:  ## Generate Tier 1 language samples (major languages)
	@echo "Generating Tier 1 language samples..."
	python -m model.dataset.training_set \
		--num_samples 5000 \
		--languages tier1 \
		--coref-complexity all

dataset-extend-scenarios:  ## Generate diverse scenario samples
	@echo "Generating scenario-diverse samples..."
	python -m model.dataset.training_set \
		--num_samples 3000 \
		--scenarios legal,technical,healthcare,financial

dataset-validate:  ## Validate dataset diversity
	python model/dataset/validation.py \
		--input model/dataset/reviewed_samples \
		--report docs/dataset_diversity_report.md

dataset-balance:  ## Balance dataset across dimensions
	python model/dataset/balance.py \
		--input model/dataset/reviewed_samples \
		--target-distribution docs/target_distribution.json
```
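
The targets above pass flags that the generation entry point would need to accept. The following is a hypothetical sketch of such a parser using `argparse`; the flag names are copied from the Makefile, while the defaults, the allowed choices, and the `build_parser` name are assumptions rather than the module's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for the generation module; defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="model.dataset.training_set")
    parser.add_argument("--num_samples", type=int, default=1000)
    parser.add_argument("--languages", choices=["tier1", "tier2", "all"], default="all")
    parser.add_argument(
        "--coref-complexity",
        choices=["simple", "medium", "complex", "all"],
        default="simple",
    )
    # Comma-separated list, e.g. "legal,technical"
    parser.add_argument("--scenarios", type=lambda s: s.split(","), default=[])
    return parser


args = build_parser().parse_args(
    ["--num_samples", "5000", "--languages", "tier1", "--coref-complexity", "all"]
)
print(args.num_samples, args.languages, args.coref_complexity, args.scenarios)
```

Note that `argparse` exposes `--coref-complexity` as `args.coref_complexity`, so the mixed underscore and hyphen styles in the Makefile both work, though settling on one style would be cleaner.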

---

## Success Criteria

### Language Coverage
- [ ] 30+ languages supported in generation
- [ ] Each tier 1 language has 1000+ samples
- [ ] Non-Latin scripts properly handled (Chinese, Arabic, etc.)
- [ ] Language distribution visualization in dataset card

### Co-reference Complexity
- [ ] 30% simple (1-2 entities, direct references)
- [ ] 50% medium (2-3 entities, cross-sentence)
- [ ] 20% complex (3+ entities, ambiguous)
- [ ] Validation script measures complexity levels

### PII Type Coverage
- [ ] 50+ PII types labeled
- [ ] All new PII types added to label mappings
- [ ] Each PII type appears in 100+ samples
- [ ] Rare PII types (crypto wallets, biometrics) covered

### Scenario Diversity
- [ ] 25+ distinct scenario types
- [ ] Each scenario represented in 200+ samples
- [ ] Balanced distribution across domains
- [ ] Technical, legal, medical scenarios well-covered

### Style Diversity
- [ ] All 5 formality levels represented
- [ ] 10+ writing styles included
- [ ] Mixed format samples (forms, narratives, dialogue)
- [ ] Cultural variations in naming/addressing

### Quality Metrics
- [ ] Dataset validation script passes all checks
- [ ] No single category exceeds 30% of samples within its dimension (co-reference complexity follows the 30/50/20 split above)
- [ ] Manual review of 100 random samples
- [ ] Integration with LabelStudio (Issue #31) for QA

### Documentation
- [ ] Dataset card updated with diversity metrics
- [ ] Example samples for each new category
- [ ] Known limitations documented
- [ ] Contribution guidelines for new data

---

## Integration with Existing Issues

### Issue #31: LabelStudio
Use LabelStudio to review new diverse samples:
- Focus on complex co-reference examples
- Validate non-English samples
- Review rare PII types

### Issue #32: Dataset Statistics
Generate diversity metrics:
- Language distribution charts
- PII type coverage heatmap
- Co-reference complexity histogram
- Scenario representation breakdown

### Issue #36: ML Pipeline
Integrate diversity generation into pipeline:
```python
@step
def generate_diverse_dataset(self):
    # Generate tier 1 languages
    # Generate complex co-references
    # Generate rare scenarios
    # Validate diversity
    ...  # placeholder body so the stub is valid Python
```

---

## Example Additions

### Complex Co-reference Example

```json
{
  "text": "Sarah Martinez scheduled a meeting with Dr. Chen and Tom Wilson. She sent them the agenda via email. Dr. Chen confirmed his attendance, but Tom's assistant said he might be late. Sarah's presentation will cover the project that she and Tom have been working on together. Their manager, Dr. Chen, will evaluate their progress.",
  "privacy_mask": [
    {"value": "Sarah", "label": "FIRSTNAME"},
    {"value": "Martinez", "label": "SURNAME"},
    {"value": "Dr.", "label": "TITLE"},
    {"value": "Chen", "label": "SURNAME"},
    {"value": "Tom", "label": "FIRSTNAME"},
    {"value": "Wilson", "label": "SURNAME"}
  ],
  "coreferences": [
    {
      "cluster_id": 1,
      "mentions": ["Sarah Martinez", "She", "Sarah's", "she"],
      "entity_type": "person"
    },
    {
      "cluster_id": 2,
      "mentions": ["Dr. Chen", "his", "Their manager", "Dr. Chen"],
      "entity_type": "person"
    },
    {
      "cluster_id": 3,
      "mentions": ["Tom Wilson", "Tom's", "he", "Tom"],
      "entity_type": "person"
    },
    {
      "cluster_id": 4,
      "mentions": ["the project", "that"],
      "entity_type": "object"
    }
  ]
}
```
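
Hand-written or generated cluster annotations can drift from the text, for example by listing a pronoun that never occurs. A minimal sketch of a sanity check follows; `find_missing_mentions` is a hypothetical helper, and it uses a plain substring test, so it cannot tell a standalone "he" from the "he" inside "She".

```python
def find_missing_mentions(sample: dict) -> list[tuple[int, str]]:
    """Return (cluster_id, mention) pairs whose mention text is absent from the sample text."""
    text = sample["text"]
    missing = []
    for cluster in sample.get("coreferences", []):
        for mention in cluster["mentions"]:
            # Naive substring test; token-level offsets would be stricter.
            if mention not in text:
                missing.append((cluster["cluster_id"], mention))
    return missing


sample = {
    "text": "Sarah met Tom at the office. She told him about the project.",
    "coreferences": [
        {"cluster_id": 1, "mentions": ["Sarah", "She", "her"], "entity_type": "person"},
        {"cluster_id": 2, "mentions": ["Tom", "him"], "entity_type": "person"},
    ],
}

missing = find_missing_mentions(sample)
print(missing)
```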

### Multi-language Example (Japanese)

```json
{
  "text": "山田太郎（yamada.taro@example.jp）は東京都渋谷区1-2-3に住んでいます。電話番号は03-1234-5678です。",
  "privacy_mask": [
    {"value": "山田太郎", "label": "FULL_NAME"},
    {"value": "yamada.taro@example.jp", "label": "EMAIL"},
    {"value": "東京都", "label": "STATE"},
    {"value": "渋谷区", "label": "CITY"},
    {"value": "1-2-3", "label": "BUILDINGNUM"},
    {"value": "03-1234-5678", "label": "PHONENUMBER"}
  ],
  "coreferences": []
}
```
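
To check that non-Latin samples actually contain the expected script, one rough approach is to group alphabetic characters by the first word of their Unicode character name. This is a sketch using the standard `unicodedata` module; `script_counts` is a hypothetical helper, and the name-prefix heuristic is an approximation rather than a proper Unicode script lookup.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, punctuation, and whitespace
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


counts = script_counts("山田太郎（yamada.taro@example.jp）は東京都渋谷区に住んでいます。")
print(counts)
```

For the sentence above this yields three groups (`CJK`, `HIRAGANA`, `LATIN`), which is enough to flag a "Japanese" sample that came back entirely in Latin characters.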

### New PII Types Example

```json
{
  "text": "Device MAC: 00:1B:44:11:3A:B7, IP: 192.168.1.100. Crypto wallet: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. Medical Record: MRN-789456.",
  "privacy_mask": [
    {"value": "00:1B:44:11:3A:B7", "label": "MAC_ADDRESS"},
    {"value": "192.168.1.100", "label": "IP_ADDRESS"},
    {"value": "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa", "label": "CRYPTO_WALLET"},
    {"value": "MRN-789456", "label": "MEDICAL_RECORD"}
  ],
  "coreferences": []
}
```
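
Some of the new technical PII types have well-defined formats, so generated values can be checked mechanically. A minimal sketch for two of them: the MAC check assumes colon-separated hex pairs as in the example above, and the IP check uses the standard `ipaddress` module; the function names are illustrative.

```python
import ipaddress
import re

# Six colon-separated hex pairs, as in the example above (other MAC notations exist).
MAC_PATTERN = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")


def is_valid_mac(value: str) -> bool:
    """True if value is a colon-separated 48-bit MAC address."""
    return MAC_PATTERN.fullmatch(value) is not None


def is_valid_ip(value: str) -> bool:
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


checks = {
    "mac": is_valid_mac("00:1B:44:11:3A:B7"),
    "ip": is_valid_ip("192.168.1.100"),
    "bad_ip": is_valid_ip("192.168.1.300"),
}
print(checks)
```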

---

## Timeline

**Phase 1 (Week 1-2): Language Expansion**
- Enable tier 1 languages
- Generate 5000+ samples across 13 languages
- Validate non-Latin scripts

**Phase 2 (Week 3-4): Co-reference Complexity**
- Implement complexity parameters
- Generate complex examples
- Test co-reference detection

**Phase 3 (Week 5-6): PII Type Extension**
- Add 30+ new PII types
- Update label mappings
- Generate samples with rare PII

**Phase 4 (Week 7-8): Scenario Diversification**
- Create scenario templates
- Generate domain-specific samples
- Validate scenario coverage

**Phase 5 (Week 9-10): Quality & Validation**
- Run diversity validation
- Manual review with LabelStudio
- Balance dataset
- Update documentation

---

## Future Enhancements

1. **Adversarial Examples**: Deliberately challenging cases
2. **Synthetic Errors**: Typos, formatting issues, OCR artifacts
3. **Multi-modal Data**: Screenshots with PII, scanned documents
4. **Temporal Evolution**: Different time periods, evolving formats
5. **Domain-Specific Expansion**: Healthcare, finance, legal specialists

---

## References

- [Dataset Diversity Best Practices](https://arxiv.org/abs/2204.13399)
- [Multilingual NER Challenges](https://arxiv.org/abs/2002.12127)
- [Co-reference Resolution Survey](https://arxiv.org/abs/2009.12727)
- Issue #31: LabelStudio for human review
- Issue #32: Dataset statistics and visualization
- Issue #36: ML pipeline integration

---

## Notes

**Target**: Grow from 20k → 50k+ samples with 3x diversity improvement

This enhancement will significantly improve model robustness, reduce biases, and enable better real-world performance across languages, domains, and use cases.

**Complexity**: Medium-High  
**Impact**: Very High (foundational for model quality)  
**Dependencies**: Issues #31 (review), #32 (stats), #36 (pipeline)


