[Feature] PII Masking Rules at Ingestion

## Feature Description
Automatically detect and mask Personally Identifiable Information (PII) in log entries before they are stored, supporting GDPR compliance and reducing data breach risk. The feature provides configurable patterns for common PII types (emails, credit cards, phone numbers, SSNs, IP addresses, API keys) and three masking strategies (mask, hash, redact).

## Problem/Use Case
**Current problem:**
- Developers accidentally log sensitive data (emails, credit cards, passwords)
- Once logged, PII is stored permanently and visible to anyone with log access
- GDPR violations can result in fines of up to €20 million or 4% of worldwide annual turnover, whichever is higher
- Security audits flag PII in logs as high-risk
- Manual PII removal is time-consuming and error-prone
- Compliance teams demand "no PII in logs" policies

**Real-world scenarios:**

**Scenario 1: Accidental credit card logging**
```javascript
// Developer logs entire request for debugging
logger.info('Payment request:', JSON.stringify(req.body));
// → Logs credit card number, CVV, everything

// Without PII masking:
// {"card_number": "4532-1234-5678-9014", "cvv": "123"}

// With PII masking:
// {"card_number": "****-****-****-9014", "cvv": "***"}
```

**Scenario 2: Email addresses in errors**
```
Error: Invalid email format for user@example.com
→ Email visible to all log viewers

With PII masking:
Error: Invalid email format for u***@e***.com
```

**Scenario 3: GDPR "right to be forgotten"**
```
User requests data deletion
→ Must scrub their email from ALL logs (nightmare!)

With PII masking:
→ The full email was never stored; logs only ever contained the masked form
```

## Proposed Solution

**Implement configurable PII detection and masking at log ingestion:**

**Phase 1: Common PII patterns**
```yaml
pii_masking:
  enabled: true
  
  patterns:
    - type: email
      regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
      action: mask  # user@example.com → u***@e***.com
      
    - type: credit_card
      regex: "\\b(?:\\d{4}[- ]?){3}\\d{4}\\b"
      action: mask  # 4532-1234-5678-9014 → ****-****-****-9014
      luhn_check: true  # Only match valid card numbers
      
    - type: phone_number
      regex: "(?<!\\w)(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b"
      action: mask  # +1-555-123-4567 → +1-555-***-****
      
    - type: ssn
      regex: "\\b(?!000|666|9\\d{2})\\d{3}-?(?!00)\\d{2}-?(?!0000)\\d{4}\\b"
      action: redact  # 123-45-6789 → ***-**-****
      
    - type: ip_address
      regex: "\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b"
      action: hash  # 192.168.1.100 → ip_b4c6d8e0f2a1b3c5
      enabled: false  # Optional, disabled by default
      
    - type: api_key
      regex: "\\b[A-Za-z0-9_-]{32,}\\b"
      context: "(?:api[_-]?key|token|secret|password)"  # Only match near these words
      action: redact  # sk_live_abc123... → [REDACTED_API_KEY]
```
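The `context` option on the `api_key` pattern says "only match near these words" but leaves "near" undefined. One possible reading is sketched below, assuming a context word must appear within a fixed number of characters before the candidate. `CONTEXT_WINDOW`, its value, and `redactNearContext` are hypothetical names chosen for illustration, not part of this proposal's config.

```typescript
// Hypothetical: how far back (in characters) to look for a context word.
const CONTEXT_WINDOW = 40;

// Replace each candidate match with `placeholder`, but only when `context`
// matches somewhere in the CONTEXT_WINDOW characters before the candidate.
// `context` must NOT carry the `g` flag, because `test` would then be stateful.
function redactNearContext(
  message: string,
  candidate: RegExp,
  context: RegExp,
  placeholder: string,
): string {
  // Work on a global copy so the caller's regex state is left untouched.
  const flags = candidate.flags.includes('g') ? candidate.flags : candidate.flags + 'g';
  const re = new RegExp(candidate.source, flags);

  let out = '';
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(message)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex++; // guard against patterns that can match the empty string
      continue;
    }
    const before = message.slice(Math.max(0, m.index - CONTEXT_WINDOW), m.index);
    if (context.test(before)) {
      out += message.slice(last, m.index) + placeholder;
      last = m.index + m[0].length;
    }
  }
  return out + message.slice(last);
}

const apiKey = /\b[A-Za-z0-9_-]{32,}\b/;
const keyContext = /(?:api[_-]?key|token|secret|password)/i;

// Preceded by "api_key=", so the long token is redacted.
const withContext = redactNearContext(
  'api_key=sk_live_0123456789abcdef0123456789abcdef',
  apiKey, keyContext, '[REDACTED_API_KEY]');

// A 40-character commit hash with no context word nearby is left alone.
const withoutContext = redactNearContext(
  'deployed commit 0123456789abcdef0123456789abcdef01234567',
  apiKey, keyContext, '[REDACTED_API_KEY]');
```

A window keeps long hex strings such as commit hashes or request IDs from being redacted just because the word "token" appears elsewhere in the same line.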

**Masking strategies:**

1. **mask**: Partial masking, keep some characters for debugging
   ```
   user@example.com → u***@e***.com
   4532-1234-5678-9014 → ****-****-****-9014
   ```

2. **redact**: Replace the entire value with a typed placeholder
   ```
   user@example.com → [REDACTED_EMAIL]
   123-45-6789 → [REDACTED_SSN]
   ```

3. **hash**: One-way salted hash (identical values map to the same token, so events can still be correlated; the original value cannot be read back)
   ```
   user@example.com → email_a3f5b8c9d1e2f3a4
   192.168.1.100 → ip_b4c6d8e0f2a1b3c5
   ```

**Phase 2: Structured field masking**
```jsonc
// For structured logs (JSON)
{
  "user": {
    "email": "user@example.com",  // ← Masked
    "id": "user_123",
    "ip": "192.168.1.100"  // ← Optionally masked (ip_address is disabled by default)
  },
  "payment": {
    "card_number": "4532-1234-5678-9014",  // ← Masked
    "amount": 49.99
  }
}

// After masking:
{
  "user": {
    "email": "u***@e***.com",
    "id": "user_123",
    "ip": "192.168.1.100"
  },
  "payment": {
    "card_number": "****-****-****-9014",
    "amount": 49.99
  }
}
```

**Phase 3: Custom patterns**
```yaml
# User-defined patterns for company-specific PII
custom_patterns:
  - name: "Internal Employee ID"
    regex: "EMP-[0-9]{6}"
    action: hash
    
  - name: "Customer Reference"
    regex: "CUS-[A-Z0-9]{8}"
    action: mask
```
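User-supplied regexes can break ingestion if they fail to compile or match the empty string. A minimal validation sketch follows, assuming custom patterns arrive as plain objects shaped like the YAML above. `CustomPatternConfig`, `validateCustomPattern`, and `MAX_REGEX_LENGTH` are hypothetical names and limits.

```typescript
// Shape of one entry under `custom_patterns` in the YAML above.
interface CustomPatternConfig {
  name: string;
  regex: string;
  action: 'mask' | 'redact' | 'hash';
}

// Hypothetical limit: very long user-supplied regexes are hard to review.
const MAX_REGEX_LENGTH = 256;

// Returns a list of problems; an empty list means the pattern is acceptable.
function validateCustomPattern(p: CustomPatternConfig): string[] {
  const problems: string[] = [];
  if (p.name.trim() === '') problems.push('name is empty');
  if (p.regex.length > MAX_REGEX_LENGTH) problems.push('regex is too long');

  let compiled: RegExp | null = null;
  try {
    compiled = new RegExp(p.regex);
  } catch (_err) {
    problems.push('regex does not compile');
  }

  // A pattern that matches the empty string would fire at every position.
  if (compiled !== null && compiled.test('')) {
    problems.push('regex matches the empty string');
  }
  return problems;
}

const okPattern = validateCustomPattern(
  { name: 'Internal Employee ID', regex: 'EMP-[0-9]{6}', action: 'hash' });
const brokenPattern = validateCustomPattern(
  { name: 'Broken', regex: 'EMP-[0-9', action: 'hash' });
const emptyMatchPattern = validateCustomPattern(
  { name: 'Too greedy', regex: '[0-9]*', action: 'mask' });
```

This does not detect catastrophic backtracking; a production version would probably also need a timeout or a linear-time regex engine for untrusted patterns.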

## Alternatives Considered

1. **Client-side masking (in SDKs)**
   - ✗ Can't enforce (devs forget to use it)
   - ✗ Doesn't help with syslog, OTLP, or raw logs
   - ✓ Could complement server-side masking

2. **Post-processing masking (after storage)**
   - ✗ PII already stored (compliance violation)
   - ✗ Can't fully delete (backups, replicas)
   - ✗ Doesn't prevent data breaches

3. **No masking, rely on access controls**
   - ✗ Doesn't solve accidental logging
   - ✗ Doesn't help with data breaches
   - ✗ Doesn't satisfy GDPR requirements

4. **Manual review before logging**
   - ✗ Impossible at scale
   - ✗ Human error inevitable
   - ✗ Slows down development

**Chosen approach:** Automatic detection and masking at ingestion (before storage)

## Implementation Details (Optional)

**Technical implementation:**

**1. Ingestion pipeline integration**
```typescript
// Add PII masking middleware to ingestion pipeline
async function ingestLog(entry: LogEntry): Promise<void> {
  // 1. Parse log entry
  const parsed = parseLogEntry(entry);
  
  // 2. Apply PII masking
  const masked = await maskPII(parsed);
  
  // 3. Store masked log
  await storageEngine.insert(masked);
}
```
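The pipeline above does not say what happens if masking itself throws. One option is to fail closed, so an unmasked entry never reaches storage. The sketch below assumes a `LogEntry` with a single `message` field; the `maskOrWithhold` name and the placeholder text are hypothetical.

```typescript
// Assumed minimal entry shape; the real LogEntry likely has more fields.
interface LogEntry {
  message: string;
}

// Hypothetical wrapper: if masking fails, store a placeholder instead of the
// raw message, so a masking bug cannot leak PII into storage.
function maskOrWithhold(
  entry: LogEntry,
  mask: (message: string) => string,
): LogEntry {
  try {
    return { ...entry, message: mask(entry.message) };
  } catch (_err) {
    return { ...entry, message: '[WITHHELD: PII masking failed]' };
  }
}

const masked = maskOrWithhold(
  { message: 'login failed for a@b.co' },
  (m) => m.replace('a@b.co', '[REDACTED_EMAIL]'));

const withheld = maskOrWithhold(
  { message: 'card 4532-1234-5678-9014' },
  () => { throw new Error('masker crashed'); });
```

The trade-off is losing the log line when the masker misbehaves; whether that is acceptable is a product decision this issue leaves open.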

**2. PII detection engine**
```typescript
import * as crypto from 'node:crypto';

interface PIIPattern {
  type: string;
  regex: RegExp;
  action: 'mask' | 'redact' | 'hash';
  luhnCheck?: boolean;
  contextRegex?: RegExp;
}

class PIIMasker {
  private patterns: PIIPattern[];
  
  constructor(config: PIIMaskingConfig) {
    this.patterns = this.compilePatterns(config.patterns);
  }
  
  maskMessage(message: string): string {
    let masked = message;
    
    for (const pattern of this.patterns) {
      // Apply context filter if specified
      if (pattern.contextRegex && !pattern.contextRegex.test(masked)) {
        continue;
      }
      
      masked = masked.replace(pattern.regex, (match) => {
        // Luhn check for credit cards
        if (pattern.luhnCheck && !this.passesLuhnCheck(match)) {
          return match; // Not a valid card number, skip
        }
        
        return this.applyMasking(match, pattern);
      });
    }
    
    return masked;
  }
  
  private applyMasking(value: string, pattern: PIIPattern): string {
    switch (pattern.action) {
      case 'mask':
        return this.partialMask(value, pattern.type);
      case 'redact':
        return `[REDACTED_${pattern.type.toUpperCase()}]`;
      case 'hash':
        return `${pattern.type}_${this.hash(value)}`;
    }
  }
  
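  private partialMask(value: string, type: string): string {
    if (type === 'email') {
      const [local, domain] = value.split('@');
      const maskedLocal = local[0] + '***';
      // Mask every domain label except the TLD: example.com → e***.com
      const labels = domain.split('.');
      const tld = labels.pop();
      const maskedDomain = [...labels.map(part => part[0] + '***'), tld].join('.');
      return `${maskedLocal}@${maskedDomain}`;
    }
    
    if (type === 'credit_card') {
      // Mask every digit except the last 4 and keep separators:
      // 4532-1234-5678-9014 → ****-****-****-9014
      return value.replace(/\d(?=(?:\D*\d){4})/g, '*');
    }
    
    // Generic masking: keep only the last 3 characters
    return value.replace(/.(?=.{3})/g, '*');
  }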
  
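  private hash(value: string): string {
    const key = process.env.PII_HASH_SALT;
    if (!key) {
      // Fail loudly: an unsalted hash of an IP or phone number is easy to brute-force
      throw new Error('PII_HASH_SALT must be set to use the hash action');
    }
    // Keyed hash (HMAC) instead of concatenating value and salt
    return crypto.createHmac('sha256', key)
      .update(value)
      .digest('hex')
      .substring(0, 16);
  }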
  
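  private passesLuhnCheck(cardNumber: string): boolean {
    const digits = cardNumber.replace(/\D/g, '');
    let sum = 0;
    // Luhn: from the right, double every second digit and sum the digits
    for (let i = 0; i < digits.length; i++) {
      let d = Number(digits[digits.length - 1 - i]);
      if (i % 2 === 1) d = d * 2 > 9 ? d * 2 - 9 : d * 2;
      sum += d;
    }
    return digits.length > 0 && sum % 10 === 0;
  }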
}
```
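The `PIIMasker` constructor above calls `this.compilePatterns`, which is not shown in this issue. A minimal sketch of what it might do follows, assuming config entries are plain objects shaped like the Phase 1 YAML. `PatternConfig` and `CompiledPattern` are hypothetical names (`CompiledPattern` mirrors `PIIPattern`), and it is written as a free function here so it stands alone.

```typescript
// Shape of one entry under `pii_masking.patterns` in the Phase 1 YAML.
interface PatternConfig {
  type: string;
  regex: string;
  action: 'mask' | 'redact' | 'hash';
  luhn_check?: boolean;
  context?: string;
  enabled?: boolean;
}

// Mirrors the PIIPattern interface above.
interface CompiledPattern {
  type: string;
  regex: RegExp;
  action: 'mask' | 'redact' | 'hash';
  luhnCheck: boolean;
  contextRegex?: RegExp;
}

function compilePatterns(configs: PatternConfig[]): CompiledPattern[] {
  return configs
    // Patterns are active unless explicitly disabled (see ip_address).
    .filter((c) => c.enabled !== false)
    .map((c) => ({
      type: c.type,
      // `g` so String.prototype.replace rewrites every occurrence, not just the first.
      regex: new RegExp(c.regex, 'g'),
      action: c.action,
      luhnCheck: c.luhn_check === true,
      // Non-global on purpose: `test` on a global regex keeps state between calls.
      contextRegex: c.context ? new RegExp(c.context, 'i') : undefined,
    }));
}

const compiled = compilePatterns([
  { type: 'email', regex: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', action: 'mask' },
  { type: 'ip_address', regex: '\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b', action: 'hash', enabled: false },
]);

const replacedAll = 'a@b.co and c@d.io'.replace(compiled[0].regex, 'X');
```

Without the `g` flag, `maskMessage` would only mask the first match in each message, which is easy to miss in testing.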

**3. Structured data handling**
```typescript
function maskStructuredLog(data: unknown, masker: PIIMasker): unknown {
  if (typeof data === 'string') {
    return masker.maskMessage(data);
  }
  
  if (Array.isArray(data)) {
    return data.map(item => maskStructuredLog(item, masker));
  }
  
  if (typeof data === 'object' && data !== null) {
    const masked: Record<string, unknown> = {};
    
    for (const [key, value] of Object.entries(data)) {
      // Check if field name suggests PII
      const isPIIField = /email|password|ssn|card|cvv|phone|secret|token/i.test(key);
      
      if (isPIIField && typeof value === 'string') {
        // Try pattern-based masking first; if no pattern fired (e.g. a password),
        // redact the whole value because the field name alone marks it sensitive
        const result = masker.maskMessage(value);
        masked[key] = result !== value ? result : '[REDACTED]';
      } else {
        masked[key] = maskStructuredLog(value, masker);
      }
    }
    
    return masked;
  }
  
  return data;
}
```

**4. Configuration UI**
```tsx
import { useState } from 'react';

// PII Masking settings page
function PIIMaskingSettings() {
  const [patterns, setPatterns] = useState<PIIPattern[]>([]);
  const [testLog, setTestLog] = useState('');
  const [maskedPreview, setMaskedPreview] = useState('');
  
  function addPattern(pattern: PIIPattern) {
    setPatterns([...patterns, pattern]);
  }
  
  // Assumes PatternList reports the index of the pattern to remove
  function removePattern(index: number) {
    setPatterns(patterns.filter((_, i) => i !== index));
  }
  
  function testMasking() {
    const masked = new PIIMasker({ patterns }).maskMessage(testLog);
    setMaskedPreview(masked);
  }
  
  return (
    <div>
      <h2>PII Masking Configuration</h2>
      
      <PatternList 
        patterns={patterns}
        onAdd={addPattern}
        onRemove={removePattern}
      />
      
      <TestPanel>
        <label>Test Input:</label>
        <textarea 
          value={testLog}
          onChange={(e) => setTestLog(e.target.value)}
          placeholder="Paste a log entry to test masking..."
        />
        
        <button onClick={testMasking}>Test Masking</button>
        
        <label>Masked Output:</label>
        <pre>{maskedPreview}</pre>
      </TestPanel>
    </div>
  );
}
```

**5. Performance optimization**
```typescript
// Cache compiled regexes
const regexCache = new Map<string, RegExp>();

// Batch processing for high throughput.
// `source` is assumed to carry a per-source `skipPIIMasking` flag.
async function maskBatch(entries: LogEntry[], source: Source): Promise<LogEntry[]> {
  // Skip masking for non-sensitive sources
  if (source.skipPIIMasking) {
    return entries; // Trust internal logs, skip expensive regex
  }

  return Promise.all(entries.map(entry => maskEntry(entry)));
}
```
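Another way to skip expensive regex work is a cheap pre-check per message. The sketch below assumes only the Phase 1 patterns are active: every one of them needs either an `@`, a digit, or a run of 32+ key-like characters. `MIGHT_CONTAIN_PII` and `needsMasking` are hypothetical names.

```typescript
// One cheap test implied by every Phase 1 pattern: an email needs "@";
// cards, phones, SSNs and IPs need a digit; an API key needs a 32+ character run.
// Non-global, so `test` carries no state between calls.
const MIGHT_CONTAIN_PII = /[@\d]|[A-Za-z0-9_-]{32,}/;

// Returns false only when no Phase 1 pattern could possibly match.
function needsMasking(message: string): boolean {
  return MIGHT_CONTAIN_PII.test(message);
}

const plain = needsMasking('Server started');
const withEmail = needsMasking('login failed for user@example.com');
const withDigit = needsMasking('retrying in 5s');
```

This shortcut is only safe while every active pattern implies the pre-check. A custom pattern such as `CUS-[A-Z0-9]{8}` can match text with no digits, so the pre-check would have to be rebuilt or disabled once custom patterns are configured. Whether it pays off at all should be measured, not assumed.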

**Database schema:**
```sql
-- Track masking metadata
CREATE TABLE pii_masking_stats (
  date DATE NOT NULL,
  pattern_type VARCHAR(50),
  occurrences INTEGER,
  source_id UUID REFERENCES sources(id),
  
  PRIMARY KEY (date, pattern_type, source_id)
);

-- For compliance auditing
CREATE TABLE pii_masking_audit (
  id UUID PRIMARY KEY,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  pattern_type VARCHAR(50),
  action VARCHAR(20), -- 'mask', 'redact', 'hash'
  source_id UUID,
  log_id UUID
);
```
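Writing one `pii_masking_stats` row per match would be costly at ingestion rates. A sketch of an in-memory counter that aggregates by the table's primary key and is flushed periodically is shown below. `MaskingStats` and `StatsRow` are hypothetical names, and the actual upsert into the table is left out.

```typescript
// One row of pii_masking_stats, keyed by (date, pattern_type, source_id).
interface StatsRow {
  date: string; // YYYY-MM-DD, matches the DATE column
  pattern_type: string;
  source_id: string;
  occurrences: number;
}

// Hypothetical in-memory counter, flushed periodically into pii_masking_stats.
class MaskingStats {
  private counts = new Map<string, StatsRow>();

  // Record one masked match for a pattern type and source.
  record(patternType: string, sourceId: string, when: Date = new Date()): void {
    const date = when.toISOString().slice(0, 10); // UTC calendar day
    const key = `${date}|${patternType}|${sourceId}`;
    const row = this.counts.get(key);
    if (row) {
      row.occurrences += 1;
    } else {
      this.counts.set(key, {
        date,
        pattern_type: patternType,
        source_id: sourceId,
        occurrences: 1,
      });
    }
  }

  // Return the accumulated rows and reset the counter.
  flush(): StatsRow[] {
    const rows = Array.from(this.counts.values());
    this.counts.clear();
    return rows;
  }
}

const stats = new MaskingStats();
const day = new Date('2024-01-15T10:00:00Z');
stats.record('email', 'source-a', day);
stats.record('email', 'source-a', day);
stats.record('credit_card', 'source-a', day);
const rows = stats.flush();
const rowsAfterFlush = stats.flush();
```

Each flushed row would map to an insert that adds `occurrences` to any existing row with the same key.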

## Priority
- [ ] Critical - Blocking my usage of LogTide
- [ ] High - Would significantly improve my workflow
- [x] Medium - Nice to have
- [ ] Low - Minor enhancement

**Rationale:** Essential for **GDPR compliance and enterprise adoption**, but not blocking for most current users. Higher priority for EU market and regulated industries.

## Target Users
- [x] DevOps Engineers (enforce compliance)
- [x] Developers (prevent accidental PII logging)
- [x] Security/SIEM Users (data protection)
- [ ] System Administrators
- [ ] All Users

**Primary beneficiaries:** Organizations that handle user data and must meet GDPR or similar compliance requirements.

**Secondary benefit:** Reduces data breach risk for everyone.

## Additional Context

**Why this is critical for growth:**

**1. GDPR compliance requirement**
```
GDPR Article 5(1)(c): Data minimisation
→ "Personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed"
→ Storing PII in logs conflicts with this unless there is a specific, documented purpose

GDPR fines:
→ Up to €20 million or 4% of total worldwide annual turnover (whichever is higher)
→ Real example: British Airways fined £20M by the UK ICO (2020) for a data breach
```

**2. Market differentiation**
```
Competitors (verify against current vendor docs before publishing):
• Datadog: Sensitive Data Scanner is available, but as a separately priced add-on
• Elastic: Manual configuration (complex, error-prone)
• Splunk: Has PII detection but enterprise-tier only
• Grafana Loki: No built-in PII masking

LogTide advantage:
✓ Built-in, automatic detection
✓ Configurable patterns
✓ Works with any log source
✓ Free tier includes PII masking (not paywalled)
```

**3. Enterprise sales enabler**
```
Common enterprise question: "How do you handle PII?"

Without this feature:
❌ "You'll need to configure client-side masking in your SDKs"
→ Enterprise: "That's not acceptable" (lost deal)

With this feature:
✓ "LogTide automatically detects and masks PII at ingestion"
✓ "Supports GDPR data minimisation out of the box"
✓ "No code changes required"
→ Enterprise: "Perfect, when can we start?" (closed deal)
```

**Real-world impact examples:**

**Example 1: Startup avoids GDPR violation**
```
Scenario: Developer accidentally logs user emails in error messages
Without masking: GDPR violation, potential regulatory fine
With masking: Emails automatically redacted, no violation
```

**Example 2: Security breach damage limitation**
```
Scenario: Attacker gains access to log database
Without masking: Full credit card numbers, emails, SSNs exposed
With masking: Only masked/hashed data visible (far less useful to an attacker)
```

**Marketing angles:**

> "GDPR-compliant by default. LogTide automatically protects sensitive data in your logs."

> "Stop worrying about PII in logs. LogTide masks emails, credit cards, and phone numbers before storage."

**Documentation needs:**
- [ ] PII masking configuration guide
- [ ] GDPR compliance whitepaper
- [ ] Best practices for sensitive data
- [ ] Custom pattern examples
- [ ] Performance impact notes

**Blog post opportunity:**
```
"The Hidden GDPR Risk in Your Logs (And How to Fix It)"
- Explain common PII logging mistakes
- Show GDPR requirements
- Demo LogTide's automatic masking
- Position as privacy-first
```

**Future enhancements:**
- ML-based PII detection (detect new patterns automatically)
- Industry-specific patterns (healthcare, finance)
- Compliance reporting ("X emails masked this month")
- Integration with DLP tools
- Regional pattern variants (EU vs US phone numbers)

## Contribution
- [ ] I would like to work on implementing this feature


