[Feature] PII Masking Rules at Ingestion #92

@Polliog

Feature Description

Automatically detect and mask Personally Identifiable Information (PII) in log entries before storage, ensuring GDPR compliance and reducing data breach risk. This feature provides configurable patterns for common PII types (emails, credit cards, phone numbers, IPs) with multiple masking strategies (mask, hash, redact).

Problem/Use Case

Current problem:

  • Developers accidentally log sensitive data (emails, credit cards, passwords)
  • Once logged, PII is stored permanently and visible to anyone with log access
  • GDPR violations can result in large fines (up to €20 million or 4% of total worldwide annual turnover, whichever is higher)
  • Security audits flag PII in logs as high-risk
  • Manual PII removal is time-consuming and error-prone
  • Compliance teams demand "no PII in logs" policies

Real-world scenarios:

Scenario 1: Accidental credit card logging

// Developer logs entire request for debugging
logger.info('Payment request:', JSON.stringify(req.body));
// → Logs credit card number, CVV, everything

Without PII masking:
{"card_number": "4532-1234-5678-9010", "cvv": "123"}

With PII masking:
{"card_number": "****-****-****-9010", "cvv": "***"}

Scenario 2: Email addresses in errors

Error: Invalid email format for user@example.com
→ Email visible to all log viewers

With PII masking:
Error: Invalid email format for u***@e***.com

Scenario 3: GDPR "right to be forgotten"

User requests data deletion
→ Must scrub their email from ALL logs (nightmare!)

With PII masking:
→ Email was never stored, already masked

Proposed Solution

Implement configurable PII detection and masking at log ingestion:

Phase 1: Common PII patterns

pii_masking:
  enabled: true
  
  patterns:
    - type: email
      regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
      action: mask  # user@example.com → u***@e***.com
      
    - type: credit_card
      regex: "\\b(?:\\d{4}[- ]?){3}\\d{4}\\b"
      action: mask  # 4532-1234-5678-9010 → ****-****-****-9010
      luhn_check: true  # Only match numbers that pass the Luhn checksum (the sample number used here is illustrative and does not)
      
    - type: phone_number
      regex: "\\b(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b"
      action: mask  # +1-555-123-4567 → +1-555-***-****
      
    - type: ssn
      regex: "\\b(?!000|666|9\\d{2})\\d{3}-?(?!00)\\d{2}-?(?!0000)\\d{4}\\b"
      action: redact  # 123-45-6789 → [REDACTED_SSN]
      
    - type: ip_address
      regex: "\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b"
      action: hash  # 192.168.1.100 → ip_a3f5b8c9
      enabled: false  # Optional, disabled by default
      
    - type: api_key
      regex: "\\b[A-Za-z0-9_-]{32,}\\b"
      context: "(?:api[_-]?key|token|secret|password)"  # Only match near these words
      action: redact  # sk_live_abc123... → [REDACTED_API_KEY]
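
The YAML above stores each regex as a string, so it has to be compiled before it can be applied. The following is a minimal sketch of that step, assuming the parsed config has the shape shown in the YAML. The `PatternConfig` and `CompiledPattern` types are illustrative names, and `compilePatterns` is referenced by the `PIIMasker` constructor later in this issue but not defined there, so this body is an assumption about what it could do.

```typescript
// Assumed shape of one entry under `pii_masking.patterns` after YAML parsing.
interface PatternConfig {
  type: string;
  regex: string;
  action: 'mask' | 'redact' | 'hash';
  enabled?: boolean; // treated as true when omitted
  luhn_check?: boolean;
  context?: string;
}

interface CompiledPattern {
  type: string;
  regex: RegExp;
  action: 'mask' | 'redact' | 'hash';
  luhnCheck: boolean;
  contextRegex?: RegExp;
}

// Drop disabled patterns and compile the rest.
// The 'g' flag is needed so that replace() masks every occurrence in a message.
function compilePatterns(configs: PatternConfig[]): CompiledPattern[] {
  return configs
    .filter((c) => c.enabled !== false)
    .map((c) => ({
      type: c.type,
      regex: new RegExp(c.regex, 'g'),
      action: c.action,
      luhnCheck: c.luhn_check === true,
      contextRegex: c.context ? new RegExp(c.context, 'i') : undefined,
    }));
}

const compiled = compilePatterns([
  { type: 'email', regex: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', action: 'mask' },
  { type: 'ip_address', regex: '\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b', action: 'hash', enabled: false },
]);
// Only the email pattern is compiled, because ip_address is disabled.
```

Compiling once at startup (or on config change) keeps regex construction out of the per-log hot path.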

Masking strategies:

  1. mask: Partial masking, keep some characters for debugging

    user@example.com → u***@e***.com
    4532-1234-5678-9010 → ****-****-****-9010
    
  2. redact: Complete removal

    user@example.com → [REDACTED_EMAIL]
    123-45-6789 → [REDACTED_SSN]
    
  3. hash: One-way salted hash (values can be correlated but not read back; low-entropy values such as IPs remain guessable if the salt leaks)

    user@example.com → email_a3f5b8c9d1e2f3a4
    192.168.1.100 → ip_b4c6d8e0f2a1b3c5
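
The phone number example in the config (`+1-555-123-4567 → +1-555-***-****`) keeps the leading digits and separators, which a generic "keep the last N characters" mask would not produce. One possible way to get that output is sketched below; `maskTrailingDigits` is a hypothetical helper, not part of the proposal.

```typescript
// Hypothetical helper: mask the last `count` digits, leaving separators and
// any leading digits (country code, area code) untouched.
function maskTrailingDigits(value: string, count: number): string {
  const chars = value.split('');
  let remaining = count;
  // Walk from the end so only the trailing digits are replaced.
  for (let i = chars.length - 1; i >= 0 && remaining > 0; i--) {
    const ch = chars[i];
    if (ch !== undefined && ch >= '0' && ch <= '9') {
      chars[i] = '*';
      remaining--;
    }
  }
  return chars.join('');
}

const maskedPhone = maskTrailingDigits('+1-555-123-4567', 7);
// maskedPhone is '+1-555-***-****'
```

Counting digits rather than characters means the same helper works for `(555) 123-4567` and other separator styles.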
    

Phase 2: Structured field masking

// For structured logs (JSON)
{
  "user": {
    "email": "user@example.com",  // ← Masked
    "id": "user_123",
    "ip": "192.168.1.100"  // ← Optionally masked
  },
  "payment": {
    "card_number": "4532-1234-5678-9010",  // ← Masked
    "amount": 49.99
  }
}

// After masking:
{
  "user": {
    "email": "u***@e***.com",
    "id": "user_123",
    "ip": "192.168.1.100"
  },
  "payment": {
    "card_number": "****-****-****-9010",
    "amount": 49.99
  }
}

Phase 3: Custom patterns

# User-defined patterns for company-specific PII
custom_patterns:
  - name: "Internal Employee ID"
    regex: "EMP-[0-9]{6}"
    action: hash
    
  - name: "Customer Reference"
    regex: "CUS-[A-Z0-9]{8}"
    action: mask
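
Because custom patterns come from users, they probably need validation before they reach the ingestion path. The sketch below shows one possible check, assuming the config shape above; `CustomPatternConfig`, `ValidationResult`, and `validateCustomPattern` are hypothetical names. It rejects regexes that fail to compile and regexes that match the empty string (which would fire at every position in a message). It does not attempt to detect catastrophic backtracking.

```typescript
interface CustomPatternConfig {
  name: string;
  regex: string;
  action: 'mask' | 'redact' | 'hash';
}

type ValidationResult =
  | { ok: true; regex: RegExp }
  | { ok: false; error: string };

function validateCustomPattern(config: CustomPatternConfig): ValidationResult {
  let regex: RegExp;
  try {
    // The RegExp constructor throws a SyntaxError on malformed patterns.
    regex = new RegExp(config.regex, 'g');
  } catch (err) {
    return { ok: false, error: `"${config.name}": invalid regex (${(err as Error).message})` };
  }

  // A pattern that matches '' would produce a replacement between every character.
  if (new RegExp(config.regex).test('')) {
    return { ok: false, error: `"${config.name}": pattern matches the empty string` };
  }

  return { ok: true, regex };
}

const employeeId = validateCustomPattern({
  name: 'Internal Employee ID',
  regex: 'EMP-[0-9]{6}',
  action: 'hash',
});
```

Running the same validation in the settings UI and at config load would give users early feedback and keep a bad pattern from breaking ingestion.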

Alternatives Considered

  1. Client-side masking (in SDKs)

    • ✗ Can't enforce (devs forget to use it)
    • ✗ Doesn't help with syslog, OTLP, or raw logs
    • ✓ Could complement server-side masking
  2. Post-processing masking (after storage)

    • ✗ PII already stored (compliance violation)
    • ✗ Can't fully delete (backups, replicas)
    • ✗ Doesn't prevent data breaches
  3. No masking, rely on access controls

    • ✗ Doesn't solve accidental logging
    • ✗ Doesn't help with data breaches
    • ✗ Doesn't satisfy GDPR requirements
  4. Manual review before logging

    • ✗ Impossible at scale
    • ✗ Human error inevitable
    • ✗ Slows down development

Chosen approach: Automatic detection and masking at ingestion (before storage)

Implementation Details (Optional)

Technical implementation:

1. Ingestion pipeline integration

// Add PII masking middleware to ingestion pipeline
async function ingestLog(entry: LogEntry): Promise<void> {
  // 1. Parse log entry
  const parsed = parseLogEntry(entry);
  
  // 2. Apply PII masking
  const masked = await maskPII(parsed);
  
  // 3. Store masked log
  await storageEngine.insert(masked);
}

2. PII detection engine

import * as crypto from 'node:crypto';

interface PIIPattern {
  type: string;
  regex: RegExp;
  action: 'mask' | 'redact' | 'hash';
  luhnCheck?: boolean;
  contextRegex?: RegExp;
}

class PIIMasker {
  private patterns: PIIPattern[];
  
  constructor(config: PIIMaskingConfig) {
    this.patterns = this.compilePatterns(config.patterns);
  }
  
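  maskMessage(message: string): string {
    let masked = message;
    
    for (const pattern of this.patterns) {
      // Apply context filter if specified.
      // Note: this checks the whole message, not proximity to the match.
      if (pattern.contextRegex) {
        pattern.contextRegex.lastIndex = 0; // test() is stateful on global regexes
        if (!pattern.contextRegex.test(masked)) {
          continue;
        }
      }
      
      // pattern.regex must be compiled with the 'g' flag,
      // otherwise only the first occurrence is replaced
      masked = masked.replace(pattern.regex, (match) => {
        // Luhn check for credit cards
        if (pattern.luhnCheck && !this.passesLuhnCheck(match)) {
          return match; // Not a valid card number, skip
        }
        
        return this.applyMasking(match, pattern);
      });
    }
    
    return masked;
  }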
  
  private applyMasking(value: string, pattern: PIIPattern): string {
    switch (pattern.action) {
      case 'mask':
        return this.partialMask(value, pattern.type);
      case 'redact':
        return `[REDACTED_${pattern.type.toUpperCase()}]`;
      case 'hash':
        return `${pattern.type}_${this.hash(value)}`;
    }
  }
  
  private partialMask(value: string, type: string): string {
    if (type === 'email') {
      const [local, domain] = value.split('@');
      // Mask every domain label except the TLD: example.com → e***.com
      const labels = domain.split('.');
      const tld = labels.pop();
      const maskedDomain = [...labels.map(part => part[0] + '***'), tld].join('.');
      return `${local[0]}***@${maskedDomain}`;
    }
    
    if (type === 'credit_card') {
      // Mask every digit except the last 4, keeping separators
      return value.replace(/\d(?=(?:\D*\d){4})/g, '*');
    }
    
    // Generic masking: keep only the last 3 characters
    return value.replace(/.(?=.{3})/g, '*');
  }
  
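  private hash(value: string): string {
    const salt = process.env.PII_HASH_SALT;
    if (!salt) {
      // Without a salt, hashes of low-entropy values (IPs, phone numbers) can be brute-forced
      throw new Error('PII_HASH_SALT must be set');
    }
    return crypto.createHash('sha256')
      .update(value + salt)
      .digest('hex')
      .substring(0, 16);
  }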
  
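  private passesLuhnCheck(cardNumber: string): boolean {
    const digits = cardNumber.replace(/\D/g, '');
    let sum = 0;
    // Luhn: from the right, double every second digit (subtract 9 if above 9) and sum
    for (let i = 0; i < digits.length; i++) {
      let digit = Number(digits[digits.length - 1 - i]);
      if (i % 2 === 1) digit = digit * 2 > 9 ? digit * 2 - 9 : digit * 2;
      sum += digit;
    }
    return digits.length > 0 && sum % 10 === 0;
  }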
}
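
The `api_key` pattern in the config is meant to match only near words such as `token` or `secret`, whereas a whole-message context test also fires when the keyword appears far from the candidate. A proximity-based check could look like the sketch below. `hasContextNear`, `redactNearContext`, and the 40-character window are assumptions made for illustration.

```typescript
// Hypothetical helper: is a context keyword within `windowSize` characters of the match?
function hasContextNear(
  message: string,
  matchIndex: number,
  matchLength: number,
  context: RegExp,
  windowSize = 40,
): boolean {
  const matchEnd = matchIndex + matchLength;
  // Inspect only the text around the match, not the match itself.
  const before = message.slice(Math.max(0, matchIndex - windowSize), matchIndex);
  const after = message.slice(matchEnd, matchEnd + windowSize);
  // Rebuild the regex without g/y flags so test() carries no lastIndex state.
  const probe = new RegExp(context.source, context.flags.replace(/[gy]/g, ''));
  return probe.test(before) || probe.test(after);
}

// Replace candidate matches only when a context keyword appears nearby.
function redactNearContext(
  message: string,
  candidate: RegExp, // must carry the 'g' flag to replace every occurrence
  context: RegExp,
  replacement: string,
): string {
  return message.replace(candidate, (match: string, ...args: unknown[]) => {
    // After any capture groups, the first numeric argument is the match offset.
    const offset = args.find((a): a is number => typeof a === 'number') ?? 0;
    return hasContextNear(message, offset, match.length, context) ? replacement : match;
  });
}

const apiKeyRegex = /\b[A-Za-z0-9_-]{32,}\b/g;
const contextRegex = /(?:api[_-]?key|token|secret|password)/i;
const sampleKey = 'A1b2C3d4'.repeat(4); // 32 characters
const redacted = redactNearContext(
  `api_key=${sampleKey}`,
  apiKeyRegex,
  contextRegex,
  '[REDACTED_API_KEY]',
);
// redacted is 'api_key=[REDACTED_API_KEY]'
```

With this approach a 32-character commit hash in a message that has no keyword nearby is left alone, which should reduce false positives for the broad `api_key` regex.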

3. Structured data handling

function maskStructuredLog(data: unknown, masker: PIIMasker): unknown {
  if (typeof data === 'string') {
    return masker.maskMessage(data);
  }
  
  if (Array.isArray(data)) {
    return data.map(item => maskStructuredLog(item, masker));
  }
  
  if (typeof data === 'object' && data !== null) {
    const masked: Record<string, unknown> = {};
    
    for (const [key, value] of Object.entries(data)) {
      // Check if field name suggests PII
      const isPIIField = /email|password|ssn|card|cvv|phone|secret|token/i.test(key);
      
      if (isPIIField && typeof value === 'string') {
        // Values that no pattern recognizes (e.g. a password) are still redacted,
        // because the field name marks them as sensitive
        const result = masker.maskMessage(value);
        masked[key] = result === value ? '[REDACTED]' : result;
      } else {
        masked[key] = maskStructuredLog(value, masker);
      }
    }
    
    return masked;
  }
  
  return data;
}

4. Configuration UI

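// PII Masking settings page
// PatternList and TestPanel are presentational components assumed to exist elsewhere
import { useState } from 'react';

function PIIMaskingSettings() {
  const [patterns, setPatterns] = useState<PIIPattern[]>([]);
  const [testLog, setTestLog] = useState('');
  const [maskedPreview, setMaskedPreview] = useState('');
  
  function addPattern(pattern: PIIPattern) {
    setPatterns([...patterns, pattern]);
  }
  
  // Remove a pattern by position (assumes PatternList reports the index)
  function removePattern(index: number) {
    setPatterns(patterns.filter((_, i) => i !== index));
  }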
  
  function testMasking() {
    const masked = new PIIMasker({ patterns }).maskMessage(testLog);
    setMaskedPreview(masked);
  }
  
  return (
    <div>
      <h2>PII Masking Configuration</h2>
      
      <PatternList 
        patterns={patterns}
        onAdd={addPattern}
        onRemove={removePattern}
      />
      
      <TestPanel>
        <label>Test Input:</label>
        <textarea 
          value={testLog}
          onChange={(e) => setTestLog(e.target.value)}
          placeholder="Paste a log entry to test masking..."
        />
        
        <button onClick={testMasking}>Test Masking</button>
        
        <label>Masked Output:</label>
        <pre>{maskedPreview}</pre>
      </TestPanel>
    </div>
  );
}

5. Performance optimization

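// Cache compiled regexes so each pattern source is compiled only once
const regexCache = new Map<string, RegExp>();

function getCachedRegex(source: string): RegExp {
  let regex = regexCache.get(source);
  if (!regex) {
    regex = new RegExp(source, 'g');
    regexCache.set(source, regex);
  }
  return regex;
}

// Batch processing for high throughput
async function maskBatch(entries: LogEntry[]): Promise<LogEntry[]> {
  return Promise.all(entries.map(entry => maskEntry(entry)));
}

// Inside maskEntry: skip masking for sources explicitly marked as trusted
if (source.skipPIIMasking) {
  return entry; // Trust internal logs, skip expensive regex
}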

Database schema:

-- Track masking metadata
CREATE TABLE pii_masking_stats (
  date DATE NOT NULL,
  pattern_type VARCHAR(50),
  occurrences INTEGER,
  source_id UUID REFERENCES sources(id),
  
  PRIMARY KEY (date, pattern_type, source_id)
);

-- For compliance auditing
CREATE TABLE pii_masking_audit (
  id UUID PRIMARY KEY,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  pattern_type VARCHAR(50),
  action VARCHAR(20), -- 'mask', 'redact', 'hash'
  source_id UUID,
  log_id UUID
);
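
Writing one `pii_masking_stats` row per masked value would add a database write to every match. One option is to count matches in memory and flush them periodically. The sketch below assumes that approach; `MaskingStatsBuffer` and `StatsRow` are hypothetical names, and dates are bucketed in UTC.

```typescript
// One aggregated row, mirroring the columns of pii_masking_stats.
interface StatsRow {
  date: string; // YYYY-MM-DD (UTC)
  patternType: string;
  sourceId: string;
  occurrences: number;
}

class MaskingStatsBuffer {
  private counts = new Map<string, StatsRow>();

  // Called once per masked value; only touches memory.
  record(patternType: string, sourceId: string, when: Date = new Date()): void {
    const date = when.toISOString().slice(0, 10);
    const key = `${date}|${patternType}|${sourceId}`;
    const row = this.counts.get(key);
    if (row) {
      row.occurrences += 1;
    } else {
      this.counts.set(key, { date, patternType, sourceId, occurrences: 1 });
    }
  }

  // Returns the aggregated rows and resets the buffer; the caller persists them.
  drain(): StatsRow[] {
    const rows = Array.from(this.counts.values());
    this.counts.clear();
    return rows;
  }
}
```

Since the table's primary key is `(date, pattern_type, source_id)`, drained rows could be written with an upsert that adds to the existing `occurrences` value (for example PostgreSQL's `INSERT ... ON CONFLICT ... DO UPDATE`), so repeated flushes on the same day accumulate rather than conflict.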

Priority

  • Critical - Blocking my usage of LogTide
  • High - Would significantly improve my workflow
  • Medium - Nice to have
  • Low - Minor enhancement

Rationale: Essential for GDPR compliance and enterprise adoption, but not blocking for most current users. Higher priority for EU market and regulated industries.

Target Users

  • DevOps Engineers (enforce compliance)
  • Developers (prevent accidental PII logging)
  • Security/SIEM Users (data protection)
  • System Administrators
  • All Users

Primary benefit: Organizations that handle user data and need GDPR/compliance guarantees.

Secondary benefit: Reduces data breach risk for everyone.

Additional Context

Why this is critical for growth:

1. GDPR compliance requirement

GDPR Article 5(1)(c): Data minimisation
→ "Personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed"
→ Storing PII in logs conflicts with this unless there is a specific, documented purpose

GDPR fines:
→ Up to €20 million or 4% of total worldwide annual turnover (whichever is higher)
→ Real example: British Airways fined £20M by the UK ICO in 2020 for a data breach

2. Market differentiation

Competitors:
• Datadog: Client-side masking only (can be bypassed)
• Elastic: Manual configuration (complex, error-prone)
• Splunk: Has PII detection but enterprise-tier only
• Grafana Loki: No built-in PII masking

Logtide advantage:
✓ Built-in, automatic detection
✓ Configurable patterns
✓ Works with any log source
✓ Free tier includes PII masking (not paywalled)

3. Enterprise sales enabler

Common enterprise question: "How do you handle PII?"

Without this feature:
❌ "You'll need to configure client-side masking in your SDKs"
→ Enterprise: "That's not acceptable" (lost deal)

With this feature:
✓ "Logtide automatically detects and masks PII at ingestion"
✓ "GDPR-compliant out of the box"
✓ "No code changes required"
→ Enterprise: "Perfect, when can we start?" (closed deal)

Real-world impact examples:

Example 1: Startup avoids GDPR violation

Scenario: Developer accidentally logs user emails in error messages
Without masking: GDPR violation, potential €50k fine
With masking: Emails automatically redacted, no violation

Example 2: Security breach damage limitation

Scenario: Attacker gains access to log database
Without masking: Full credit card numbers, emails, SSNs exposed
With masking: Only masked/hashed data visible (far less useful to an attacker)

Marketing angles:

"GDPR-compliant by default. Logtide automatically protects sensitive data in your logs."

"Stop worrying about PII in logs. Logtide masks emails, credit cards, and phone numbers before storage."

Documentation needs:

  • PII masking configuration guide
  • GDPR compliance whitepaper
  • Best practices for sensitive data
  • Custom pattern examples
  • Performance impact notes

Blog post opportunity:

"The Hidden GDPR Risk in Your Logs (And How to Fix It)"
- Explain common PII logging mistakes
- Show GDPR requirements
- Demo Logtide's automatic masking
- Position as privacy-first

Future enhancements:

  • ML-based PII detection (detect new patterns automatically)
  • Industry-specific patterns (healthcare, finance)
  • Compliance reporting ("X emails masked this month")
  • Integration with DLP tools
  • Regional pattern variants (EU vs US phone numbers)

Contribution

  • I would like to work on implementing this feature
