[BUG] PPL Aggregation Function Field Anonymization Issue #4125

@ps48

Problem Summary

PPL query anonymization is inconsistent with SQL query anonymization when handling aggregation function field parameters. Field names inside aggregation functions are preserved in PPL but properly anonymized in SQL, creating a security inconsistency.

Current Behavior

SQL Anonymizer (Correct)

Field names inside aggregation functions are anonymized to identifier:

-- Original Query
SELECT MAX(price) - MIN(price) FROM tickets

-- Anonymized Output
SELECT MAX ( identifier ) - MIN ( identifier ) FROM table

-- Original Query
SELECT SUM(balance) FROM accounts GROUP BY lastname HAVING COUNT(balance) > 2

-- Anonymized Output
SELECT SUM ( identifier ) FROM table GROUP BY identifier HAVING COUNT ( identifier ) > number

PPL Anonymizer (Inconsistent)

Field names inside aggregation functions are preserved as-is:

# Original Query
source=t | stats count(a) by b

# Anonymized Output (fields 'a' and 'b' preserved)
source=t | stats count(a) by b

# Original Query
source=t | stats count(), values(gender), avg(age) by employer

# Anonymized Output (all field names preserved)
source=t | stats count(),values(gender),avg(age) by employer

Impact

Field names can contain sensitive information and should be anonymized for:

  • Security compliance requirements
  • Privacy protection in logs and monitoring
  • Consistent anonymization across SQL and PPL interfaces

The current PPL behavior potentially exposes sensitive field names in logs, monitoring systems, and error messages where anonymized queries are used.

Affected Functions

All PPL aggregation functions are affected, including but not limited to:

  • count(field)
  • sum(field)
  • avg(field)
  • max(field)
  • min(field)
  • values(field)
  • list(field)
  • distinct_count(field)

Root Cause

Location: ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java

The visitField() method returns field names as-is without anonymization:

@Override
public String visitField(Field node, String context) {
  return node.getField().toString(); // Returns actual field name
}

The visitAggregateFunction() method calls visitField() but doesn't provide anonymization context:

@Override
public String visitAggregateFunction(AggregateFunction node, String context) {
  String arg = node.getField().accept(this, context); // Calls visitField() 
  return StringUtils.format("%s(%s)", node.getFuncName(), arg);
}

Expected Behavior

PPL aggregation function field parameters (and, as in the example below, the group-by field) should be anonymized consistently with SQL:

# Original Query
source=t | stats count(a), sum(balance), values(gender) by state

# Expected Anonymized Output  
source=t | stats count(***), sum(***), values(***) by ***

Solution Approach

  1. Modify the visitField() method to return *** when it is called from an aggregation function context
  2. Add context tracking so the visitor knows when it is inside an aggregation function
  3. Update visitAggregateFunction() to set the appropriate context before visiting the field
  4. Update all PPL anonymizer tests to expect anonymized field names in aggregation functions
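
The steps above could look roughly like the following stand-alone sketch. It is not the actual PPLQueryDataAnonymizer code: Node, Field, AggregateFunction, and Visitor are simplified stand-ins for the real AST and visitor classes, and the IN_AGGREGATION marker and MASK constant are hypothetical names chosen for illustration. It assumes the existing String context parameter can carry the marker.

```java
/**
 * Stand-alone sketch of context-aware field masking.
 * All types here are simplified stand-ins, not the real org.opensearch.sql classes.
 */
final class AnonymizerSketch {
  /** Replacement text for masked field names (matches the expected output in this issue). */
  static final String MASK = "***";

  /** Hypothetical marker passed as context while visiting an aggregation argument. */
  static final String IN_AGGREGATION = "IN_AGGREGATION";

  interface Node {
    String accept(Visitor visitor, String context);
  }

  static final class Field implements Node {
    final String name;

    Field(String name) {
      this.name = name;
    }

    @Override
    public String accept(Visitor visitor, String context) {
      return visitor.visitField(this, context);
    }
  }

  static final class AggregateFunction implements Node {
    final String funcName;
    final Node field; // null models an argument-less call such as count()

    AggregateFunction(String funcName, Node field) {
      this.funcName = funcName;
      this.field = field;
    }

    @Override
    public String accept(Visitor visitor, String context) {
      return visitor.visitAggregateFunction(this, context);
    }
  }

  static final class Visitor {
    /** Steps 1 and 2: mask the name only when the aggregation marker is present. */
    String visitField(Field node, String context) {
      return IN_AGGREGATION.equals(context) ? MASK : node.name;
    }

    /** Step 3: set the marker before visiting the argument. */
    String visitAggregateFunction(AggregateFunction node, String context) {
      String arg = node.field == null ? "" : node.field.accept(this, IN_AGGREGATION);
      return String.format("%s(%s)", node.funcName, arg);
    }
  }

  public static void main(String[] args) {
    Visitor visitor = new Visitor();
    Node sum = new AggregateFunction("sum", new Field("balance"));
    System.out.println(sum.accept(visitor, null));                  // prints sum(***)
    System.out.println(new Field("balance").accept(visitor, null)); // prints balance
  }
}
```

In this sketch a field visited outside an aggregation keeps its name, so masking the by field as in the expected output would need the same marker set while visiting group-by expressions, or visitField() masking unconditionally.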

Test Impact

Breaking change: All existing PPL anonymizer tests that use aggregation functions will need their expected outputs updated to use *** instead of actual field names.

Files affected:

  • PPLQueryDataAnonymizerTest.java - Multiple test methods expecting field preservation
  • All integration tests using PPL query anonymization

Code References

  • PPL Anonymizer: ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java:visitField()
  • SQL Anonymizer: sql/src/main/java/org/opensearch/sql/sql/antlr/AnonymizerListener.java (working correctly)
  • PPL Tests: ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java:489-524

Metadata

Labels: PPL (Piped processing language), bug (Something isn't working)

Status: Done