Description
Problem Summary
PPL query anonymization is inconsistent with SQL query anonymization when handling aggregation function field parameters. Field names inside aggregation functions are preserved in PPL but properly anonymized in SQL, creating a security inconsistency.
Current Behavior
SQL Anonymizer (Correct)
Field names inside aggregation functions are anonymized to identifier:
-- Original Query
SELECT MAX(price) - MIN(price) FROM tickets
-- Anonymized Output
SELECT MAX ( identifier ) - MIN ( identifier ) FROM table
-- Original Query
SELECT SUM(balance) FROM accounts GROUP BY lastname HAVING COUNT(balance) > 2
-- Anonymized Output
SELECT SUM ( identifier ) FROM table GROUP BY identifier HAVING COUNT ( identifier ) > number
PPL Anonymizer (Inconsistent)
Field names inside aggregation functions are preserved as-is:
# Original Query
source=t | stats count(a) by b
# Anonymized Output (fields 'a' and 'b' preserved)
source=t | stats count(a) by b
# Original Query
source=t | stats count(), values(gender), avg(age) by employer
# Anonymized Output (all field names preserved)
source=t | stats count(),values(gender),avg(age) by employer
Impact
Field names can contain sensitive information and should be anonymized for:
- Security compliance requirements
- Privacy protection in logs and monitoring
- Consistent anonymization across SQL and PPL interfaces
The current PPL behavior potentially exposes sensitive field names in logs, monitoring systems, and error messages where anonymized queries are used.
Affected Functions
All PPL aggregation functions are affected, including but not limited to:
count(field)
sum(field)
avg(field)
max(field)
min(field)
values(field)
list(field)
distinct_count(field)
Root Cause
Location: ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
The visitField() method returns field names as-is without anonymization:
@Override
public String visitField(Field node, String context) {
  return node.getField().toString(); // Returns actual field name
}
The visitAggregateFunction() method calls visitField() but doesn't provide anonymization context:
@Override
public String visitAggregateFunction(AggregateFunction node, String context) {
  String arg = node.getField().accept(this, context); // Calls visitField()
  return StringUtils.format("%s(%s)", node.getFuncName(), arg);
}
Expected Behavior
PPL aggregation function field parameters should be anonymized consistently with SQL:
# Original Query
source=t | stats count(a), sum(balance), values(gender) by state
# Expected Anonymized Output
source=t | stats count(***), sum(***), values(***) by ***
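The expected output above masks both the aggregation arguments and the group-by field. The following is a minimal, string-level sketch of that target format, useful mainly as an executable statement of the expectation. The class StatsAnonymizationSketch and the method anonymizeStats are hypothetical names made up for this illustration; they are not part of the OpenSearch SQL plugin, and the real anonymizer works on the AST rather than on strings.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the plugin: models the expected output only.
final class StatsAnonymizationSketch {
    static final String MASK = "***";

    // Each aggregation is a {functionName, fieldName} pair; a null fieldName
    // models a call without a field argument, such as count().
    static String anonymizeStats(String source,
                                 List<String[]> aggregations,
                                 List<String> groupBy) {
        String aggs = aggregations.stream()
            .map(a -> a[0] + "(" + (a[1] == null ? "" : MASK) + ")")
            .collect(Collectors.joining(", "));
        // Group-by fields are masked as well, matching the expected output.
        String by = groupBy.isEmpty()
            ? ""
            : " by " + groupBy.stream().map(f -> MASK).collect(Collectors.joining(", "));
        return "source=" + source + " | stats " + aggs + by;
    }
}
```

Note that the function names (count, sum, values) stay visible; only field names are replaced.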
Solution Approach
- Modify visitField() method to return *** when called from aggregation function context
- Add context tracking to know when we're inside an aggregation function
- Update visitAggregateFunction() to set appropriate context before visiting field
- Update all PPL anonymizer tests to expect anonymized field names in aggregation functions
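The context-tracking approach can be sketched with a stripped-down visitor. This is an illustration under stated assumptions, not the plugin's code: Node, Field, AggregateFunction, and Visitor below are simplified stand-ins for the real AST classes, and the AGG_CONTEXT marker value is hypothetical. The only detail taken from the snippets above is that the visitor methods receive a String context.

```java
// Simplified stand-ins for the real AST and visitor; all names are illustrative.
final class AnonymizerSketch {
    static final String MASK = "***";
    // Hypothetical marker passed down while visiting an aggregation argument.
    static final String AGG_CONTEXT = "aggregation";

    interface Node {
        String accept(Visitor visitor, String context);
    }

    static final class Field implements Node {
        final String name;
        Field(String name) { this.name = name; }
        @Override
        public String accept(Visitor visitor, String context) {
            return visitor.visitField(this, context);
        }
    }

    static final class AggregateFunction implements Node {
        final String funcName;
        final Node field;
        AggregateFunction(String funcName, Node field) {
            this.funcName = funcName;
            this.field = field;
        }
        @Override
        public String accept(Visitor visitor, String context) {
            return visitor.visitAggregateFunction(this, context);
        }
    }

    static final class Visitor {
        // Mask the field name only when the caller signals aggregation context.
        String visitField(Field node, String context) {
            return AGG_CONTEXT.equals(context) ? MASK : node.name;
        }

        // Replace the incoming context before descending into the argument.
        String visitAggregateFunction(AggregateFunction node, String context) {
            String arg = node.field.accept(this, AGG_CONTEXT);
            return String.format("%s(%s)", node.funcName, arg);
        }
    }

    static String anonymize(Node node) {
        return node.accept(new Visitor(), null);
    }
}
```

In this sketch a field visited outside an aggregation is still returned as-is. The expected output also masks the group-by field, so the actual fix may need to mask fields in more contexts than this one.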
Test Impact
Breaking change: All existing PPL anonymizer tests that use aggregation functions will need their expected outputs updated to use *** instead of actual field names.
Files affected:
- PPLQueryDataAnonymizerTest.java: multiple test methods expecting field preservation
- All integration tests using PPL query anonymization
Code References
- PPL Anonymizer: ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java:visitField()
- SQL Anonymizer: sql/src/main/java/org/opensearch/sql/sql/antlr/AnonymizerListener.java (working correctly)
- PPL Tests: ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java:489-524