-
Notifications
You must be signed in to change notification settings - Fork 181
Description
Query Information
PPL Command/Query:
# This succeeds
source=ocsf-1.1.0-4003 | dedup accountid, activity_name
# This fails with an error
source=ocsf-1.1.0-4003 | dedup accountid, src_endpoint.ip
Expected Result:
The query should return deduplicated results based on both accountid and src_endpoint.ip fields, similar to how the first query successfully deduplicates on flat fields.
Actual Result:
The query fails with an internal server error:
{
"error": {
"reason": "There was internal problem at backend",
"details": "java.sql.SQLException: Error while preparing plan ...",
"type": "RuntimeException"
},
"status": 500
}Root Cause (from logs):
Caused by: org.codehaus.commons.compiler.CompileException: Line 3, Column 26: ';' expected instead of '.'
Dataset Information
Dataset/Schema Type
- Open Cybersecurity Schema Framework (OCSF)
- Custom (minimal reproduction below)
Index Mapping
{
"mappings": {
"properties": {
"account_id": {
"type": "keyword"
},
"endpoint": {
"properties": {
"ip": {
"type": "ip"
},
"port": {
"type": "integer"
}
}
}
}
}
}Sample Data
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}Bug Description
Issue Summary:
The dedup command fails with a Java error when deduplicating on struct/object field paths (e.g., endpoint.ip). The error occurs during physical plan generation when Calcite attempts to compile Java code for the window function used internally by dedup.
Steps to Reproduce:
# 1. Create test index with struct field
curl -X PUT "http://localhost:9200/test-dedup-struct" -H 'Content-Type: application/json' -d '{
"mappings": {
"properties": {
"account_id": {"type": "keyword"},
"endpoint": {
"properties": {
"ip": {"type": "ip"},
"port": {"type": "integer"}
}
}
}
}
}'
# 2. Insert test data
curl -X POST "http://localhost:9200/test-dedup-struct/_bulk" -H 'Content-Type: application/json' -d '
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"index":{}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}
'
curl -X POST "http://localhost:9200/test-dedup-struct/_refresh"
# 3. Test dedup with flat field (works)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
"query": "source=test-dedup-struct | dedup account_id"
}'
# Returns 2 results
# 4. Test dedup with struct field (fails)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
"query": "source=test-dedup-struct | dedup account_id, endpoint.ip"
}'
# Returns compilation errorImpact:
This bug prevents users from deduplicating on any struct/object field paths, which is critical for OCSF, OTEL, and other structured observability data that heavily uses nested structures.
Environment Information
OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (also affects 3.3.0)
Additional Details:
- Affects Calcite query engine (V3)
- Issue occurs with all struct field types (ip, keyword, integer, etc.)
- Not specific to nested type - affects regular object/struct fields
Root Cause Analysis
Tentative Root Cause:
Note: This is a preliminary analysis and requires further investigation.
The dedup command internally uses Apache Calcite's window functions (ROW_NUMBER() OVER PARTITION BY) to implement deduplication logic. When struct field paths like endpoint.ip are used in the PARTITION BY clause, Calcite generates Java code that uses the field name directly as a variable identifier.
Code Flow:
-
Logical Plan Generation (
CalciteRelNodeVisitor.java:1409-1428):List<RexNode> dedupeFields = node.getFields().stream().map(f -> rexVisitor.analyze(f, context)).toList(); // ... context.relBuilder.aggregateCall(SqlStdOperatorTable.ROW_NUMBER) .over() .partitionBy(dedupeFields) // Contains RexNode for "endpoint.ip" .orderBy(dedupeFields)
-
Field Resolution (
QualifiedNameResolver.java:250-254):- Struct field paths are resolved using
INTERNAL_ITEMfunction - Creates field references like
$1(forendpoint.ip) - Field names retain dots in metadata
- Struct field paths are resolved using
-
Physical Plan Generation:
- Calcite generates Java code for window function execution
- Uses field names directly from row type metadata
- Generates invalid Java:
int endpoint.ip = ...
-
Compilation Failure:
- Janino compiler sees the dot as field access operator
- Expects semicolon, finds dot instead
- Throws:
';' expected instead of '.'
Verification:
The logical plan shows correct structure:
LogicalProject(...)
LogicalFilter(condition=[<=($10, 1)])
LogicalProject(..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $3, $1 ORDER BY $3, $1)])
LogicalFilter(condition=[AND(IS NOT NULL($3), IS NOT NULL($1))])
CalciteLogicalIndexScan(table=[[OpenSearch, test-dedup-struct]])
But physical plan generation fails because $1 maps to field name endpoint.ip which is invalid in generated Java code.
Workaround
No workaround available. Users cannot use dedup with struct field paths. The only alternative is to:
Related Issues
- [BUG] PPL dedup command returns zero results on nested field paths #4487 - PPL dedup returns zero results on nested fields (different root cause - IS NOT NULL filter issue)
- [FEATURE] Support nested (object) fields with Calcite #3452 - Support nested (object) fields with Calcite (parent feature request)
- [BUG] Rename does not work with nested fields #2740 - Rename does not work with nested fields (related struct field handling issue)
Metadata
Metadata
Assignees
Labels
Type
Projects
Status