Skip to content

[BUG] PPL dedup command fails on struct/object field paths #4619

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

# This succeeds
source=ocsf-1.1.0-4003 | dedup accountid, activity_name

# This fails with an error
source=ocsf-1.1.0-4003 | dedup accountid, src_endpoint.ip

Expected Result:
The query should return deduplicated results based on both accountid and src_endpoint.ip fields, similar to how the first query successfully deduplicates on flat fields.

Actual Result:
The query fails with an internal server error:

{
  "error": {
    "reason": "There was internal problem at backend",
    "details": "java.sql.SQLException: Error while preparing plan ...",
    "type": "RuntimeException"
  },
  "status": 500
}

Root Cause (from logs):

Caused by: org.codehaus.commons.compiler.CompileException: Line 3, Column 26: ';' expected instead of '.'

Dataset Information

Dataset/Schema Type

  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (minimal reproduction below)

Index Mapping

{
  "mappings": {
    "properties": {
      "account_id": {
        "type": "keyword"
      },
      "endpoint": {
        "properties": {
          "ip": {
            "type": "ip"
          },
          "port": {
            "type": "integer"
          }
        }
      }
    }
  }
}

Sample Data

{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}

Bug Description

Issue Summary:
The dedup command fails with a Java error when deduplicating on struct/object field paths (e.g., endpoint.ip). The error occurs during physical plan generation when Calcite attempts to compile Java code for the window function used internally by dedup.

Steps to Reproduce:

# 1. Create test index with struct field
curl -X PUT "http://localhost:9200/test-dedup-struct" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "account_id": {"type": "keyword"},
      "endpoint": {
        "properties": {
          "ip": {"type": "ip"},
          "port": {"type": "integer"}
        }
      }
    }
  }
}'

# 2. Insert test data
curl -X POST "http://localhost:9200/test-dedup-struct/_bulk" -H 'Content-Type: application/json' -d '
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"index":{}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}
'

curl -X POST "http://localhost:9200/test-dedup-struct/_refresh"

# 3. Test dedup with flat field (works)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-dedup-struct | dedup account_id"
}'
# Returns 2 results

# 4. Test dedup with struct field (fails)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-dedup-struct | dedup account_id, endpoint.ip"
}'
# Returns compilation error

Impact:
This bug prevents users from deduplicating on any struct/object field paths, which is critical for OCSF, OTEL, and other structured observability data that heavily uses nested structures.

Environment Information

OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (also affects 3.3.0)

Additional Details:

  • Affects Calcite query engine (V3)
  • Issue occurs with all struct field types (ip, keyword, integer, etc.)
  • Not specific to nested type - affects regular object/struct fields

Root Cause Analysis

Tentative Root Cause:

Note: This is a preliminary analysis and requires further investigation.

The dedup command internally uses Apache Calcite's window functions (ROW_NUMBER() OVER PARTITION BY) to implement deduplication logic. When struct field paths like endpoint.ip are used in the PARTITION BY clause, Calcite generates Java code that uses the field name directly as a variable identifier.

Code Flow:

  1. Logical Plan Generation (CalciteRelNodeVisitor.java:1409-1428):

    List<RexNode> dedupeFields =
        node.getFields().stream().map(f -> rexVisitor.analyze(f, context)).toList();
    // ...
    context.relBuilder.aggregateCall(SqlStdOperatorTable.ROW_NUMBER)
        .over()
        .partitionBy(dedupeFields)  // Contains RexNode for "endpoint.ip"
        .orderBy(dedupeFields)
  2. Field Resolution (QualifiedNameResolver.java:250-254):

    • Struct field paths are resolved using INTERNAL_ITEM function
    • Creates field references like $1 (for endpoint.ip)
    • Field names retain dots in metadata
  3. Physical Plan Generation:

    • Calcite generates Java code for window function execution
    • Uses field names directly from row type metadata
    • Generates invalid Java: int endpoint.ip = ...
  4. Compilation Failure:

    • Janino compiler sees the dot as field access operator
    • Expects semicolon, finds dot instead
    • Throws: ';' expected instead of '.'

Verification:

The logical plan shows correct structure:

LogicalProject(...)
  LogicalFilter(condition=[<=($10, 1)])
    LogicalProject(..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $3, $1 ORDER BY $3, $1)])
      LogicalFilter(condition=[AND(IS NOT NULL($3), IS NOT NULL($1))])
        CalciteLogicalIndexScan(table=[[OpenSearch, test-dedup-struct]])

But physical plan generation fails because $1 maps to field name endpoint.ip which is invalid in generated Java code.

Workaround

No workaround available. Users cannot use dedup with struct field paths. The only alternative is to:

Related Issues

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't workingcalcitecalcite migration releated

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions