Skip to content

[BUG] PPL dedup command returns zero results on nested field paths #4487

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

source=jaeger-span-2024-05-08 | dedup references.refType

Expected Result:
The query should return deduplicated results based on the references.refType field values (e.g., "CHILD_OF", "FOLLOWS_FROM", and null), similar to how source=jaeger-span-2024-05-08 | fields references.refType successfully returns 4226 results with these values.

Actual Result:
The query returns 0 results:

{
  "schema": [...],
  "datarows": [],
  "total": 0,
  "size": 0
}

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "references": {
        "type": "nested",
        "dynamic": "false",
        "properties": {
          "refType": {
            "type": "keyword",
            "ignore_above": 256
          },
          "spanID": {
            "type": "keyword",
            "ignore_above": 256
          },
          "traceID": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Sample Data

{
  "traceID": "3c61c828d40ae3c8b553f0ba2d185898",
  "spanID": "1dc84d1a2920a6bf",
  "operationName": "oteldemo.ProductCatalogService/GetProduct",
  "references": [
    {
      "refType": "CHILD_OF",
      "traceID": "3c61c828d40ae3c8b553f0ba2d185898",
      "spanID": "047eded12fcda381"
    }
  ]
}

Bug Description

Issue Summary:
The dedup command returns zero results when applied to fields within nested objects, even though the same field can be successfully queried using the fields command.

Steps to Reproduce:

  1. Create an index with a nested field mapping (e.g., Jaeger span data with nested references field)
  2. Insert documents containing nested objects with the target field populated
  3. Run source=<index> | fields <nested_field> - this works correctly
  4. Run source=<index> | dedup <nested_field> - this returns 0 results

Minimal Reproduction:

# Create test index
curl -X PUT "localhost:9200/test-nested-dedup" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "items": {
        "type": "nested",
        "properties": {
          "name": {"type": "keyword"}
        }
      }
    }
  }
}'

# Insert test data
curl -X POST "localhost:9200/test-nested-dedup/_bulk" -H 'Content-Type: application/json' -d '
{"index":{"_id":"1"}}
{"items":[{"name":"apple"}]}
{"index":{"_id":"2"}}
{"items":[{"name":"banana"}]}
'

# Refresh index
curl -X POST "localhost:9200/test-nested-dedup/_refresh"

# This works - returns 2 results
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-nested-dedup | fields items.name"
}'

# This fails - returns 0 results
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-nested-dedup | dedup items.name"
}'

Impact:
This bug prevents users from deduplicating results based on fields within nested objects.

Environment Information

OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT

Additional Details:
The issue occurs with the Calcite-based query engine (V3).

Technical Analysis

Tentative Root Cause:

Note: This is a preliminary analysis and requires further investigation.

The dedup command internally generates an IS NOT NULL filter on the dedup field to exclude null values before applying the deduplication logic. The query execution plan shows:

LogicalFilter(condition=[IS NOT NULL($42)])

This filter is then converted to an OpenSearch exists query:

{
  "query": {
    "exists": {
      "field": "references.refType"
    }
  }
}

However, for nested fields in OpenSearch, an exists query on a nested field path (e.g., references.refType) must be wrapped in a nested query to function correctly:

{
  "query": {
    "nested": {
      "path": "references",
      "query": {
        "exists": {
          "field": "references.refType"
        }
      }
    }
  }
}

Verification:

  • Direct exists query on references.refType: 0 results
  • exists query wrapped in nested query: 2565 results (correct)

Code Location:
The issue appears to be in /opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java around lines 577-586:

// OpenSearch DSL does not handle IS_NULL / IS_NOT_NULL on nested fields correctly
checkForNestedFieldOperands(call);

Expression a = call.getOperands().get(0).accept(this);
QueryExpression operand = QueryExpression.create((TerminalExpression) a);
return call.getKind() == SqlKind.IS_NOT_NULL ? operand.exists() : operand.notExists();

The code has a comment acknowledging the limitation and includes a check (checkForNestedFieldOperands) that throws an exception for nested fields. However, this check only detects when the operand type is ArraySqlType (the nested array itself), not when accessing a field within a nested object (e.g., references.refType).

Tentative Proposed Fix:

Note: This is a preliminary analysis and requires further investigation.

The fix would require:

  1. Enhanced Detection: Modify the nested field detection logic to identify field paths that traverse nested objects (e.g., references.refType where references is nested).

  2. Nested Query Wrapping: When generating an exists query for a field within a nested object, wrap it in a nested query with the appropriate path.

  3. Implementation Approach:

    • Extract the nested path from the field name (e.g., references from references.refType)
    • Check if this path corresponds to a nested field in the index mapping
    • If yes, generate a nested query wrapper around the exists query

Example Fix Location:
In PredicateAnalyzer.java, the postfix() method would need to:

  • Detect if the field is within a nested object
  • Generate an appropriate nested query DSL instead of a simple exists query

Note: This analysis is preliminary. A complete fix would require further investigation!

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingpushdownpushdown related issues

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions