Skip to content

[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields #4463

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description

Expected Result:
Results should not contain null values for the description field when isnotnull(description) filter is applied.

Actual Result:

{
  "datarows": [
    [3, null],
    [1, "Short description 1"],
    [1, "Short description 2"],
    [1, "Short description 3"]
  ]
}

The first row contains null for description despite the isnotnull(description) filter.

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "value": {
        "type": "long"
      }
    }
  }
}

Sample Data

{"description": "Short description 1", "value": 1}
{"description": "This is a very long description that definitely exceeds the 50 character limit set in ignore_above", "value": 100}
{"description": "Another extremely long description that will be ignored by the keyword subfield due to exceeding limit", "value": 101}
{"description": "", "value": 200}

Bug Description

Issue Summary

PPL query with where isnotnull(field) filter incorrectly returns null values in aggregation results when the field is a text type with a keyword subfield that has an ignore_above limit, and documents contain values exceeding that limit.

Preconditions for Bug to Occur

The bug requires ALL of the following conditions:

  1. Field Mapping Structure:

    • Field must be text type with a keyword subfield
    • The keyword subfield must have ignore_above setting
    {
      "field_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": N
          }
        }
      }
    }
  2. Data Condition:

    • Documents must exist where the field value exceeds the ignore_above character limit
    • For these documents:
      • The text field is populated (has a value)
      • The keyword subfield is null (ignored due to length)
  3. Query Pattern:

    • PPL query with isnotnull() filter on the field
    • Aggregation (stats ... by) on the same field
    source=index | where isnotnull(field) | stats count() by field
    

When Bug Will NOT Occur

The bug will NOT reproduce if:

  1. Simple keyword field (no text parent):

    {"description": {"type": "keyword"}}

    ❌ No bug - filter and aggregation use same field

  2. No ignore_above limit:

    {
      "description": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}}
      }
    }

    ❌ No bug - keyword subfield always populated

  3. All values within limit:

    • All descriptions < 50 characters (if ignore_above: 50)
      ❌ No bug - keyword subfield populated for all docs
  4. Truly null/missing fields:

    {"description": null}
    {"value": 100}

    ❌ No bug - both text and keyword fields are null

Steps to Reproduce

# 1. Create index with text field + keyword subfield with ignore_above
PUT /test-isnotnull-bug
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {"type": "keyword", "ignore_above": 50}
        }
      }
    }
  }
}

# 2. Insert data with descriptions exceeding 50 characters
POST /test-isnotnull-bug/_bulk
{"index":{}}
{"description":"Short description 1","value":1}
{"index":{}}
{"description":"This is a very long description that definitely exceeds the 50 character limit","value":100}

# 3. Execute PPL query
POST /_plugins/_ppl
{
  "query": "source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description"
}

# 4. Observe null values in results despite isnotnull() filter

Root Cause Analysis

Explain Plan Output:

Physical Plan (OpenSearch DSL):
{
  "query": {
    "bool": {
      "must": [{"exists": {"field": "description"}}],
      "must_not": [{"term": {"description.keyword": {"value": ""}}}]
    }
  },
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "sources": [{
          "description": {
            "terms": {
              "field": "description.keyword",
              "missing_bucket": true,
              "missing_order": "first"
            }
          }
        }]
      }
    }
  }
}

The Problem:

  1. Filter operates on: description (text field)

    • isnotnull(description){"exists": {"field": "description"}}
    • Documents with long descriptions PASS this filter (text field exists)
  2. Aggregation operates on: description.keyword (keyword subfield)

    • Aggregates by "field": "description.keyword"
    • For documents with descriptions > 50 chars, description.keyword is NULL due to ignore_above: 50
  3. The Mismatch:

    • 3 documents have description text field populated → filter passes ✓
    • But their description.keyword is null → aggregation sees null ✗
    • With missing_bucket: true, these appear as [3, null] in results

Verification:

# Confirm: 3 documents have description but no description.keyword
POST /test-isnotnull-bug/_count
{
  "query": {
    "bool": {
      "must": [{"exists": {"field": "description"}}],
      "must_not": [{"exists": {"field": "description.keyword"}}]
    }
  }
}
# Returns: {"count": 3}

Tentative Root Cause

In /opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java (lines 207-215), the bucketNullable flag defaults to true:

boolean bucketNullable =
    Boolean.parseBoolean(
        aggregate.getHints().stream()
            .filter(hits -> hits.hintName.equals("stats_args"))
            .map(hint -> hint.kvOptions.getOrDefault(Argument.BUCKET_NULLABLE, "true"))
            .findFirst()
            .orElseGet(() -> "true"));

This flag controls missing_bucket in /opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/aggregation/dsl/CompositeAggregationBuilder.java (lines 72-74):

if (bucketNullable) {
  sourceBuilder.missingBucket(true).missingOrder(missingOrder);
}

The system does not analyze WHERE clause filters (specifically isnotnull()) to automatically set bucketNullable=false for fields with non-null constraints.

Tentative Proposed Fix

Enhance the aggregation analyzer to detect isnotnull() filters on fields being aggregated:

  1. In AggregateAnalyzer.analyze(), extract fields with isnotnull() predicates from the Filter node
  2. Pass this information to createTermsSourceBuilder()
  3. Set missingBucket(false) when the aggregation field has an isnotnull() filter

Suggested Code Change:

// Extract non-nullable fields from filter predicates
private static Set<String> extractNonNullableFields(Filter filterNode) {
  // Parse filter conditions to find isnotnull() predicates
  // Return set of field names that should not have missing buckets
}

// In createTermsSourceBuilder()
private static CompositeValuesSourceBuilder<?> createTermsSourceBuilder(
    String bucketName, RexNode group, AggregateBuilderHelper helper, 
    Set<String> nonNullableFields) {
  TermsValuesSourceBuilder termsBuilder = 
      new TermsValuesSourceBuilder(bucketName).order(SortOrder.ASC);
  
  String fieldName = extractFieldName(group);
  boolean shouldIncludeMissing = helper.bucketNullable 
      && !nonNullableFields.contains(fieldName);
  
  if (shouldIncludeMissing) {
    termsBuilder.missingBucket(true).missingOrder(MissingOrder.FIRST);
  }
  
  return helper.build(group, termsBuilder);
}

Workaround

  1. Increase ignore_above limit to accommodate all expected values
  2. Use keyword field directly if text analysis is not needed
  3. Post-filter null values in application code
  4. Remove ignore_above if all values should be indexed

Environment Information

OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT

Additional Details:

Impact

This bug violates the semantic meaning of isnotnull() filter in aggregation queries, leading to incorrect query results. It particularly affects:

  • OpenTelemetry metrics with long descriptions (common scenario)
  • Any text fields with keyword subfields using ignore_above
  • Data analysis and reporting where null exclusion is critical
  • Dashboard visualizations that rely on accurate null filtering

The bug is subtle because the filter appears to work (exists check passes) but the aggregation produces incorrect results due to the field type mismatch between text and keyword subfields.

Metadata

Metadata

Assignees

Labels

aggregationbugSomething isn't workingpushdownpushdown related issues

Type

No type

Projects

Status

Won't do

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions