-
Notifications
You must be signed in to change notification settings - Fork 176
Description
Query Information
PPL Command/Query:
source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description
Expected Result:
Results should not contain null values for the description
field when isnotnull(description)
filter is applied.
Actual Result:
{
"datarows": [
[3, null],
[1, "Short description 1"],
[1, "Short description 2"],
[1, "Short description 3"]
]
}
The first row contains null
for description despite the isnotnull(description)
filter.
Dataset Information
Dataset/Schema Type
- OpenTelemetry (OTEL)
- Simple Schema for Observability (SS4O)
- Open Cybersecurity Schema Framework (OCSF)
- Custom (details below)
Index Mapping
{
"mappings": {
"properties": {
"description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 50
}
}
},
"value": {
"type": "long"
}
}
}
}
Sample Data
{"description": "Short description 1", "value": 1}
{"description": "This is a very long description that definitely exceeds the 50 character limit set in ignore_above", "value": 100}
{"description": "Another extremely long description that will be ignored by the keyword subfield due to exceeding limit", "value": 101}
{"description": "", "value": 200}
Bug Description
Issue Summary
PPL query with where isnotnull(field)
filter incorrectly returns null values in aggregation results when the field is a text
type with a keyword
subfield that has an ignore_above
limit, and documents contain values exceeding that limit.
Preconditions for Bug to Occur
The bug requires ALL of the following conditions:
-
Field Mapping Structure:
- Field must be
text
type with akeyword
subfield - The
keyword
subfield must haveignore_above
setting
{ "field_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": N } } } }
- Field must be
-
Data Condition:
- Documents must exist where the field value exceeds the
ignore_above
character limit - For these documents:
- The
text
field is populated (has a value) - The
keyword
subfield is null (ignored due to length)
- The
- Documents must exist where the field value exceeds the
-
Query Pattern:
- PPL query with
isnotnull()
filter on the field - Aggregation (
stats ... by
) on the same field
source=index | where isnotnull(field) | stats count() by field
- PPL query with
When Bug Will NOT Occur
The bug will NOT reproduce if:
-
Simple keyword field (no text parent):
{"description": {"type": "keyword"}}
❌ No bug - filter and aggregation use same field
-
No ignore_above limit:
{ "description": { "type": "text", "fields": {"keyword": {"type": "keyword"}} } }
❌ No bug - keyword subfield always populated
-
All values within limit:
- All descriptions < 50 characters (if
ignore_above: 50
)
❌ No bug - keyword subfield populated for all docs
- All descriptions < 50 characters (if
-
Truly null/missing fields:
{"description": null} {"value": 100}
❌ No bug - both text and keyword fields are null
Steps to Reproduce
# 1. Create index with text field + keyword subfield with ignore_above
PUT /test-isnotnull-bug
{
"mappings": {
"properties": {
"description": {
"type": "text",
"fields": {
"keyword": {"type": "keyword", "ignore_above": 50}
}
}
}
}
}
# 2. Insert data with descriptions exceeding 50 characters
POST /test-isnotnull-bug/_bulk
{"index":{}}
{"description":"Short description 1","value":1}
{"index":{}}
{"description":"This is a very long description that definitely exceeds the 50 character limit","value":100}
# 3. Execute PPL query
POST /_plugins/_ppl
{
"query": "source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description"
}
# 4. Observe null values in results despite isnotnull() filter
Root Cause Analysis
Explain Plan Output:
Physical Plan (OpenSearch DSL):
{
"query": {
"bool": {
"must": [{"exists": {"field": "description"}}],
"must_not": [{"term": {"description.keyword": {"value": ""}}}]
}
},
"aggregations": {
"composite_buckets": {
"composite": {
"sources": [{
"description": {
"terms": {
"field": "description.keyword",
"missing_bucket": true,
"missing_order": "first"
}
}
}]
}
}
}
}
The Problem:
-
Filter operates on:
description
(text field)isnotnull(description)
→{"exists": {"field": "description"}}
- Documents with long descriptions PASS this filter (text field exists)
-
Aggregation operates on:
description.keyword
(keyword subfield)- Aggregates by
"field": "description.keyword"
- For documents with descriptions > 50 chars,
description.keyword
is NULL due toignore_above: 50
- Aggregates by
-
The Mismatch:
- 3 documents have
description
text field populated → filter passes ✓ - But their
description.keyword
is null → aggregation sees null ✗ - With
missing_bucket: true
, these appear as[3, null]
in results
- 3 documents have
Verification:
# Confirm: 3 documents have description but no description.keyword
POST /test-isnotnull-bug/_count
{
"query": {
"bool": {
"must": [{"exists": {"field": "description"}}],
"must_not": [{"exists": {"field": "description.keyword"}}]
}
}
}
# Returns: {"count": 3}
Tentative Root Cause
In /opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
(lines 207-215), the bucketNullable
flag defaults to true
:
boolean bucketNullable =
Boolean.parseBoolean(
aggregate.getHints().stream()
.filter(hits -> hits.hintName.equals("stats_args"))
.map(hint -> hint.kvOptions.getOrDefault(Argument.BUCKET_NULLABLE, "true"))
.findFirst()
.orElseGet(() -> "true"));
This flag controls missing_bucket
in /opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/aggregation/dsl/CompositeAggregationBuilder.java
(lines 72-74):
if (bucketNullable) {
sourceBuilder.missingBucket(true).missingOrder(missingOrder);
}
The system does not analyze WHERE clause filters (specifically isnotnull()
) to automatically set bucketNullable=false
for fields with non-null constraints.
Tentative Proposed Fix
Enhance the aggregation analyzer to detect isnotnull()
filters on fields being aggregated:
- In
AggregateAnalyzer.analyze()
, extract fields withisnotnull()
predicates from the Filter node - Pass this information to
createTermsSourceBuilder()
- Set
missingBucket(false)
when the aggregation field has anisnotnull()
filter
Suggested Code Change:
// Extract non-nullable fields from filter predicates
private static Set<String> extractNonNullableFields(Filter filterNode) {
// Parse filter conditions to find isnotnull() predicates
// Return set of field names that should not have missing buckets
}
// In createTermsSourceBuilder()
private static CompositeValuesSourceBuilder<?> createTermsSourceBuilder(
String bucketName, RexNode group, AggregateBuilderHelper helper,
Set<String> nonNullableFields) {
TermsValuesSourceBuilder termsBuilder =
new TermsValuesSourceBuilder(bucketName).order(SortOrder.ASC);
String fieldName = extractFieldName(group);
boolean shouldIncludeMissing = helper.bucketNullable
&& !nonNullableFields.contains(fieldName);
if (shouldIncludeMissing) {
termsBuilder.missingBucket(true).missingOrder(MissingOrder.FIRST);
}
return helper.build(group, termsBuilder);
}
Workaround
- Increase
ignore_above
limit to accommodate all expected values - Use keyword field directly if text analysis is not needed
- Post-filter null values in application code
- Remove
ignore_above
if all values should be indexed
Environment Information
OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT
Additional Details:
- Issue confirmed reproducible with minimal test case
- Related to PR Date/Time based Span aggregation should always not present null bucket #4327 which addressed similar issues for date/time span aggregations
- The logical plan does not show the
isnotnull()
condition explicitly, only<>($2, '')
, but the physical plan correctly includes theexists
filter
Impact
This bug violates the semantic meaning of isnotnull()
filter in aggregation queries, leading to incorrect query results. It particularly affects:
- OpenTelemetry metrics with long descriptions (common scenario)
- Any text fields with keyword subfields using
ignore_above
- Data analysis and reporting where null exclusion is critical
- Dashboard visualizations that rely on accurate null filtering
The bug is subtle because the filter appears to work (exists check passes) but the aggregation produces incorrect results due to the field type mismatch between text and keyword subfields.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status