[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields

## Query Information
**PPL Command/Query:**
```ppl
source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description
```

**Expected Result:**
Results should not contain null values for the `description` field when `isnotnull(description)` filter is applied.

**Actual Result:**
```json
{
  "datarows": [
    [3, null],
    [1, "Short description 1"],
    [1, "Short description 2"],
    [1, "Short description 3"]
  ]
}
```
The first row contains `null` for description despite the `isnotnull(description)` filter.

## Dataset Information
**Dataset/Schema Type**
- [ ] OpenTelemetry (OTEL)
- [ ] Simple Schema for Observability (SS4O)
- [ ] Open Cybersecurity Schema Framework (OCSF)
- [x] Custom (details below)

**Index Mapping**
```json
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "value": {
        "type": "long"
      }
    }
  }
}
```

**Sample Data**
```json
{"description": "Short description 1", "value": 1}
{"description": "This is a very long description that definitely exceeds the 50 character limit set in ignore_above", "value": 100}
{"description": "Another extremely long description that will be ignored by the keyword subfield due to exceeding limit", "value": 101}
{"description": "", "value": 200}
```

## Bug Description

### Issue Summary
PPL query with `where isnotnull(field)` filter incorrectly returns null values in aggregation results when the field is a `text` type with a `keyword` subfield that has an `ignore_above` limit, and documents contain values exceeding that limit.

### Preconditions for Bug to Occur

The bug requires **ALL** of the following conditions:

1. **Field Mapping Structure**:
   - Field must be `text` type with a `keyword` subfield
   - The `keyword` subfield must have `ignore_above` setting
   ```json
   {
     "field_name": {
       "type": "text",
       "fields": {
         "keyword": {
           "type": "keyword",
           "ignore_above": N
         }
       }
     }
   }
   ```

2. **Data Condition**:
   - Documents must exist where the field value **exceeds** the `ignore_above` character limit
   - For these documents:
     - The `text` field is populated (has a value)
     - The `keyword` subfield is null (ignored due to length)

3. **Query Pattern**:
   - PPL query with `isnotnull()` filter on the field
   - Aggregation (`stats ... by`) on the same field
   ```ppl
   source=index | where isnotnull(field) | stats count() by field
   ```

### When Bug Will NOT Occur

The bug will **NOT** reproduce if:

1. **Simple keyword field** (no text parent):
   ```json
   {"description": {"type": "keyword"}}
   ```
   ❌ No bug - filter and aggregation use same field

2. **No ignore_above limit**:
   ```json
   {
     "description": {
       "type": "text",
       "fields": {"keyword": {"type": "keyword"}}
     }
   }
   ```
   ❌ No bug - keyword subfield always populated

3. **All values within limit**:
   - All descriptions < 50 characters (if `ignore_above: 50`)
   ❌ No bug - keyword subfield populated for all docs

4. **Truly null/missing fields**:
   ```json
   {"description": null}
   {"value": 100}
   ```
   ❌ No bug - both text and keyword fields are null

### Steps to Reproduce

```bash
# 1. Create index with text field + keyword subfield with ignore_above
PUT /test-isnotnull-bug
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {"type": "keyword", "ignore_above": 50}
        }
      }
    }
  }
}

# 2. Insert data with descriptions exceeding 50 characters
POST /test-isnotnull-bug/_bulk
{"index":{}}
{"description":"Short description 1","value":1}
{"index":{}}
{"description":"This is a very long description that definitely exceeds the 50 character limit","value":100}

# 3. Execute PPL query
POST /_plugins/_ppl
{
  "query": "source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description"
}

# 4. Observe null values in results despite isnotnull() filter
```

### Root Cause Analysis

**Explain Plan Output:**
```
Physical Plan (OpenSearch DSL):
{
  "query": {
    "bool": {
      "must": [{"exists": {"field": "description"}}],
      "must_not": [{"term": {"description.keyword": {"value": ""}}}]
    }
  },
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "sources": [{
          "description": {
            "terms": {
              "field": "description.keyword",
              "missing_bucket": true,
              "missing_order": "first"
            }
          }
        }]
      }
    }
  }
}
```

**The Problem:**

1. **Filter operates on**: `description` (text field)
   - `isnotnull(description)` → `{"exists": {"field": "description"}}`
   - Documents with long descriptions **PASS** this filter (text field exists)

2. **Aggregation operates on**: `description.keyword` (keyword subfield)
   - Aggregates by `"field": "description.keyword"`
   - For documents with descriptions > 50 chars, `description.keyword` is **NULL** due to `ignore_above: 50`

3. **The Mismatch**:
   - 3 documents have `description` text field populated → filter passes ✓
   - But their `description.keyword` is null → aggregation sees null ✗
   - With `missing_bucket: true`, these appear as `[3, null]` in results

**Verification:**
```bash
# Confirm: 3 documents have description but no description.keyword
POST /test-isnotnull-bug/_count
{
  "query": {
    "bool": {
      "must": [{"exists": {"field": "description"}}],
      "must_not": [{"exists": {"field": "description.keyword"}}]
    }
  }
}
# Returns: {"count": 3}
```

### Tentative Root Cause

In `/opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java` (lines 207-215), the `bucketNullable` flag defaults to `true`:

```java
boolean bucketNullable =
    Boolean.parseBoolean(
        aggregate.getHints().stream()
            .filter(hits -> hits.hintName.equals("stats_args"))
            .map(hint -> hint.kvOptions.getOrDefault(Argument.BUCKET_NULLABLE, "true"))
            .findFirst()
            .orElseGet(() -> "true"));
```

This flag controls `missing_bucket` in `/opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/aggregation/dsl/CompositeAggregationBuilder.java` (lines 72-74):

```java
if (bucketNullable) {
  sourceBuilder.missingBucket(true).missingOrder(missingOrder);
}
```

**The system does not analyze WHERE clause filters** (specifically `isnotnull()`) to automatically set `bucketNullable=false` for fields with non-null constraints.

### Tentative Proposed Fix

Enhance the aggregation analyzer to detect `isnotnull()` filters on fields being aggregated:

1. In `AggregateAnalyzer.analyze()`, extract fields with `isnotnull()` predicates from the Filter node
2. Pass this information to `createTermsSourceBuilder()`
3. Set `missingBucket(false)` when the aggregation field has an `isnotnull()` filter

**Suggested Code Change:**
```java
// Extract non-nullable fields from filter predicates
private static Set<String> extractNonNullableFields(Filter filterNode) {
  // Parse filter conditions to find isnotnull() predicates
  // Return set of field names that should not have missing buckets
}

// In createTermsSourceBuilder()
private static CompositeValuesSourceBuilder<?> createTermsSourceBuilder(
    String bucketName, RexNode group, AggregateBuilderHelper helper, 
    Set<String> nonNullableFields) {
  TermsValuesSourceBuilder termsBuilder = 
      new TermsValuesSourceBuilder(bucketName).order(SortOrder.ASC);
  
  String fieldName = extractFieldName(group);
  boolean shouldIncludeMissing = helper.bucketNullable 
      && !nonNullableFields.contains(fieldName);
  
  if (shouldIncludeMissing) {
    termsBuilder.missingBucket(true).missingOrder(MissingOrder.FIRST);
  }
  
  return helper.build(group, termsBuilder);
}
```

### Workaround

1. **Increase `ignore_above` limit** to accommodate all expected values
2. **Use keyword field directly** if text analysis is not needed
3. **Post-filter null values** in application code
4. **Remove `ignore_above`** if all values should be indexed

## Environment Information

**OpenSearch Version:**
OpenSearch 3.3.0-SNAPSHOT

**Additional Details:**
- Issue confirmed reproducible with minimal test case
- Related to PR #4327 which addressed similar issues for date/time span aggregations
- The logical plan does not show the `isnotnull()` condition explicitly, only `<>($2, '')`, but the physical plan correctly includes the `exists` filter

## Impact

This bug violates the semantic meaning of `isnotnull()` filter in aggregation queries, leading to incorrect query results. It particularly affects:

- **OpenTelemetry metrics** with long descriptions (common scenario)
- **Any text fields** with keyword subfields using `ignore_above`
- **Data analysis and reporting** where null exclusion is critical
- **Dashboard visualizations** that rely on accurate null filtering

The bug is subtle because the filter appears to work (exists check passes) but the aggregation produces incorrect results due to the field type mismatch between text and keyword subfields.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields #4463

Query Information

Dataset Information

Bug Description

Issue Summary

Preconditions for Bug to Occur

When Bug Will NOT Occur

Steps to Reproduce

Root Cause Analysis

Tentative Root Cause

Tentative Proposed Fix

Workaround

Environment Information

Impact

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields #4463

Description

Query Information

Dataset Information

Bug Description

Issue Summary

Preconditions for Bug to Occur

When Bug Will NOT Occur

Steps to Reproduce

Root Cause Analysis

Tentative Root Cause

Tentative Proposed Fix

Workaround

Environment Information

Impact

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions