Skip to content

[FEATURE] Span() / bin should support auto_date_histogram aggregation #4210

@noCharger

Description

@noCharger

Is your feature request related to a problem?

Currently, PPL's span() function with fixed intervals doesn't provide the same flexibility as OpenSearch's native auto_date_histogram aggregation. This leads to:

  1. Suboptimal bucket distribution for time series analysis (as 1000 fixed)
  2. Need for manual tuning of time intervals
  3. Potential for too many or too few buckets
  4. Inconsistent results between PPL and DSL queries

For example, this PPL query uses a fixed 1-hour interval:

source = custom-big5 
| eval range_bucket = case(
    `metrics.size` < -10, 'range_1', 
    `metrics.size` >= -10 and `metrics.size` < 10, 'range_2',
    `metrics.size` >= 10 and `metrics.size` < 100, 'range_3',
    `metrics.size` >= 100 and `metrics.size` < 1000, 'range_4',
    `metrics.size` >= 1000 and `metrics.size` < 2000, 'range_5',
    `metrics.size` >= 2000, 'range_6'
) 
| stats min(`metrics.tmin`) as tmin, avg(`metrics.size`) as tavg, max(`metrics.size`) as tmax 
    by range_bucket, span(`@timestamp`, 1h) as auto_span

While the equivalent DSL query uses auto_date_histogram for optimal bucket distribution:

{
  "aggs": {
    "tmax": {
      "range": {
        "field": "metrics.size",
        "ranges": [...]
      },
      "aggs": {
        "date": {
          "auto_date_histogram": {
            "field": "@timestamp",
            "buckets": 20
          }
        }
      }
    }
  }
}

What solution would you like?

Implement auto_date_histogram support in PPL with:

New function syntax for auto-bucketing:

source = custom-big5 
| stats count() by auto_span(`@timestamp`, buckets=20)

Key features:

  • Dynamic interval selection based on data range
  • Target bucket count specification
  • Automatic interval rounding to human-readable values
  • Consistent bucket count regardless of time range

What alternatives have you considered?

Enhanced span() function with auto mode:

span(`@timestamp`, 'auto', target_buckets=20)

Hybrid approach with min/max intervals:

adaptive_span(`@timestamp`, buckets=20, min_interval='1m', max_interval='1d')

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagecalcitecalcite migration releatedenhancementNew feature or requestpushdownpushdown related issuesv3.3.0

Type

No type

Projects

Status

New

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions