Skip to content

[BUG] Aggregation composite bucket size 10 cuts off time results for timechart #4009

@selsong

Description

@selsong

Query Information

PPL Command/Query:

source=events
| timechart span=1m avg(cpu_usage)

Expected Result:
Timechart query returns all composite buckets up to the configured aggregation limit (default 1000), so full data is retrieved across time intervals. There are 20 timestamps in the events.json dataset, so it should return all 20 timestamp buckets.

Actual Result:

curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
"query": "source=events | timechart span=1m avg(response_time)"
}'
{
  "schema": [
    {
      "name": "$f2",
      "type": "timestamp"
    },
    {
      "name": "$f1",
      "type": "double"
    }
  ],
  "datarows": [
    [
      "2024-07-01 00:00:00",
      120.0
    ],
    [
      "2024-07-01 00:01:00",
      85.0
    ],
    [
      "2024-07-01 00:02:00",
      200.0
    ],
    [
      "2024-07-01 00:03:00",
      150.0
    ],
    [
      "2024-07-01 00:04:00",
      95.0
    ],
    [
      "2024-07-01 00:05:00",
      75.0
    ],
    [
      "2024-07-01 00:06:00",
      180.0
    ],
    [
      "2024-07-01 00:07:00",
      110.0
    ],
    [
      "2024-07-01 00:08:00",
      160.0
    ],
    [
      "2024-07-01 00:09:00",
      90.0
    ]
  ],
  "total": 10,
  "size": 10
}%       
{
  "calcite": {
    "logical": "LogicalSystemLimit(sort0=[$0], dir0=[ASC], fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  LogicalSort(sort0=[$0], dir0=[ASC])\n    LogicalAggregate(group=[{1}], agg#0=[AVG($0)])\n      LogicalProject(response_time=[$6], $f2=[SPAN($1, 1, 'm')])\n        CalciteLogicalIndexScan(table=[[OpenSearch, events]])\n",
    "physical": "EnumerableLimit(fetch=[10000])\n  CalciteEnumerableIndexScan(table=[[OpenSearch, events]], PushDownContext=[[AGGREGATION->rel#203:LogicalAggregate.NONE.[](input=RelSubset#202,group={1},agg#0=AVG($0)), SORT->[0]], 
OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":10,\"sources\":[{\"$f2\":{\"date_histogram\":{\"field\":\"@timestamp\",\"missing_bucket\":true,\"missing_order\":\"last\",\"order\":\"asc\",\"fixed_interval\":\"1m\"}}}]},\"aggregations\":{\"$f1\":{\"avg\":{\"field\":\"response_time\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
  }
}%   

The query returns the first 10 timestamp buckets in the events.json dataset.
Without explicitly setting a composite aggregation size, OpenSearch defaults to size: 10 for buckets, which truncates results.

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "host": { "type": "keyword" },
      "service": { "type": "keyword" },
      "response_time": { "type": "integer" },
      "status_code": { "type": "integer" },
      "bytes_sent": { "type": "integer" },
      "cpu_usage": { "type": "float" },
      "memory_usage": { "type": "float" },
      "region": { "type": "keyword" },
      "environment": { "type": "keyword" }
    }
  }
}

Sample Data

There are only 3 lines provided here as an example but the number of timestamps in the dataset covers 20 seconds.

{ "index": {} }
{ "@timestamp": "2024-07-01T00:00:00Z", "host": "web-01", "service": "nginx", "response_time": 120, "status_code": 200, "bytes_sent": 1024, "cpu_usage": 45.2, "memory_usage": 67.8, "region": "us-east", "environment": "prod" }
{ "index": {} }
{ "@timestamp": "2024-07-01T00:01:00Z", "host": "web-02", "service": "nginx", "response_time": 85, "status_code": 200, "bytes_sent": 2048, "cpu_usage": 38.7, "memory_usage": 72.1, "region": "us-west", "environment": "prod" }
{ "index": {} }
{ "@timestamp": "2024-07-01T00:02:00Z", "host": "web-01", "service": "nginx", "response_time": 200, "status_code": 500, "bytes_sent": 512, "cpu_usage": 55.3, "memory_usage": 81.4, "region": "us-east", "environment": "prod" }
{ "index": {} }
{ "@timestamp": "2024-07-01T00:03:00Z", "host": "web-03", "service": "apache", "response_time": 150, "status_code": 200, "bytes_sent": 4096, "cpu_usage": 42.1, "memory_usage": 58.9, "region": "eu-west", "environment": "prod" }
...

Bug Description

Issue Summary:
When executing an aggregate query without a by clause, the composite aggregation size defaults to 10 in the pushdown phase, even though AGGREGATION_BUCKET_SIZE is set to 1000.

This happens because OpenSearchAggregateIndexScanRule pushes down an aggregation without overriding the composite.size in AbstractCalciteIndexScan.pushDownSortIntoAggBucket.

After adding:
.size(org.opensearch.sql.opensearch.storage.script.aggregation.AggregationQueryBuilder.AGGREGATION_BUCKET_SIZE)
in pushDownSortIntoAggBucket inside AbstractCalciteIndexScan.java, size now defaults to 1000 and all 20 timestamps in my query are returned correctly.

However, this change causes CalciteExplainIT.testExplain to fail because the expected EXPLAIN JSON still contains "size": 10.

Steps to Reproduce:

  1. Use feat/timechart_cmd branch from the linked PR.
  2. Index the provided dataset into an OpenSearch cluster.
  3. Run the query.

Impact:
In my timechart command, this causes missing time buckets for certain queries, producing incomplete visualizations.
Timechart Draft PR #3993
Timechart RFC #3965

Environment Information

OpenSearch Version:
Opensearch 3.2.0

Additional Details:
Change affects planner pushdown rules, specifically OpenSearchAggregateIndexScanRule and AbstractCalciteIndexScan.
Question: Should the integration test be updated to expect 1000 instead of 10, or should we keep 10 and handle the timestamps returned for timechart differently?

Screenshot:
This is the line of code I changed to increase composite bucket size to 1000. To reproduce the limit of 10, revert the code change and the query results will be cut off at 10.
Image

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't workingpushdownpushdown related issues

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions